Server capacity update

Zas and I have been working hard to improve the capacity and stability of the site. In the last week, we’ve identified and fixed at least 3 problems with the search servers, and we’ve added a timeout function that cancels queries that take longer than 3 seconds. We think the main cause of trouble was that queries piled up behind a slow query that ran too long; the servers never recovered from the backlog and consequently crashed.

We won’t go as far as saying that the search servers are fixed — every time we have a smidgen of hope that things are improving, they crash again. Seemingly out of spite! So, the search servers are better. 😉

Zas has also made a number of changes to the gateways and how we rate limit our incoming traffic. The rate limiting is now being done in a smarter way that reduces the overall traffic on our web servers. Well done!
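
For the curious: a common way to do this kind of gateway rate limiting is a token bucket per client. The sketch below is purely illustrative; the class name, limits and logic are invented for the example and are not our actual gateway code:

  import java.util.concurrent.ConcurrentHashMap;

  // Illustrative token-bucket limiter: each client gets a small burst
  // allowance that refills at a steady rate; requests beyond that are
  // rejected before they ever reach the web servers.
  public class TokenBucketLimiter {
      private static final double CAPACITY = 10.0;        // burst size (invented)
      private static final double REFILL_PER_SEC = 5.0;   // sustained rate (invented)

      private static final class Bucket {
          double tokens = CAPACITY;
          long lastRefillNanos = System.nanoTime();
      }

      private final ConcurrentHashMap<String, Bucket> buckets = new ConcurrentHashMap<>();

      /** Returns true if the request from this client should be allowed through. */
      public boolean allow(String clientIp) {
          Bucket b = buckets.computeIfAbsent(clientIp, ip -> new Bucket());
          synchronized (b) {
              long now = System.nanoTime();
              double elapsedSec = (now - b.lastRefillNanos) / 1e9;
              b.tokens = Math.min(CAPACITY, b.tokens + elapsedSec * REFILL_PER_SEC);
              b.lastRefillNanos = now;
              if (b.tokens >= 1.0) {
                  b.tokens -= 1.0;  // consume one token and let the request through
                  return true;
              }
              return false;         // bucket empty: reject (e.g. with HTTP 429)
          }
      }
  }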

We’ve also increased our bandwidth budget by 4 Mbit/s, which makes the site feel considerably more responsive.

Let me put these improvements into numbers: about a week ago we were struggling to keep up with 250 requests per second and the site felt very sluggish. Now we can handle 500 requests per second and the site feels considerably faster. For large chunks of the day we are managing to handle all the traffic we should handle. And the search servers haven’t crashed in 4 days!

We hope that this gives us a solid base from which to release the schema upgrade tomorrow. Once that is complete, we will start work on moving to the new hosting company.

Thanks for being patient with us!

Sophie Goossens joins the MetaBrainz board of directors (and more!)

I’m pleased to announce that Sophie Goossens, an attorney in London, has joined the board of directors of the MetaBrainz Foundation. Sophie specializes in intellectual property law and has ties to the European Commission, which makes her a great addition to our board of directors.

Welcome to our board of directors, Sophie!

Sophie replaces Carol Smith, who decided to move on from the board after leaving her position as the head of Google’s Summer of Code program. Carol joined us in late 2009 and has held the positions of treasurer and secretary since then. Two years after joining us, she became a full director in early 2011.

Thank you for everything you’ve done for MetaBrainz in the past 6+ years, Carol!

Last, but not least, we needed to fill the Secretary/Treasurer slots that were vacated by Carol. Luckily for us, our business development manager Christina Smith stepped up to those duties and was voted onto the board back in February. (Now that all of these changes are complete, we can publicly speak about them.)

Thank you for taking on these two positions, Christina. I’m also quite happy that we’ve preserved the balance of people with the last name Smith in our board. 🙂

Thanks Sophie, Carol and Christina!

Help! Is there a Lucene doctor in the house?

UPDATE: Thanks to user selckin in the #lucene IRC channel for quickly solving this for us! Hopefully we can put this fix into production later today!

As our regular readers may know, we’ve been having lots of trouble with our Lucene-based search servers. Over the past few days we’ve spent a fair amount of time tuning, debugging and otherwise troubleshooting our setup. We’ve identified and fixed a number of problems, but most importantly we feel that we’ve identified the core issue: our servers are simply overloaded.

Under normal conditions we find our servers loaded to about 25% – 35% CPU — things look good and we don’t think we have a capacity problem with our servers. Then a slow query comes in that starts to slow things down. Much like a traffic jam that evolves out of thin air, one slow query can make a giant mess for everyone.

We’ve started timing our queries, and most of the time they can be measured in milliseconds. However, when things get bad, they may take up to 7-8 seconds. Our upstream web servers time out on the search request after about 5 seconds in order to prevent traffic from getting backed up. What we need to do next is to limit the duration that a Lucene query can run, terminating it once it exceeds the timeout.

I’ve started looking at this and quickly realized that this is much more of a job than adding a simple timeout parameter to the search call. We’re currently using this search function from IndexSearcher:

  public TopDocs search(Query query, int n);

Ideally I would like to add a way to time out queries after 3 seconds. So far, I’ve discovered that we could use

  public void search(Query query, Collector results)

with a TimeLimitingCollector. The old call returns TopDocs, and our code assumes that we have a TopDocs object from which to cull our search results. Having stared at the Lucene docs for a while, I haven’t found a way to take the data collected by the TimeLimitingCollector and convert it to TopDocs. It doesn’t make sense to me. 🙁

How does one do this? Sadly, we have no Java programmers on our team, so we’re quite a bit out of our league here. Is there an easier way to do this? Would someone be willing to write this code for us and submit a PR? We’d find some really good chocolate and send it to you if you do!
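
For anyone who wants to dig in, here is a minimal sketch of the shape we think the fix should take, based on our reading of the Lucene 4.x docs: wrap a TopScoreDocCollector (which can hand back a TopDocs) in the TimeLimitingCollector. The helper name and time budget below are our own invention, and this is untested:

  import java.io.IOException;

  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TimeLimitingCollector;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.search.TopScoreDocCollector;

  // Hypothetical helper: run a query, abort collection after timeoutMillis,
  // and return whatever hits were gathered up to that point.
  public static TopDocs searchWithTimeout(IndexSearcher searcher, Query query,
                                          int n, long timeoutMillis) throws IOException {
      // TopScoreDocCollector accumulates the top-n hits and can produce a TopDocs.
      TopScoreDocCollector topCollector = TopScoreDocCollector.create(n, true);
      // TimeLimitingCollector wraps it and throws once the time budget is spent.
      TimeLimitingCollector limited = new TimeLimitingCollector(
              topCollector, TimeLimitingCollector.getGlobalCounter(), timeoutMillis);
      try {
          searcher.search(query, limited);
      } catch (TimeLimitingCollector.TimeExceededException e) {
          // Hits collected before the timeout are still in topCollector, so we
          // can return partial results instead of failing the whole request.
      }
      return topCollector.topDocs();
  }

If this works, the code that expects a TopDocs from the old call wouldn’t need to change at all.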

More info on our project:

We are using Lucene 4.10.4 on a custom codebase that pre-dates SOLR — we have a new SOLR project to replace this one, but it isn’t quite done yet. (Again, not having Java programmers is a bit of a problem for us).

Any tips, explanations or pull requests would be deeply appreciated! Chocolate reward offered!

Thank you!

State of the Onion: MetaBrainz

In the past few weeks we’ve been hit with several traffic increases to MusicBrainz, which are putting considerably more strain on our aging infrastructure than we’re happy with. If it seems that we’re not doing anything about it, that is because we’ve been busy behind the scenes trying to keep things moving forward. This sometimes doesn’t leave us a lot of time to keep the public informed about our work. Hopefully this blog post will fix that in the short term:

In 2011 we started to make plans to move MusicBrainz hosting into the cloud, but then out of the blue we were donated a pile of machines. There were so many machines that I postponed the cloud plans and prepared the donated machines for service. That has carried us for 4+ years with almost no hardware cost, which was really great. The plan was to move to the cloud sometime around 2015, but then I spent most of 2014/2015 dealing with conflicts in the team, putting us seriously behind schedule while our hardware decayed.

On top of that, we’ve recently had some “bad luck”. Some disrespectful commercial customers hit us really hard, and we had to find and block them. We’ve had unexpected traffic spikes, and while we were trying to address them, two more machines failed on us. These were the donated machines that we had kept in reserve for just this moment. The loss of two machines left us short on capacity to handle the increased demands on our servers.

So, now we face the tough question: Do we buy expensive hardware that we might use for 6 months (~$5000) or do we try and save the money and tough it out? I’d rather not spend so much money on such short term use if we can avoid it. We’re going to try and move to a new hosting facility somewhere in the EU, since that is where most of our users are.

Moving to a new hosting facility has an incredible number of dependencies that Christina (our Biz Dev manager), Zas and I have been working through. It may not seem like we have a plan, but we do, and we’re incredibly busy trying to make the plan happen. To give you a taste of what we’re up against:

  1. We want to move our hosting to Europe and have a business presence in Europe in order to reduce the costs and inefficiencies of being a solely US-based business. A lot of our traffic, customers and contractors are in the EU, and it simply makes sense to have a presence here.
  2. To establish a presence in the EU, I needed local help with business matters as well as with researching and establishing an EU organization. So I needed to find a Biz Dev manager, and that person is Christina.
  3. Once Christina was on board, she researched our options to find what was best suited for us. Getting that process moving involved getting certified documents from California, board approval for spending funds to establish the organization, EU labor law research (and we needed to swap a board member, too!), hiring help to establish the org., and generally navigating the Spanish bureaucracy. (See this only slightly exaggerated short film for some clues of our ordeal.)
  4. Once the org. had been established we needed to convince the bank to open a bank account for us. The draconian US banking laws extend worldwide and the local bank had to ensure that they were not opening themselves up to thousands of $$$ in accounting hassles just to allow a tiny non-profit to open a bank account. We finally have a bank account and have started paying our contractors with it!
  5. At the same time we’re also working to set up an office for the growing team here in Barcelona. That requires a byzantine process that has barely started by the time you sign the lease. Getting power, internet and water set up has taken a frustratingly long time. Had I known how long, I would have stayed at my co-working space a while longer while addressing the hosting issues.
  6. While Christina has been focused on the hardcore paperwork, Zas is keeping the site running, which itself requires many heroics. Zas and I have started planning the move to the EU hosting provider. We’ve got a 5-page document that collects some of the open questions and requirements around this process: https://docs.google.com/document/d/16KNm4KksNwz29Opk1aILOMtCmPIeXFuxxUjMoPT3th0/edit#heading=h.dpfvoz1idcro. Right now Zas and Bitmap are here in Barcelona and we’re going to work on establishing a formal plan for moving to the new hosting company. We’re currently comparing hosting company offerings – see what we’ve collected so far if you care to follow along. The amount of work required to make this happen is making my head hurt. (A special shoutout to KodeStar, lead developer of FanArt.tv, for providing a lot of useful feedback about our various options.)
  7. While Christina, Zas and I have our hands full, Bitmap and Gentlecat continue to release new features and work on the schema change. Not to mention all the contributions from Freso and Reosarevok to keep the community happy and polite while we deal with less-than-optimal site conditions. I am really happy with, and proud of, my team for keeping things running in these sub-optimal conditions.

This is just a snapshot of everything happening behind the scenes, all of which will culminate in moving to a new hosting company and being set up in the EU. And mind you, we’re doing this on a minuscule budget, trying to be careful about how we spend our money.

Important: Schema change delayed to May 23

With our ongoing hosting issues caused by massive traffic increases and failing hardware, we’ve been too distracted trying to manage those issues to finish all of the testing for the schema change release that was scheduled for today.

We deeply regret having to do this, but we’re going to delay the schema change release by a week. It is now scheduled for May 23, 2016. This week-long delay will give us a chance to further tweak our server configuration (more on this in the next blog post) and to test the schema change release in much more detail.

We are, however, going to upgrade our database server to Postgres 9.5 either later today or tomorrow. During this upgrade we are going to employ a backup database server and keep MusicBrainz running in read-only mode with a slightly reduced overall capacity (I’m sure everyone knows what that means by now). This upgrade should have no other effects on our downstream data users.

We will give people plenty of notice before we start the Postgres upgrade via our site banner and via our Twitter account (@musicbrainz).

Sorry for the continued drama affecting our services — we’re working hard to keep things together!

Important information about the May 16 schema change release

In the past few weeks we’ve been hit with massive increases in traffic and a couple of hardware failures. Trying to maintain decent service quality in light of both of these events has taken a lot of our team’s time, and we don’t feel 100% confident about the schema change release tomorrow.

Fortunately, the entire team will be together in one place tomorrow. The first thing we’re going to do is review the current state of affairs and decide how to tackle the Postgres upgrade and the release. As soon as we have our plan put together, we will post an updated blog entry with all of the needed details. But, we may very well delay the release by 24 hours.

However, we found that we ran out of time on one feature: MBS-6024 (“Support more than one barcode on same release”). This one ticket will not be included in the upcoming release. We’re really sorry for letting that one issue slip, and we apologize for any inconvenience this may cause you.

6 degrees of Vince Gill

I’m not sure that we’ve talked about this cool project yet, so I’ll catch up on that now. The new site Six Degrees of Vince Gill allows you to enter an artist name and see how many degrees of separation there are between your artist and Vince Gill. This project comes from Universal Music’s Nashville group — I’m happy to see our data get used in interesting ways like this!

Now, if you want to see someone relate to Vince Gill in seven degrees, have a look at how I relate to him. 🙂
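
For the curious: computing degrees of separation like this is a textbook breadth-first search over an artist collaboration graph. The sketch below is purely illustrative (we have no idea how the site itself is implemented):

  import java.util.ArrayDeque;
  import java.util.Collections;
  import java.util.Deque;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Illustrative only: breadth-first search for the shortest collaboration
  // path between two artists. The graph maps each artist to collaborators.
  public class DegreesOfSeparation {
      public static int degrees(Map<String, List<String>> graph, String from, String to) {
          if (from.equals(to)) return 0;
          Map<String, Integer> dist = new HashMap<>();
          Deque<String> queue = new ArrayDeque<>();
          dist.put(from, 0);
          queue.add(from);
          while (!queue.isEmpty()) {
              String artist = queue.remove();
              for (String peer : graph.getOrDefault(artist, Collections.emptyList())) {
                  if (!dist.containsKey(peer)) {
                      dist.put(peer, dist.get(artist) + 1); // one degree further out
                      if (peer.equals(to)) return dist.get(peer);
                      queue.add(peer);
                  }
              }
          }
          return -1; // the two artists are not connected at all
      }
  }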

[Screenshot: my seven-degree path to Vince Gill]

May 2016 schema change release details

In about two months’ time we’ll have the next schema change release: May 16, 2016. Even though we skipped the fall schema change release, this release will have few changes that impact our downstream users. Most of the tickets in this release make minor improvements to database indexes and edit tables. If you are one of the few users of our edit data, then you should delve deeper into the list of tickets in this release. For everyone else, I will summarize the tickets with a greater impact.

In a previous blog post we also talked about upgrading the minimum required version of Postgres. We received no real feedback asking us to stop at 9.4, but we did receive some feedback that some people would prefer 9.5, which is our preference as well. Based on that feedback, we’re going to make PostgreSQL version 9.5 the minimum required version. If you’d like to run a MusicBrainz replicated instance via our Live Data Feed, you will need to run Postgres 9.5!

The official minimum supported Ubuntu release is currently still Ubuntu 10.04 LTS (Lucid Lynx), which reached end-of-life a year ago. We will upgrade that to Ubuntu 14.04 LTS (Trusty Tahr) at the schema change release. In particular, this means that we might start using Perl 5.18 features in the MusicBrainz Server code (as opposed to Perl 5.10 currently).

We understand that this is potentially a lot of work for some of our users, but occasionally we need to upgrade our requirements. We try and limit these sorts of upgrades as much as possible, so please bear with us.

Finally, onward to the details of the release. Please take a look at the list of issues that will be addressed in this release. The few tickets worth discussing in detail are:

  • MBS-8838 – “Add gids to all *_type tables”. This ticket adds MBIDs (GIDs in schema lingo) to all of our tables that define a type for some database element. Given that we recommend that external users never reference our data by row IDs, we really need to provide proper, permanent MBIDs for all elements of our database.
  • MBS-6024 – “Support more than one barcode on same release (SQL edition)”. This ticket adds the ability for the database to contain more than one barcode for a given release. However, it does not include the user interface portions of this feature. The team will add the user interface/edit portions in a later, non-schema-change release.
  • MBS-4501 – “Alternative tracklists”. This ticket creates a new feature that allows an alternative tracklist to be used for a given release. This is a better solution for handling conflicts between our style guidelines and how the data appears on the release. It is also a more elegant solution for translations of releases into different languages.

As usual, we will post final details about the release shortly before the release happens. If you have any questions about this release, feel free to ask specific questions in the tickets or general questions in the comments below.

(Edited 2016-03-16 at 12:55 UTC to add the upgraded Ubuntu requirement.)

Wary of the Web Sheriff

On September 29th, we received an email from a company called Web Sheriff, urgently requesting that we remove the name Adetayo Ayowale Onile-Ere from the Taio Cruz artist page. We investigated this request and quickly found that there were other resources on the net referencing both names. We also found other evidence of Web Sheriff working to change the name of this artist in other venues/sites.

One of the cornerstones of our music database, and of our principles, is keeping the data as clean and accurate as possible. We aim to edit with due diligence.

We declined to make the change and made the following request: if you can provide us with a birth certificate that shows Taio Cruz as the birth name, we’d be happy to make the change. Web Sheriff agreed to do that, and on October 13th we received a copy of this document:

[Image: JTC - Birth Certificate]

I inspected the document and quickly felt something was amiss. The document purports to be a birth certificate from the Chelsea and Westminster Hospital in London, UK, and the father’s occupation is listed as “Lawyer”. While I am not a legal expert, my understanding is that the usual terms for someone who practices law in the UK are “solicitor” and “barrister”, not “lawyer”. The use of “Lawyer” in this context seemed strange to me.

With this observation as my motivation, I rang up Her Majesty’s government to ask how I would go about verifying the validity of a birth certificate. I was told that the UK government could not verify the authenticity of a certificate, but that I could request a copy of the certificate myself, since such certificates are public record. For a fee of £9.25 and two weeks’ time, I would receive a copy of the certificate by email. I thought that was a good way forward and asked to order a copy. The lovely lady who helped me painstakingly endeavoured to ensure that all the details related to my request were communicated correctly. I have no doubt that I relayed the data I had accurately.

And with that, I waited. On November 8th, I received mail from Her Majesty’s government informing me that no such birth certificate could be found and that my payment would soon be refunded.

This strongly suggests that the document provided to us by Web Sheriff was not a legitimate copy of a birth certificate for Jacob Taio Cruz.

Being coerced into changing facts in our database under false pretences is something I find despicable. We felt harassed and intimidated by Web Sheriff’s efforts to pressure us into making this change.

I hope that this blog post will remind others to be wary and vigilant, and not to give in easily to coercion. That is how I intend to deal with Web Sheriff, and others, going forward, and I hope you will do the same!

For more info on Web Sheriff see their Wikipedia page.


UPDATE (March 9, 2016): This post has been edited for clarity. These changes follow a specific request from Web Sheriff that we delete or amend the post:

we must insist that you withdraw your false allegation of forgery

UPDATE: Since posting this, sharp-eyed readers have spotted more problems with the certificate:

  1. There is no “county of Westminster”. There is a city of Westminster.
  2. The hospital name is missing a “t”: Wesminster.
  3. This hospital may not have been called that in 1985. It seems to have existed in that form since about 1993.

Upgrading Postgres for MusicBrainz Live Data Feed users

We’re slowly approaching that time of year: schema change release time. After skipping our fall update to focus on some internal tasks, we’re ready to have another schema change release this spring, on May 16, 2016.

We have started the process of collecting the features we wish to ship in this schema change release, and we’ll be publishing that list in the coming weeks. However, we’re contemplating the impact of one more change we’d like to make: upgrading to a more recent version of Postgres.

Internally we are going to upgrade to Postgres 9.5, which was recently released; we expect that the Postgres team will have worked out the most significant kinks before we’re ready to move to it. However, even though we are moving to 9.5, we are considering the impact on our downstream users/customers who would need to make the same or a similar change.

While we are moving to version 9.5 of Postgres, we have the option of adopting only features that exist in Postgres 9.4, which would mean that our downstream users could continue to use Postgres 9.4. However, Postgres 9.5 has some nice features we’d like to use (e.g. UPSERT), so we’re pondering whether it is possible for us to require Postgres 9.5 from all of our Live Data Feed users starting on May 16, 2016.
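
To illustrate why UPSERT is attractive: in Postgres 9.5, a single INSERT ... ON CONFLICT statement can insert a row or atomically update it if it already exists, where previously this required a retry loop or a writable CTE. Here is a hypothetical example via JDBC (the table and columns are made up and not part of the MusicBrainz schema):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;

  public class UpsertExample {
      public static void main(String[] args) throws Exception {
          // Connection details are placeholders, not a real server.
          try (Connection conn = DriverManager.getConnection(
                   "jdbc:postgresql://localhost/musicbrainz_test", "mb", "mb");
               PreparedStatement st = conn.prepareStatement(
                   "INSERT INTO artist_stats (artist_id, listen_count) " +
                   "VALUES (?, ?) " +
                   "ON CONFLICT (artist_id) DO UPDATE " +
                   "SET listen_count = artist_stats.listen_count + EXCLUDED.listen_count")) {
              st.setInt(1, 42);   // hypothetical artist row id
              st.setInt(2, 1);    // one new listen to record
              st.executeUpdate(); // inserts the row, or bumps the count if it exists
          }
      }
  }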

We have already informally queried a few of our users, and so far it seems that requiring Postgres 9.5 is feasible. If you are a Live Data Feed user and feel that requiring Postgres 9.5 by May 16, 2016 is too much for you and your organization, please leave a comment on this blog post!