Massive connectivity issues

As you are probably aware, we’ve been having lots of network connectivity issues with all services hosted at Digital West in California (all of our projects, except ListenBrainz and AcousticBrainz).

Today we spent all morning trying to replace what we thought to be a faulty switch. That process didn’t go very well at all – we hit every conceivable issue that we could’ve hit. And a few more.

But in this process we connected our gateway machines directly to our uplink (bypassing our switch) and the network issues persisted! After testing this setup with both of our machines, we’ve now conclusively eliminated our own equipment as the source of the trouble.

At this point the problem is in Digital West’s hands to fix. Thankfully their day staff will return to work in a few hours, and hopefully we will make some progress on this issue then.

Sorry for all of this hassle. 😦

Updated search jar/war files

Given what utter slackers we are, we haven’t yet finished updating the search server to output the new MBIDs that were added to some entities in our last release. We’ll try to get that done soonish.

However, we did update the search code to fix this error in the search indexer:

  ERROR: type "earth" does not exist

I’ve put both of these jar/war files on our FTP site.

If you would like to build these from source, you’ll need commit 4f677727 from mmd-schema and the latest master commit from search-server. For details on how to build it all, please follow these instructions.

UPDATE: The build from the current master of search-server appears to be unable to load indexes on startup. Please use the old war (we still run it in production) until we can release a fix.

Schema change update

We’ve finally completed the schema update and things are returning to normal. We need to get a new data dump out, and then we will provide upgrade instructions tomorrow. As you might be able to guess, if you run a replicated slave and are not already on Postgres 9.5, we are going to recommend a clean data import rather than a migration.

And if anyone even dares to ask (within the next week) when an updated VM will be released, you’ll owe each member of the development team 2 bars of high-quality chocolate.

Feeling lucky, punk?

P.S. Can you tell we’ve been up too long? 🙂

Server capacity update

Zas and I have been working hard to improve the capacity and stability of the site. In the last week we’ve identified and fixed at least 3 problems with the search servers, and we’ve added a timeout that terminates queries that run longer than 3 seconds. We think the main cause of trouble was that queries piled up behind a slow query; the servers never recovered from the backlog and consequently crashed.

We won’t go so far as to say that the search servers are fixed — every time we have a smidgen of hope that things are improving, they crash again. Seemingly out of spite! So, the search servers are better. 😉

Zas has also made a number of changes to the gateways and how we rate limit our incoming traffic. The rate limiting is now being done in a smarter way that reduces the overall traffic on our web servers. Well done!
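We won’t reproduce the actual gateway configuration here, but for the curious: the general idea behind this style of rate limiting is a token bucket, which allows short bursts while capping the sustained request rate per client. Here is a toy sketch in Java (purely illustrative, not our actual gateway code; the class and all numbers are made up):

  // Toy token-bucket rate limiter (illustrative only, not our gateway code).
  // Each client gets a bucket that refills at a steady rate; a request is
  // served only if a token is available, which caps the sustained rate
  // while still allowing short bursts up to the bucket's capacity.
  public class TokenBucket {
      private final double capacity;      // maximum burst size, in requests
      private final double refillPerSec;  // sustained rate, requests/second
      private double tokens;
      private long lastRefillNanos;

      public TokenBucket(double capacity, double refillPerSec) {
          this.capacity = capacity;
          this.refillPerSec = refillPerSec;
          this.tokens = capacity;
          this.lastRefillNanos = System.nanoTime();
      }

      public synchronized boolean tryAcquire() {
          long now = System.nanoTime();
          double elapsedSec = (now - lastRefillNanos) / 1_000_000_000.0;
          lastRefillNanos = now;
          tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
          if (tokens >= 1.0) {
              tokens -= 1.0;
              return true;   // within budget: let the request through
          }
          return false;      // over budget: reject or queue the request
      }
  }

For example, new TokenBucket(10, 5) would allow a client a burst of 10 requests but only 5 requests per second sustained.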

We’ve also increased our bandwidth budget by 4 Mbit/s, which makes the site feel considerably more responsive.

Let me put these improvements into numbers: about a week ago we were struggling to keep up with 250 requests per second and the site felt very sluggish. Now we can handle 500 requests per second and the site feels considerably faster. For large chunks of the day we are managing to handle all the traffic we should handle. And the search servers haven’t crashed in 4 days!

We hope that this will give us a solid base from which to release the schema upgrade tomorrow. Once that is complete, we will start work on moving to the new hosting company.

Thanks for being patient with us!

Sophie Goossens joins the MetaBrainz board of directors (and more!)

I’m pleased to announce that Sophie Goossens, an attorney in London, has joined the board of directors of the MetaBrainz Foundation. Sophie specializes in intellectual property law and has ties to the European Commission, which makes her a great addition to our board of directors.

Welcome to our board of directors, Sophie!

Sophie replaces Carol Smith, who decided to move on from the board after leaving her position as the head of Google’s Summer of Code program. Carol joined us in late 2009, served as our treasurer & secretary, and became a full director in early 2011.

Thank you for everything you’ve done for MetaBrainz in the past 6+ years, Carol!

Last, but not least, we needed to fill the Secretary/Treasurer slots that were vacated by Carol. Luckily for us, our business development manager Christina Smith stepped up to those duties and was voted onto the board back in February. (Now that all of these changes are complete, we can publicly speak about them.)

Thank you for taking on these two positions, Christina. I’m also quite happy that we’ve preserved the balance of people with the last name Smith in our board. 🙂

Thanks Sophie, Carol and Christina!

Help! Is there a Lucene doctor in the house?

UPDATE: Thanks to user selckin in the #lucene IRC channel for quickly solving this for us! Hopefully we can put this fix into production later today!

As our regular readers may know, we’ve been having lots of trouble with our Lucene-based search servers. Over the past few days we’ve spent a fair amount of time tuning, debugging, and otherwise trying to troubleshoot our setup. We’ve identified and fixed a number of problems, but most importantly we feel that we’ve identified the core issue: our servers are simply overloaded.

Under normal conditions we find our servers loaded to about 25%–35% CPU — things look good, and we don’t think we have a capacity problem. Then a slow query comes in and starts to slow things down. Much like a traffic jam that appears out of thin air, one slow query can make a giant mess for everyone.

We’ve started timing our queries, and most of the time they can be measured in milliseconds. When things get bad, however, they may take 7–8 seconds. Our upstream web servers time out on the search request after about 5 seconds in order to prevent traffic from getting backed up. What we need to do next is limit the duration that a Lucene query can run and terminate it once it exceeds the timeout.

I’ve started looking at this and quickly realized that this is much more of a job than adding a simple timeout parameter to the search call. We’re currently using this search function from IndexSearcher:

  public TopDocs search(Query query,  int n);

Ideally, I would like a way to time out queries after 3 seconds. So far, I’ve discovered that we could use

  public void search(Query query, Collector results)

with a TimeLimitingCollector. The old call returns TopDocs, and our code assumes that we have a TopDocs object from which to cull our search results. Having stared at the Lucene docs for a while, I haven’t found a way to take the data gathered by a TimeLimitingCollector and turn it into TopDocs. It doesn’t make sense to me. 😦

How does one do this? Sadly, we have no Java programmers on our team, so we’re quite a bit out of our league here. Is there an easier way to do this? Would someone be willing to write this code for us and submit a PR? We’d find some really good chocolate and send it to you if you do!
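UPDATE: As noted at the top of this post, selckin in #lucene came through for us! For anyone who lands here with the same question, the trick is that you don’t convert the TimeLimitingCollector at all: you wrap a normal TopDocs-producing collector (such as TopScoreDocCollector) inside it, and after the search you pull the TopDocs out of the inner collector. A rough sketch of the shape of the fix, assuming a 3 second budget and the searcher, query and n from the snippets above (the exact code we ship may differ):

  import org.apache.lucene.search.TimeLimitingCollector;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.search.TopScoreDocCollector;
  import org.apache.lucene.util.Counter;

  // Wrap the usual TopDocs-producing collector in a TimeLimitingCollector.
  // The inner collector keeps whatever hits were gathered before the clock
  // ran out, so a timed-out query can still return partial results.
  TopScoreDocCollector topDocsCollector = TopScoreDocCollector.create(n, true);
  Counter clock = TimeLimitingCollector.getGlobalCounter(); // ticks in ms
  TimeLimitingCollector collector =
          new TimeLimitingCollector(topDocsCollector, clock, 3000); // 3 seconds
  try {
      searcher.search(query, collector);
  } catch (TimeLimitingCollector.TimeExceededException e) {
      // The query ran past 3 seconds; fall through with partial results.
  }
  TopDocs topDocs = topDocsCollector.topDocs(); // same TopDocs as before

The nice part is that the rest of our code can keep consuming TopDocs exactly as it did with the old one-argument search call.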

More info on our project:

We are using Lucene 4.10.4 on a custom codebase that pre-dates SOLR — we have a new SOLR project to replace this one, but it isn’t quite done yet. (Again, not having Java programmers is a bit of a problem for us).

Any tips, explanations or pull requests would be deeply appreciated! Chocolate reward offered!

Thank you!