UPDATE: Thanks to user selckin in the #lucene IRC channel for quickly solving this for us! Hopefully we can put this fix into production later today!
As our regular readers may know, we’ve been having lots of troubles with our lucene based search servers. Over the past few days we’ve spent a fair amount of time, tuning, debugging and otherwise trying to troubleshoot our setup. We’ve fixed and identified a number of problems, but most importantly we feel that we’ve identified the core issue: Our servers are simply overloaded.
Under normal conditions we find our servers loaded to about 25% – 35% CPU — things look good and we don’t think we have a capacity problem with our servers. Then a slow query comes in that starts to slow things down. Much like a traffic jam that evolves out of thin air, one slow query can make a giant mess for everyone.
We’ve started timing our queries and most of the time, they can be measured in milliseconds. However, when things get bad, they may take up to 7-8 seconds. Our upstream web servers time out on the search request after about 5 seconds in order to prevent traffic from getting backed-up. What we need to do next is to limit the duration that a lucene query can run and terminate it after the timeout.
I’ve started looking at this and quickly realized that this is much more of a job than adding a simple timeout parameter to the search call. We’re currently using this search function from IndexSearcher:
public TopDocs search(Query query, int n);
Ideally I would like to add a way to timeout queries after 3 seconds. So far, I’ve discovered that we could use
public void search(Query query, Collector results)
with a TimeLimitedCollector. The old call returns TopDocs and our code assumes that we have a TopDocs object from which to cull our search results. Having stared at the docs for lucene for a while, I haven’t found an way to convert the data in TimeLimitedCollector and convert it to TopDocs. It doesn’t make sense to me. 😦
How does one do this? Sadly, we have no Java programmers on our team, so we’re quite a bit out of our league here. Is there an easier way to do this? Would someone be willing to write this code for us and submit a PR? We’d find some really good chocolate and send it to you if you do!
More info on our project:
- This project provides critical search functions for MusicBrainz.
- The source lives here
- My attempt at converting this code to use the TimeLimitedCollector is here: https://bitbucket.org/metabrainz/search-server/commits/ce00b13b799c1e69e24fa87299342144ec481674
We are using Lucene 4.10.4 on a custom codebase that pre-dates SOLR — we have a new SOLR project to replace this one, but it isn’t quite done yet. (Again, not having Java programmers is a bit of a problem for us).
Any tips, explanations or pull requests would be deeply appreciated! Chocolate reward offered!
Thank you!