After many days of tinkering, the new search server has passed its tests and is nearly ready for deployment next week. After my last post on the search services, there were lots of questions, so I’ll give some more history on why I’m working on this now:
- The old Lucene based search services worked well, but installing them was a major pain. Installing compilers by hand, sacrificing chickens and hoping that things would work wasn’t my idea of fun.
- Lucene has a philosophy of working out of the box without significant tweaks. That’s great if you’re indexing a bunch of text, but indexing music metadata from an SQL database is a bit of a different beast. The usual Lucene tricks didn’t work so well for us, so we couldn’t tweak it to work better for us. Xapian requires a little more tuning out of the box, but our search results are much better now than they were before.
- Sending metadata lookup traffic to a service like Xapian is generally a good idea, as a single Xapian server can handle lookup traffic more elegantly than a Postgres database. And adding more search servers is easier than adding more database servers.
- Our traffic is growing — I expect us to handle twice as much traffic in July as we did the July before. A lot of this traffic growth is coming from people using our web-service to look up music. If the web-service slows down, the rest of the site slows down as well. So I’m trying to stay ahead of the curve an anticipate when we reach capacity and be able to add more machines as necessary
As of next week, MusicBrainz will have twice as much rack-space (20U’s of space!) and we can finally rack the two new servers that were donated a few months ago. Fortunately due to dropping bandwidth costs, this new space doesn’t really come at a greater expense to us — I expect our hosting costs to stay nearly the same as they are now. (about $1000/mo, btw)
This will allow us to have 3 times the search capacity we have now, which should keep the site working for a while longer. In fall I hope to start moving our web-service to Amazon’s EC2 service, which should allow us to get as much capacity as we need.
As soon as I get the new search services deployed I’m putting my head down and coding the next server update. So, keep your fingers crossed that this process goes smoothly.
I am very familiar with Lucene, and Solr too. You speak of difficulties with Lucene but don’t get into any details. Could you please offer more insite into your troubles? (forgive me if this is elsewhere; I looked)
check out sphinx search. many high traffic sites use it (craigilst).
Thanks for mentioning this — I’ll have to check it out!