The 2019 MetaBrainz Summit took place on 27th–29th of September 2019 in Barcelona, Spain at the MetaBrainz HQ. The Summit is a chance for MetaBrainz staff and the community to gather and plan ahead for the next year. This report is a recap of what was discussed and what lies ahead for the community.
The new search is live on MusicBrainz with this server update, as announced in the previous blog post. This release also continues the rewrite to React, and improves and fixes the handling of external URLs. The git tag is v-2018-06-30.
- [MBS-9736] – Convert the artist search results page to React
- [MBS-8334] – Digest auth with username containing non-ascii characters fails
- [MBS-9730] – Cannot link to a Bandcamp Daily review page in release group relationships
- [MBS-9734] – Inconsistency between the JSON search API and the lookup/browse one in ws/2/
- [MBS-9742] – Some Library of Congress URLs are not recognised
- [MBS-9743] – Beatport URL cleanup fails for names starting with digit
- [MBS-9408] – Add Juno Download links to the sidebar
- [MBS-9439] – Make changing a URL between HTTP and HTTPS an autoedit
- [MBS-9740] – Update Facebook URL cleanup
Many people thought this day would never come. 🙂
Hey folks, samj1912 here again o/
As you might know, we recently did a massive upgrade of our search infrastructure. If you have not been following our Solr updates, definitely check out our other blog detailing our search server journey and the improvements and changes that come with the new search.
We have had a beta run with Solr this last week and fixed most of the show-stopping bugs. As such, we have been stress testing our Solr search by replaying our production logs on it, live.
Solr search seems to solvr almost all our qualms with search and as such, we have made the decision to use Solr for our production search servers.
The purpose of this blog post, as nicely worded by our BDFL Rob is –
Speak now or forever hold your pickle. In a week, the ole search servers gets it.
And it’s basically that: if you haven’t experimented with Solr search, please read our earlier blog to know what’s what. If you find any bugs, please report them on our ticket tracker. Unless new show-stoppers are reported that absolutely must be fixed before we switch to Solr on the main website, we will be killing the old search servers and replacing them with our brand new Solr ones in a week.
Apart from that, we have made a discourse thread to report any minor improvements in the search results.
Another thing I’d like to remind everyone of is that, with our switch to this new Solr infrastructure, the version 1 web service (ws/1) will soon be discontinued. As announced earlier, we will keep it alive till 31st July 2018, but it will get the axe on 1st August 2018, 12 pm GMT.
Hello people o/, samj1912 here.
I am extremely glad to announce that we are finally launching our Solr search on the MusicBrainz beta server!
Just a little history before I announce the new features and toys you get to play with:
Solr started as something that could replace our existing search infrastructure. If you have been a MusicBrainz user for a while, you might know that our search has quite an indexing latency and it takes as much as 3 hours for new edits to show up in the search results, in part because updating the search index involved doing an entire re-index of the database. With the high latency and the resources it took, the current search server left much to be desired.
Another area where our current search fell short was showing popular results and search ranking. Searching for a famous artist or place returned results that contained a lot of noise, and more often than not, contained results that weren’t relevant to what the user had in mind when they searched for it.
These were the two major problems that motivated us to shift to a better infrastructure for our search needs.
Thus, MB-Solr was born.
It has been in development for quite some time now. The coding for the project started with Mineo back in 2014 and was carried forward by Jeff Weeksio in GSoC 2015. But due to lack of development resources and other, more pressing needs, the project was put on hold for a while, until Roman started working on it. However, he left MetaBrainz before he could finish this work, so when I joined the MetaBrainz team, the first and foremost task that was assigned to me was getting Solr working and ready for production.
After struggling with multiple moving parts and services, tons of issues with maintaining compatibility with our existing web-service API, rowing up and down multi-threading/processing hell, learning just enough about information retrieval to get our search relevance on point and countless hours sifting through Solr documentation to get our Solr cluster fine-tuned and running fast enough to keep up with our web traffic… we are finally here.
I am pretty sure I would’ve rage-quit dozens of times during this last year if I was doing this all alone.
As such, we have our trusty sysadmin Zas to thank for taking care of all the deployment needs and making sure Solr was well-tested (believe me, we toyed with Solr like little kids in a sandbox) and wasn’t going to fail and wake him up at 3 AM with red alerts all over. Mineo, Bitmap and Yvanzo were there, with much-needed code reviews and help with all things Solr and MusicBrainz. Our style leader Reosarevok, and CatQuest helped us test our new search relevance configuration. And of course, we had our BDFL, Rob, overseeing things and whipping them into shape (with chocolate and mismatched socks of course).
Anyway, here’s what you are here for:
- (Almost) Instantaneous search-index updates – Edit something and immediately see it in the search results. Say goodbye to that note you used to see below the search telling you that you have to wait. Who likes waiting anymore – seriously, it’s 2018.
- Better search results – We wanted to make sure you were getting the right Queen and London as the top result. You can finally link your favorite artist to London, UK as opposed to London, Arkansas. Don’t believe me? Go try it out.
- Less load on our servers – Meaning we can serve more of your requests, faster. Getting tired of waiting for tagging your bajillion songs in Picard? Well, you still gotta wait, but less so, now that we are better equipped to handle your requests.
What has stayed the same
- WS/2 Search API – We know you devs hate doing that extra work to maintain your applications’ compatibility with that one site that changes its API on a whim. Well, we wouldn’t want you to spend those hours following that one int to float change that broke everything ever. As such we have worked hard to make sure that Solr doesn’t change any of our WS/2 search schema.
- WS/1 Search API – We deprecated WS/1 back in 2011. With the new search servers in place, there are only 3 words for those still using it 7 years after it was deprecated – ‘poof, it’s gone’. The service still works on our main website, but its search functionality will be phased out soon, while the entire service will be discontinued in August 2018 as announced earlier.
Now, you must be thinking there is some catch, some slip. Well so do I, which is why we are releasing this beta for you to test the heck out of our new search over at the MusicBrainz beta site. If you haven’t used it before, worry not – it has all your personalizations and all our cool music metadata from our main site. You should feel at home. (Note: The MusicBrainz beta site works on the live data. Any edits you make on the MusicBrainz beta site will also be reflected on the main site.)
So please! Go check it out!
If you feel you aren’t getting what we promised you or you want more of those shiny new features or that this blog was too long or like a TV commercial, feel free to complain at our Ticket Tracker for Solr. You get your promised features bug-free and our devs get to earn their living. It’s a win-win.
So as you might know, I recently joined the MetaBrainz team and my first project was the completion of our long-standing Solr search project to provide live search indexing for the MusicBrainz database.
I am happy to announce that we are finally rolling out an alpha release for you to test out. You can try it at https://test.musicbrainz.org/search or use the webservice end-point at https://test.musicbrainz.org/ws/2/
What this means –
- You can now instantly search for entities that have been updated. There should be a maximum 15 second delay between the database update and the entity changes being reflected on the search.
- This implies that once we have ironed out the Solr search we can finally retire the direct database search on the main site and use Solr with its advanced search syntax. For details on the new syntax features you can refer to the Lucene query parser documentation. For details on field types you can refer to our Search Syntax guide.
- As I said, the Solr search is still in its alpha stage, thus it can be unstable and have bugs. As such do not depend on it for your critical applications.
- Speaking of bugs, here’s where we need your help the most! We want testers to use Solr as extensively as possible and file any bugs you encounter at our Solr Issue tracker. You may encounter bugs like –
- Missing fields in the API output for the webservice.
- Certain types of queries not working in Solr search that happen to work on the main website.
- Missing data/edits/updates not being indexed.
- Since we haven’t ported our search analyzers in their entirety, Solr might have worse search results than our main search.
I would like to re-iterate – Solr is still in alpha and not everything is perfect. We need your help to make it so.
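For testers who want to poke at the webservice end-point from code, here is a minimal sketch of how a search URL with a Lucene-style query could be put together. The base URL comes from the announcement above; the `query` and `fmt` parameter names, the `artist` and `country` field names, and the class and method names are assumptions that are not spelled out in this post, so check them against the Search Syntax guide before relying on them.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SearchUrlSketch {
    // Base of the test web service, taken from the announcement above.
    static final String WS2_BASE = "https://test.musicbrainz.org/ws/2/";

    /**
     * Builds a search URL for one entity type (for example "artist").
     * Assumption: the service accepts the Lucene query in a "query"
     * parameter and a response format in "fmt".
     */
    static String buildSearchUrl(String entity, String luceneQuery) {
        // Percent-encode the query so that ':' and spaces survive the trip.
        String encoded = URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8);
        return WS2_BASE + entity + "?query=" + encoded + "&fmt=json";
    }

    public static void main(String[] args) {
        // Field names here are illustrative; see the Search Syntax guide.
        System.out.println(buildSearchUrl("artist", "artist:queen AND country:GB"));
    }
}
```

The sketch only builds the URL and does not send a request, so it is safe to run while the alpha server is unstable.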
Given the utter slackers we are, we haven’t yet finished updating the search server to output the new MBIDs that were added to some entities in our last release. We’ll try and get that done soonish.
However, we did update the search code to fix this error in the search indexer:
ERROR: type “earth” does not exist
I’ve put both of these jar/war files on our FTP site:
- searchserver-a947c76.war (see UPDATE below)
If you would like to try and build these from source, you’ll need commit 4f677727 from mmd-schema and the latest master commit from search-server. To build them, please follow these instructions.
UPDATE: The build from the current master for search-server appears to not be able to load indexes upon startup. Please use the old war (we still use this in production) until we can release a fix.
UPDATE: Thanks to user selckin in the #lucene IRC channel for quickly solving this for us! Hopefully we can put this fix into production later today!
As our regular readers may know, we’ve been having lots of trouble with our Lucene-based search servers. Over the past few days we’ve spent a fair amount of time tuning, debugging and otherwise trying to troubleshoot our setup. We’ve identified and fixed a number of problems, but most importantly we feel that we’ve identified the core issue: our servers are simply overloaded.
Under normal conditions we find our servers loaded to about 25% – 35% CPU — things look good and we don’t think we have a capacity problem with our servers. Then a slow query comes in that starts to slow things down. Much like a traffic jam that evolves out of thin air, one slow query can make a giant mess for everyone.
We’ve started timing our queries and most of the time, they can be measured in milliseconds. However, when things get bad, they may take up to 7-8 seconds. Our upstream web servers time out on the search request after about 5 seconds in order to prevent traffic from getting backed-up. What we need to do next is to limit the duration that a lucene query can run and terminate it after the timeout.
I’ve started looking at this and quickly realized that this is much more of a job than adding a simple timeout parameter to the search call. We’re currently using this search function from IndexSearcher:
public TopDocs search(Query query, int n);
Ideally I would like to add a way to timeout queries after 3 seconds. So far, I’ve discovered that we could use
public void search(Query query, Collector results)
with a TimeLimitedCollector. The old call returns TopDocs and our code assumes that we have a TopDocs object from which to cull our search results. Having stared at the docs for lucene for a while, I haven’t found a way to convert the data in TimeLimitedCollector to TopDocs. It doesn’t make sense to me. 😦
How does one do this? Sadly, we have no Java programmers on our team, so we’re quite a bit out of our league here. Is there an easier way to do this? Would someone be willing to write this code for us and submit a PR? We’d find some really good chocolate and send it to you if you do!
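To make the question concrete, here is a minimal, self-contained sketch of the wrapping pattern we think is involved: the time-limiting collector does not hold results itself, it forwards hits to an inner collector, and the top results are read from that inner collector after the search returns or times out. Every type below (`Collector`, `TopHitsCollector`, `TimeLimitedCollector`, `Hit`) is a simplified stand-in written for this illustration, not Lucene's API. In Lucene 4.x the closest real classes appear to be `TopScoreDocCollector`, whose `topDocs()` method returns a `TopDocs`, and `TimeLimitingCollector`; treat those names as pointers to check against the Javadoc rather than a confirmed fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.LongSupplier;

final class TimeLimitedSearchSketch {

    /** Stand-in for one search hit: a document id and its score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Hypothetical, simplified collector interface (not Lucene's). */
    interface Collector {
        void collect(int doc, float score);
    }

    /** Keeps the n best-scoring hits seen so far. */
    static final class TopHitsCollector implements Collector {
        private final int n;
        // Min-heap on score, so the weakest kept hit is evicted first.
        private final PriorityQueue<Hit> heap =
                new PriorityQueue<Hit>((a, b) -> Float.compare(a.score, b.score));

        TopHitsCollector(int n) { this.n = n; }

        @Override
        public void collect(int doc, float score) {
            heap.add(new Hit(doc, score));
            if (heap.size() > n) {
                heap.poll();
            }
        }

        /** Returns the kept hits, best score first. */
        List<Hit> topHits() {
            List<Hit> hits = new ArrayList<>(heap);
            hits.sort((a, b) -> Float.compare(b.score, a.score));
            return hits;
        }
    }

    /** Thrown by the wrapper once the time budget is spent. */
    static final class TimeExceededException extends RuntimeException {
        TimeExceededException() { super("search time budget exceeded"); }
    }

    /** Forwards hits to another collector until timeoutMs has elapsed. */
    static final class TimeLimitedCollector implements Collector {
        private final Collector delegate;
        private final LongSupplier clockMs;
        private final long deadlineMs;

        TimeLimitedCollector(Collector delegate, LongSupplier clockMs, long timeoutMs) {
            this.delegate = delegate;
            this.clockMs = clockMs;
            this.deadlineMs = clockMs.getAsLong() + timeoutMs;
        }

        @Override
        public void collect(int doc, float score) {
            if (clockMs.getAsLong() > deadlineMs) {
                throw new TimeExceededException();
            }
            delegate.collect(doc, score);
        }
    }

    /** Simulated search: document i has score scores[i]. */
    static List<Hit> searchWithTimeout(float[] scores, int n, long timeoutMs, LongSupplier clockMs) {
        TopHitsCollector top = new TopHitsCollector(n);
        Collector limited = new TimeLimitedCollector(top, clockMs, timeoutMs);
        try {
            for (int doc = 0; doc < scores.length; doc++) {
                limited.collect(doc, scores[doc]);
            }
        } catch (TimeExceededException e) {
            // Timed out: keep whatever the inner collector gathered so far.
        }
        return top.topHits();
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.9f, 0.5f, 0.99f, 0.7f};
        for (Hit h : searchWithTimeout(scores, 2, 3000, System::currentTimeMillis)) {
            System.out.println(h.doc + " " + h.score);
        }
    }
}
```

The clock is passed in as a parameter so the timeout path can be exercised with a fake clock instead of a genuinely slow query. Whether partial results should be returned to the user or replaced by an error on timeout is a separate decision that this sketch does not settle.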
More info on our project:
- This project provides critical search functions for MusicBrainz.
- The source lives here
- My attempt at converting this code to use the TimeLimitedCollector is here: https://bitbucket.org/metabrainz/search-server/commits/ce00b13b799c1e69e24fa87299342144ec481674
We are using Lucene 4.10.4 on a custom codebase that pre-dates SOLR — we have a new SOLR project to replace this one, but it isn’t quite done yet. (Again, not having Java programmers is a bit of a problem for us).
Any tips, explanations or pull requests would be deeply appreciated! Chocolate reward offered!