GSoC 2018: More detailed integration of AcousticBrainz with MusicBrainz

A fantastic summer comes to an end, and it’s time to wrap up the GSoC project I have been working on for the last 3 months (the official GSoC coding period).

Hello people!!

I am Rashi Sah, an undergraduate student at the National Institute of Technology, Hamirpur, India. I have been working on a really cool AcousticBrainz project for MetaBrainz Foundation Inc. as a participant in Google Summer of Code ’18. It has been an amazing experience, and I’ve learned a lot over the summer, spending countless days and nights taking the project to completion. I decided to contribute to MetaBrainz in late December, spent some time understanding the project’s codebase, and have been creating pull requests and pushing commits for many features, tasks and bug fixes since January 2018. This blog post covers my experience as a GSoC student and the work I’ve done for the program so far.

Before the GSoC program started, I looked for some good-first-bugs and found a few tickets to work on. Then I talked to the AcousticBrainz community members and started contributing. I created some big PRs, mostly adding new features to AcousticBrainz, and also worked on many bug fixes which are already merged into the AcousticBrainz codebase. New feature PRs include AB-21, AB-98 and AB-298. In mid-February, I started looking for a suitable idea to work on for the GSoC program and to write a proposal for it. As March approached, I discussed the proposal a lot with MetaBrainz community members, especially Alastair, the AcousticBrainz project lead, who helped me a great deal by reviewing it and guiding me to improve it. In late April, my proposal for a more detailed integration of AcousticBrainz with MusicBrainz was accepted. During the community bonding period, I mostly continued the work I had already been doing for the past 3–4 months.

Getting entity information from the MusicBrainz database

The first thing I worked on when the official GSoC coding period began was adding a way to directly access the MusicBrainz database for different entities to the MusicBrainz database module in BrainzUtils (a Python package of utilities shared across our MetaBrainz projects). I worked on getting artist and release entity information from the MusicBrainz database via a direct connection. (See PRs BU-13 and BU-14.) Later, I worked on setting up the MusicBrainz server by adding a service to AcousticBrainz’s docker-compose files, allowing us to easily read data directly from the MusicBrainz database in AcousticBrainz (PR AB-334). A major aim of the project was to implement both methods of MusicBrainz database access in AcousticBrainz, in particular importing the MusicBrainz data into AcousticBrainz from scratch, and then to decide which method works better for each piece of functionality in AcousticBrainz that uses MusicBrainz data.
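To give a rough idea of what the direct-connection approach looks like, here is a minimal sketch of fetching an artist straight from the MusicBrainz database. It only illustrates the idea: the connection URI, helper name and column list are simplified assumptions, not the actual BrainzUtils code from BU-13/BU-14.

```python
# Minimal sketch of the "direct connection" idea, not the exact BrainzUtils
# API: the connection URI, function name and column list are assumptions.
import sqlalchemy

# The MusicBrainz server runs as its own service (see the docker-compose
# change in AB-334); "musicbrainz_db" is a placeholder hostname.
engine = sqlalchemy.create_engine(
    "postgresql://musicbrainz@musicbrainz_db:5432/musicbrainz_db")

def get_artist_by_mbid(mbid):
    """Fetch basic artist information straight from the MusicBrainz database."""
    query = sqlalchemy.text(
        "SELECT gid, name, comment FROM artist WHERE gid = :mbid")
    with engine.connect() as connection:
        row = connection.execute(query, {"mbid": mbid}).fetchone()
    if row is None:
        return None
    return {"mbid": row[0], "name": row[1], "comment": row[2]}
```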

Import the MusicBrainz data into the AcousticBrainz database

MusicBrainz’s database contains a huge number of tables, but I analysed how MB data is used in AB and made a list of the tables we would actually need for our AcousticBrainz integrations. Then I made a PR (AB-338) creating new tables in the AB database under the MusicBrainz schema. Later, I worked on a big PR (AB-340) which imports the MB data corresponding to every recording present in AcousticBrainz’s database and writes it into the tables of the MusicBrainz schema in AB. This PR was really huge, and I had to take care of a lot of integrity constraints and foreign key dependencies.
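Conceptually, the importer copies the MusicBrainz rows related to each recording into AB’s musicbrainz schema, writing parent tables before the tables that reference them so the foreign key constraints are satisfied. Here is a heavily condensed, hypothetical sketch of that idea; only two tables are shown, and the connection URIs and queries are simplified assumptions, not the actual code from AB-340.

```python
# Hypothetical, heavily condensed sketch of the importer behind AB-340.
# Only two tables are shown and the connection URIs are placeholders; the
# real importer copies many more MusicBrainz tables.
import sqlalchemy

mb_engine = sqlalchemy.create_engine(
    "postgresql://musicbrainz@musicbrainz_db/musicbrainz_db")
ab_engine = sqlalchemy.create_engine(
    "postgresql://acousticbrainz@db/acousticbrainz")

# Parent tables have to be written before the tables that reference them, or
# the foreign key constraints in AB's "musicbrainz" schema are violated:
# recording.artist_credit points at artist_credit.id, so artist_credit first.
COPY_QUERIES = [
    ("musicbrainz.artist_credit",
     "SELECT ac.* FROM artist_credit ac"
     " JOIN recording r ON r.artist_credit = ac.id WHERE r.gid = :mbid"),
    ("musicbrainz.recording",
     "SELECT r.* FROM recording r WHERE r.gid = :mbid"),
]

def import_musicbrainz_data(mbid):
    """Copy the MusicBrainz rows related to one recording into AB."""
    with mb_engine.connect() as mb, ab_engine.begin() as ab:
        for target_table, query in COPY_QUERIES:
            for row in mb.execute(sqlalchemy.text(query), {"mbid": mbid}):
                values = dict(row._mapping)  # column name -> value
                columns = ", ".join(values)
                placeholders = ", ".join(":" + name for name in values)
                ab.execute(
                    sqlalchemy.text(
                        "INSERT INTO %s (%s) VALUES (%s) ON CONFLICT DO NOTHING"
                        % (target_table, columns, placeholders)),
                    values)
```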

Update MB data in AB for every new recording added to AB

Another feature I worked on after importing the MB data was updating the MB data present in AB whenever a new recording is added to the AcousticBrainz database (see PR AB-346), by importing the data from MB’s database via the direct connection. While working on a few bug fixes, my mentor Param and I realized that the MB data import was taking much longer than expected when I ran the MusicBrainz importer script against full MB data dumps (around 2.8 GB). So I then worked on making the MusicBrainz importer more efficient, and was able to import the data for a few recordings within seconds (see PR AB-348). This meant digging into each table import and tracking down the parts of the code that were slowing things down.

To reduce the load on the processor, I added a 5-second sleep to the MusicBrainz importer module, so it waits before importing data for any new recording (see PR AB-354); a sketch of that loop appears below. During my GSoC period, I learned how important it is to write tests and make them run fast. I wrote tests for almost every script inside the db module. Later, I worked on writing tests for the MusicBrainz importer script (AB-352).
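Conceptually, the importer from AB-354 behaves like a small worker loop: look for recordings that don’t have MusicBrainz data yet, import them, then sleep for a few seconds before checking again. A minimal sketch of that pattern (the two helper functions are placeholders, not actual AcousticBrainz function names):

```python
# Minimal sketch of the polling loop idea from AB-354. The two helper
# functions are placeholders for the real db-module code, not actual
# AcousticBrainz function names.
import time

SLEEP_SECONDS = 5  # pause between rounds to reduce the load on the processor

def run_musicbrainz_importer(get_pending_mbids, import_musicbrainz_data):
    """Continuously import MB data for newly submitted recordings."""
    while True:
        for mbid in get_pending_mbids():   # recordings in AB without MB data yet
            import_musicbrainz_data(mbid)  # copy the related MB rows into AB
        time.sleep(SLEEP_SECONDS)          # wait before checking again
```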

Apply replication packets to keep MB data in AB updated with the actual MusicBrainz database

Then came another tricky part of the project: updating the MusicBrainz schema data in AB whenever anything changes in the actual MusicBrainz database, whether an update or a deletion. MusicBrainz provides hourly replication packets which describe the changes to the database over a specific period. Replication packets are .tar.bz2 archives containing a collection of files, and they can be downloaded via the MetaBrainz API. Lukas Lalinsky, a long-time contributor to MetaBrainz projects, the founder of AcoustID and maintainer of the mbdata Python module, had already worked on applying replication packets to MB data. I modified his script substantially so that it applies replication packets to the MusicBrainz schema data, keeping the data for the recordings present in AcousticBrainz up to date (see AB-350).
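In outline, applying a replication packet means downloading the hourly .tar.bz2 archive for the next replication sequence number from the MetaBrainz API, unpacking it, and replaying the recorded changes against the local musicbrainz schema. Here is a hedged sketch of the download-and-unpack step only; the URL format and token parameter follow the public MetaBrainz API but should be treated as assumptions, and actually replaying the changes (the interesting part of AB-350) is not shown.

```python
# Hedged sketch of fetching and unpacking one hourly replication packet. The
# URL format and token parameter follow the public MetaBrainz API but should
# be treated as assumptions; replaying the changes against the musicbrainz
# schema is not shown here.
import tarfile
import requests

METABRAINZ_API = "https://metabrainz.org/api/musicbrainz"

def fetch_replication_packet(sequence, token, dest_dir="/tmp"):
    """Download replication packet number `sequence` and unpack it."""
    url = "%s/replication-%d.tar.bz2" % (METABRAINZ_API, sequence)
    response = requests.get(url, params={"token": token}, stream=True)
    if response.status_code == 404:
        return None  # packet not published yet; try again later
    response.raise_for_status()
    archive_path = "%s/replication-%d.tar.bz2" % (dest_dir, sequence)
    with open(archive_path, "wb") as archive:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            archive.write(chunk)
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(dest_dir)
    return dest_dir  # the unpacked files describe the changes to replay
```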

Integration with MB database: Use MBID redirect information to get original entity

After working on the direct connection, importing the MusicBrainz data and keeping it updated, it was time to start writing evaluation scripts to decide which method is better for any integration we apply in AcousticBrainz. I wrote a script implementing an integration of AB with the MB database that uses the redirect information of an entity to return the original entity corresponding to the MBID provided (see PR AB-356).
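The redirect lookup itself is small: if an MBID is not found in the recording table, check recording_gid_redirect, which maps merged or obsolete MBIDs to the row of the recording they now point to. A simplified sketch follows; the wrapper function name is illustrative, and it assumes the redirect table is among the imported ones.

```python
# Simplified sketch of MBID-redirect resolution against the musicbrainz schema
# imported into AcousticBrainz; the wrapper function name is illustrative and
# assumes recording_gid_redirect was among the imported tables.
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://acousticbrainz@db/acousticbrainz")

def resolve_recording_mbid(mbid):
    """Return the current (gid, name) for a recording MBID, following redirects."""
    with engine.connect() as connection:
        row = connection.execute(sqlalchemy.text(
            "SELECT gid, name FROM musicbrainz.recording WHERE gid = :mbid"),
            {"mbid": mbid}).fetchone()
        if row:
            return row[0], row[1]
        # Not found directly: the MBID may belong to a recording that was
        # merged into another one; the redirect table knows the new row.
        row = connection.execute(sqlalchemy.text("""
            SELECT r.gid, r.name
              FROM musicbrainz.recording_gid_redirect rgr
              JOIN musicbrainz.recording r ON r.id = rgr.new_id
             WHERE rgr.gid = :mbid
        """), {"mbid": mbid}).fetchone()
        return (row[0], row[1]) if row else None
```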

Evaluate both methods of MusicBrainz database access in AcousticBrainz

Now for the last, and most important, piece of work of my GSoC period. After implementing both methods, we needed to evaluate them to see which one is more efficient for a specific integration with the MB database. I first wrote an evaluation script which fetches data from the recording and low-level tables. In this case, the difference in time taken by the two methods turned out to be really large (approx. 70 seconds for around 250+ recordings). So whenever we need data from local AB tables as well as MB tables, we would go with the imported-database method, as it turns out to be the faster one. Next I tested the MBID redirect integration, where I didn’t find much difference between the two methods (PR AB-357). I ran these tests locally, though; tests in production may yield different results.
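The evaluation scripts essentially time the same batch of lookups done two ways: once against the locally imported tables and once over the direct connection to the MusicBrainz server. A rough sketch of that comparison, with the two lookup callables standing in for the real implementations:

```python
# Rough sketch of the evaluation idea behind AB-357: time the same batch of
# lookups via both access methods. The two lookup callables stand in for the
# real implementations; only the timing logic is being illustrated.
import time

def time_method(lookup, mbids):
    """Run `lookup` for every MBID and return the total wall-clock time."""
    start = time.monotonic()
    for mbid in mbids:
        lookup(mbid)
    return time.monotonic() - start

def compare_methods(lookup_via_imported_tables, lookup_via_direct_connection, mbids):
    imported = time_method(lookup_via_imported_tables, mbids)
    direct = time_method(lookup_via_direct_connection, mbids)
    print("imported tables: %.2fs, direct connection: %.2fs for %d recordings"
          % (imported, direct, len(mbids)))
```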

All in all, it has been an exciting summer. By now I am familiar with a good part of the AcousticBrainz codebase. I really look forward to working on many more integrations with MB data in AcousticBrainz, and I plan to completely remove AB’s dependency on the web service for accessing the MusicBrainz database, which would be very useful for users.

Details of contributions made

By the end of the GSoC coding period, I had opened a total of 39 PRs: 35 to the AcousticBrainz server, 3 to BrainzUtils and 1 to the AcousticBrainz client, and made a total of 135 commits (109 in AB, 9 in BU, 3 in AC and 14 in AB master). Of these, the pull requests created and merged during the official GSoC coding period are PRs to the AcousticBrainz server and PRs to BrainzUtils.

These last three months were full of thrill and excitement, and plenty of frustration as well. And it doesn’t end here: I’d love to keep contributing and to act as a maintainer for the AcousticBrainz project. I believe people should try contributing to open source organizations, as it helps you learn and gain a lot of experience in a short period of time, especially when working through a great platform like Google Summer of Code.

I am really happy working with the awesome MetaBrainz community; the people here are fantastic, and I’d love to stay a part of MetaBrainz in the future as well. So, in the end, a big thanks to my mentor Param Singh, without whose help and support throughout the program it wouldn’t have been possible for me to reach the end phase of GSoC; to my organization admin Robert Kaye, AcousticBrainz project lead Alastair Porter and all of the MetaBrainz Foundation community members for choosing me as a GSoC student, giving me such a great opportunity and being very kind and helpful throughout the program; and to Google for making this all possible. I hope I get a chance to work with you all again!!

GSoC 2018: A way to associate listens with MBIDs

Hi, I’m Kartikeya Sharma, a postgrad student at the National Institute of Technology, Hamirpur. I worked on the MessyBrainz project as a student developer for GSoC 2018, mentored by Robert Kaye. The goal of my project is to associate MBIDs with MSIDs and to cluster together the MSIDs which represent the same MBID. An MBID (MusicBrainz Identifier) is a Universally Unique Identifier permanently assigned to each entity in the MusicBrainz database; an MSID (MessyBrainz Identifier) is associated with each unique recording, artist_credit and release in the MessyBrainz database. In simple words, MSIDs represent unclean metadata whereas MBIDs represent clean metadata.

This blog post summarizes the work that I did in my project, which was divided into three parts.

Processing the data already in MessyBrainz database

The first part involves creating clusters using the MBIDs already present in the MessyBrainz database. This means creating clusters for recordings, artists, and releases. To implement this part I created three PRs: #37, #41, and #44.

After that, I began to work on the second part, which involves creating clusters using the artist MBIDs, release MBIDs and names fetched from the MusicBrainz database. To access the MusicBrainz database, I first had to work on BrainzUtils so it had methods for fetching artist MBIDs using recording MBIDs, and release names and MBIDs using recording MBIDs. The part that fetches artist MBIDs was done during the community bonding period in PR #14 at BrainzUtils, and to fetch releases I created PR #18 at BrainzUtils during the GSoC coding period. After that, I created a PR to create clusters using the fetched artist MBIDs (#47) and another one to create clusters using the fetched releases (#49).

I wrote around 60 tests, which proved vital in making sure that the code does what it’s supposed to do.

Processing the data as it is inserted into the MessyBrainz database

Creating clusters for the data already inside the database requires a lot of resources, so it is better to create clusters as recordings are inserted into the database. Even that kind of clustering is not efficient if done inline, though. So, to cluster these recordings, they are first sent to a RabbitMQ server, and from there to a clustering script which runs continuously in a separate container and clusters the incoming recordings. That way clustering does not slow down the process of submitting recordings to the database. For this I created PR #50.
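Here is a stripped-down sketch of such a consumer using pika; the queue name, host and the clustering callback are placeholders rather than the exact MessyBrainz setup from PR #50.

```python
# Stripped-down sketch of a clustering consumer using pika; the queue name,
# host and the cluster_recording callback are placeholders, not the exact
# MessyBrainz setup from PR #50.
import json
import pika

def cluster_recording(recording):
    """Placeholder for the real clustering logic applied to one submission."""

def run_cluster_writer():
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="messybrainz_recordings", durable=True)

    def on_message(channel_, method, properties, body):
        cluster_recording(json.loads(body))  # cluster the incoming recording
        channel_.basic_ack(delivery_tag=method.delivery_tag)

    # The submission path only publishes to the queue and returns immediately,
    # so inserting recordings is not slowed down by the clustering work.
    channel.basic_consume(queue="messybrainz_recordings", on_message_callback=on_message)
    channel.start_consuming()
```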

Create endpoints to access MSIDs and MBIDs

I created two API endpoints in PR #51. One endpoint fetches MBIDs and MSIDs using an MSID; the other fetches MSIDs using an MBID. This way end users can access MBIDs and MSIDs, which may be used for calculating different stats.
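In spirit, the two endpoints are thin lookups over the clustering tables. A hypothetical Flask sketch follows; the route paths, function names and the db helpers are illustrative, not the exact endpoints from PR #51.

```python
# Hypothetical Flask sketch of the two lookup endpoints; the route paths and
# the db helper functions are illustrative, not the exact code from PR #51.
from flask import Blueprint, jsonify

import db  # assumed wrapper module around the MessyBrainz database queries

api_bp = Blueprint("api", __name__)

@api_bp.route("/<uuid:msid>/mbids")
def get_mbids_for_msid(msid):
    """Return the recording MBIDs that the given MSID has been clustered with."""
    return jsonify({"msid": str(msid), "mbids": db.get_recording_mbids(str(msid))})

@api_bp.route("/mbid/<uuid:mbid>/msids")
def get_msids_for_mbid(mbid):
    """Return all MSIDs that map onto the given recording MBID."""
    return jsonify({"mbid": str(mbid), "msids": db.get_recording_msids(str(mbid))})
```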

Apart from that, with the help of my mentor I set up a VM to test the above code on the MsB data dump. This task had some challenges: first I had to create indexes on various fields to speed up the clustering process. Without indexing it would have taken approximately 37 days, but after creating indexes on various fields it took just 3 hours. I found out that PostgreSQL also allows creating indexes on functions, which came in handy while creating artist_credit clusters, for which I wrote a custom function. The indexes were created in PR #53. When I ran the clustering code on a VM holding the whole MessyBrainz data dump, I found that some fields in the recording_json table which are supposed to store MBIDs contained empty strings. This was not supposed to happen, as ListenBrainz is currently the only source of data for MessyBrainz: users cannot submit to MessyBrainz directly, and ListenBrainz validates listens before submitting them. So those recordings must have been inserted before that validation was in place. To solve the problem I created PR #52.
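As a hedged example of the kind of index that makes this difference, PostgreSQL can index an expression or the result of a function call directly. The table, JSON field names and statements below are illustrative, not the exact ones from PR #53.

```python
# Hedged example of creating expression indexes with psycopg2; the table,
# JSON fields and statements are illustrative, not the exact ones from PR #53.
import psycopg2

with psycopg2.connect("dbname=messybrainz user=messybrainz") as conn:
    with conn.cursor() as cur:
        # Expression index on a field pulled out of the JSON blob, so lookups
        # by recording MBID no longer scan the whole recording_json table.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS recording_mbid_ndx_recording_json
                ON recording_json ((data ->> 'recording_mbid'))
        """)
        # PostgreSQL also allows indexing the result of a function call; the
        # real PR used a custom function for artist_credit clustering, and
        # lower() here is just a stand-in for that idea.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS artist_ndx_recording_json
                ON recording_json (lower(data ->> 'artist'))
        """)
```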

The summer was a great learning experience for me. I started slowly, as things were messy at the start and not everything was crystal clear to me. I wasn’t sure exactly how to write scripts that manipulate the database, so I wrote them in the most trivial way possible: running a query for every single MBID to check whether it is already present in the recording_cluster tables, and clustering the recording only if it isn’t. That is conceptually correct but not efficient by any means. The same thing can be done with a single query on the recording_json table that fetches only those recording MBIDs which are not present in the recording_redirect table, as those are the unclustered ones. That way we don’t reprocess recording MBIDs that have already been processed, which makes clustering efficient.
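That single-query approach can be expressed as an anti-join: select the recording MBIDs from recording_json that have no entry in recording_redirect yet, and cluster only those. A hedged sketch, with table and column names that only approximate the MessyBrainz schema:

```python
# Hedged sketch of the single-query approach: fetch only the recording MBIDs
# that are not clustered yet, instead of checking each MBID one by one.
# Table and column names only approximate the MessyBrainz schema.
import psycopg2

UNCLUSTERED_MBIDS_QUERY = """
    SELECT DISTINCT rj.data ->> 'recording_mbid' AS recording_mbid
      FROM recording_json rj
     WHERE (rj.data ->> 'recording_mbid') <> ''
       AND NOT EXISTS (
            SELECT 1
              FROM recording_redirect rr
             WHERE rr.recording_mbid::text = rj.data ->> 'recording_mbid'
           )
"""

def get_unclustered_recording_mbids():
    """Return recording MBIDs that do not have a cluster mapping yet."""
    with psycopg2.connect("dbname=messybrainz user=messybrainz") as conn:
        with conn.cursor() as cur:
            cur.execute(UNCLUSTERED_MBIDS_QUERY)
            return [row[0] for row in cur.fetchall()]
```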

With time I got an understanding of how clusters are created and how to handle anomalies, such as “James Morrison”, a name shared by more than one artist. In the end, an anomaly can be defined as an MSID that points to different MBIDs in the entity_redirect table (where entity can be artist_credit, recording, or release).

Work to be done ahead

The project is still in its initial stages and requires a lot of work before it can move into production. We still need to write integration tests for the ClusterWriter and the API endpoints. After that, we can work on the Additional Ideas that I proposed in my proposal. We also need to figure out some way to associate MBIDs with MSIDs for the artists, recordings, and releases where no MBIDs are present; that does not look like a trivial task, with so many anomalies to take care of.

The last three months have been a great experience for me. I would like to thank Robert Kaye, Param Singh, and Alastair Porter, who helped me solve a lot of the problems I encountered during the entire period. Working on their suggestions and reviews, I was able to write good-quality code that was efficient as well. The work culture at MetaBrainz inspired me a lot. At MetaBrainz we have weekly IRC meetings where we get to know what others at the organization are doing and also report what we did in the past week. I would like to thank MetaBrainz and Google for giving me this chance to get involved in open source on such a cool project. The association of MSIDs with MBIDs can be used by ListenBrainz, as stats are calculated on MSIDs, which can then be mapped onto MBIDs representing clean metadata. I would like to keep working on the project because of the learning opportunities it offers.

Picard 2.0.3 released: Crash-fixes and scripting improvements

This is a minor release that fixes a lot of crashes and Unicode errors on certain platforms. It also reverts a scripting improvement (PICARD-259) which had caused a couple of scripting bugs (PICARD-1207). Scripting now works exactly as it did in Picard 1.4.

As usual, you can find the latest downloads on Picard’s Website.

The change-log is as follows –

Release Notes – Picard – Version 2.0.3

Bug

  • [PICARD-1122] – Preferred release type settings are exclusive and should be inclusive
  • [PICARD-1207] – Move additional files feature fails when source directory contains non-ascii characters
  • [PICARD-1247] – Not all “preserved” tags are preserved
  • [PICARD-1305] – Search dialog crashes picard when record doesn’t have an album
  • [PICARD-1306] – picard crashes when opening the options dialog if the cwd doesn’t exist

New Feature

  • [PICARD-1289] – Allow manually running any tagger script

Improvement

  • [PICARD-1292] – MusicBrainz Picard 2.01 64-bit for windows installs to “C:\Program Files (x86)” by default
  • [PICARD-1302] – Dropping an image from Google image crashes picard
  • [PICARD-1303] – picard crashes when matching a cluster with a release with no tracks
  • [PICARD-1304] – Info dialog for album crashes because track doesn’t have a tracknumber

Regression

  • [PICARD-259] – Make file-specific variables available to tagger script

 

samj1912 out o/

 

Picard 2.0.2 released! Signed macOS builds

This is a minor release that fixes some crashes due to logging events and compatibility issues on Macs running on dual-core processors.

I would like to extend a word of thanks to Francois Ferrand and Ryan McKern, following whose advice, bitmap was able to successfully fix our macOS packaging and code-signing issues, details of which can be found in our recent CFH blog.

As usual, you can find the latest downloads on Picard’s Website.

The change-log is as follows –

Bug

  • [PICARD-342] – Picard is not properly signed for Mac OS X Gatekeeper
  • [PICARD-1212] – Picard 2.0.0dev4 crashing at startup
  • [PICARD-1300] – Picard crashes when logging lots of events

samj1912 out o/

Call for help: Picard 2.0 macOS packaging

Hello everyone,

As you might know, we recently released Picard 2.0 stable. One of the major problems with the macOS version is that it is very unreliable. It works perfectly on some systems and doesn’t on others with the same macOS version. See PICARD-1212 for an example.

Another major problem we are facing is code-signing Picard 2.0. In order to ensure that our macOS users have a seamless experience, we paid for an Apple dev account, but we are unable to code-sign Picard. See PICARD-1296.

If you have experience with either and are willing to help, please email us at support@metabrainz.org or join us on IRC in #metabrainz (freenode).

 

Picard 2.0.1 released! (Windows and macOS users rejoice)

Note – There are no changes for Linux users, so they can safely skip this release if they want.

Given the massive feedback about the shortcomings of the Windows and macOS versions of Picard, we decided to do a minor release addressing some of the issues with our executables.

As usual, you can find the latest downloads on Picard’s Website.

The change-log is as follows –

Bug-fix

  • [PICARD-1283] – Fingerprinting not working on macOS in Picard 2.0
  • [PICARD-1286] – Error creating SSL context on Windows

Improvement

  • [PICARD-1290] – Improve slow start up times by moving to a non single file exe
  • [PICARD-1291] – Use an installer for Picard 2.x windows exe

Basically, the Windows executable is now a proper installer and some missing SSL dependencies are bundled with it.

The macOS builds also include the missing AcoustID fingerprinting binary.

The startup time for both the Windows and macOS version has been improved as well.

Have fun tagging your files!

samj1912 signing off o/

 

Picard 2.0 released

Hey people, samj1912 here again o/

This time we are announcing the release of a new Picard!

Official MusicBrainz cross-platform music tagger Picard 2.0 is now out, containing many fixes and new features and much needed upgrades!

The last time we put out a major release was more than 6 years ago (Picard 1.0 in June of 2012), so this release comes with a major back-end update. If you’re in a hurry and just want to try it out, the downloads are available from the Picard website.

If you have been following our Picard-related blog posts, you will know that we switched up our dependencies a bit: Picard now requires Python 3.5 or newer, PyQt 5.7 or newer, and Mutagen 1.37 or newer. A side effect of this dependency bump is that Picard should look better and in general feel more responsive.

A couple of things to note – with Picard 2.0, the Picard Windows builds will be portable standalone binaries. Also, we will only be supporting 64-bit Windows officially, because of a lack of resources to build a 32-bit image. The macOS requirements were also bumped up for the same reason, with macOS 10.10 being the lowest supported version.

As such, Picard 1.4.2 will be the last version that supports both 32-bit Windows and macOS 10.7–10.10. You can find it in the Picard downloads section as well.

You can find a detailed change-log on the Picard website.

The highlights of this update are –

  • Retina and Hi-DPI display support
  • Improved performance
  • UI improvements

We would like to thank all contributors, from all around the world, who helped for this release: Laurent Monin, Sophist, Wieland Hoffmann, Vishal Choudhary, Philipp Wolfer, Calvin Walton, David Mandelberg, Paul Roub, Yagyansh Bhatia, Shen-Ta Hsieh, Ville Skyttä, Yvan Rivierre and also all of our translators!

Be aware that downgrading from 2.0 to 1.4 may lead to configuration compatibility issues – ensure that you have saved your Picard configuration before using 2.0 if you intend to go back to 1.4.

Note:  If you are facing errors while tagging releases on Windows, do take a look at this FAQ about SSL errors.

We’ve finally released our new Solr search (Server update, 2018-06-30)

The new search is live on MusicBrainz with this server update, as announced in the previous blog post. This release also continues the rewrite to React, and improves and fixes the handling of external URLs. The git tag is v-2018-06-30.

Sub-task

  • [MBS-9736] – Convert the artist search results page to React

Bug

  • [MBS-8334] – Digest auth with username containing non-ascii characters fails
  • [MBS-9730] – Cannot link to a Bandcamp Daily review page in release group relationships
  • [MBS-9734] – Inconsistency between the JSON search API and the lookup/browse one in ws/2/
  • [MBS-9742] – Some Library of Congress URLs are not recognised
  • [MBS-9743] – Beatport URL cleanup fails for names starting with digit

Improvement

  • [MBS-9408] – Add Juno Download links to the sidebar
  • [MBS-9439] – Make changing a URL between HTTP and HTTPS an autoedit
  • [MBS-9740] – Update Facebook URL cleanup

Many people thought this day would never come. 🙂

Releasing our new Solr search infrastructure

Hey folks, samj1912 here again o/

As you might know, we recently did a massive upgrade of our search infrastructure. If you have not been following our Solr updates, definitely check out our other blog detailing our search server journey and the improvements and changes that come with the new search.

We have had a beta run with Solr this last week and fixed most of the show-stopping bugs. We have also been stress testing our Solr search by replaying our production logs on it, live.

Solr search seems to solve almost all our qualms with search, and as such we have made the decision to use Solr for our production search servers.

The purpose of this blog post, as nicely worded by our BDFL Rob is –

Speak now or forever hold your pickle. In a week, the ole search servers gets it.

And it’s basically that: if you haven’t experimented with Solr search yet, please read our earlier blog post to know what’s what. If you find any bugs, please report them on our ticket tracker. Unless new show-stoppers are reported that absolutely must be fixed before we switch to Solr on the main website, we will be killing the old search servers and replacing them with our brand new Solr ones in a week.

Apart from that, we have created a Discourse thread where you can report any minor improvements needed in the search results.

Another thing I’d like to remind everyone of is that, with our switch to this new Solr infrastructure, the version 1 web service (ws/1) will soon be discontinued. As announced earlier, we will keep it alive until 31st July 2018, but it will get the axe on 1st August 2018 at 12 pm GMT.

MusicBrainz Search Overhaul

Hello people o/, samj1912 here.

I am extremely glad to announce that we are finally launching our Solr search on the MusicBrainz beta server!

Just a little history before I announce the new features and toys you get to play with:

Solr started as something that could replace our existing search infrastructure. If you have been a MusicBrainz user for a while, you might know that our search has quite a high indexing latency: it can take as much as 3 hours for new edits to show up in the search results, in part because updating the search index involved re-indexing the entire database. With the high latency and the resources it took, the current search server left much to be desired.

Another area where our current search fell short was popular results and search ranking. Searching for a famous artist or place returned results that contained a lot of noise and, more often than not, results that weren’t relevant to what the user had in mind when they searched for it.

These were the two major problems that motivated us to shift to a better infrastructure for our search needs.

Thus, MB-Solr was born.

It has been in development for quite some time now. The coding for the project started with Mineo back in 2014 and was carried forward by Jeff Weeksio in GSoC 2015. But due to a lack of development resources and other, more pressing needs, the project was put on hold for a while, until Roman started working on it. However, he left MetaBrainz before he could finish this work, so when I joined the MetaBrainz team, the first and foremost task assigned to me was getting Solr working and ready for production.

After struggling with multiple moving parts and services, tons of issues with maintaining compatibility with our existing web-service API, rowing up and down multi-threading/processing hell, learning just enough about information retrieval to get our search relevance on point and countless hours sifting through Solr documentation to get our Solr cluster fine-tuned and running fast enough to keep up with our web traffic… we are finally here.

I am pretty sure I would’ve rage-quit dozens of times during this last year if I was doing this all alone.

As such, we have our trusty sysadmin Zas to thank for taking care of all the deployment needs and making sure Solr was well-tested (believe me, we toyed with Solr like little kids in a sandbox) and wasn’t going to fail and wake him up at 3 AM with red alerts all over. Mineo, Bitmap and Yvanzo were there with much-needed code reviews and help with all things Solr and MusicBrainz. Our style leader Reosarevok and CatQuest helped us test our new search relevance configuration. And of course, we had our BDFL, Rob, overseeing things and whipping them into shape (with chocolate and mismatched socks of course).

Anyway, here’s what you are here for:

New features/improvements

  • (Almost) Instantaneous search-index updates – Edit something and immediately see it in the search results. Say goodbye to that note you used to see below the search telling you that you have to wait. Who likes waiting anymore – seriously, it’s 2018.
  • Better search results – We wanted to make sure you were getting the right Queen and London as the top result. You can finally link your favorite artist to London, UK as opposed to London, Arkansas. Don’t believe me? Go try it out.
  • Less load on our servers – Meaning we can serve more of your requests, faster. Getting tired of waiting for tagging your bajillion songs in Picard? Well, you still gotta wait, but less so, now that we are better equipped to handle your requests.

What has stayed the same

  • WS/2 Search API – We know you devs hate doing that extra work to maintain your applications’ compatibility with that one site that changes its API on a whim. Well, we wouldn’t want you to spend those hours following that one int to float change that broke everything ever. As such we have worked hard to make sure that Solr doesn’t change any of our WS/2 search schema.

What’s gone

  • WS/1 Search API – We deprecated WS/1 back in 2011. With the new search servers in place, there are only three words for those still using it 7 years after its deprecation – ‘poof, it’s gone’. The service still works on our main website, but its search functionality will be phased out soon, and the entire service will be discontinued in August 2018, as announced earlier.

Now, you must be thinking there is some catch, some slip. Well so do I, which is why we are releasing this beta for you to test the heck out of our new search over at the MusicBrainz beta site. If you haven’t used it before, worry not – it has all your personalizations and all our cool music metadata from our main site. You should feel at home. (Note: The MusicBrainz beta site works on the live data. Any edits you make on the MusicBrainz beta site will also be reflected on the main site.)

So please! Go check it out!

If you feel you aren’t getting what we promised you, or you want more of those shiny new features, or you think this blog post was too long or read like a TV commercial, feel free to complain at our Ticket Tracker for Solr. You get your promised features bug-free and our devs get to earn their living. It’s a win-win.

Happy testing!