Here comes an end to a fantastic summer for this year and time to wrap up my GSoC project which I have been working in for the last 3 months (the official GSoC coding period).
I am Rashi Sah, an undergraduate student at the National Institute of Technology, Hamirpur, India. I have been working on a really cool AcousticBrainz project for MetaBrainz Foundation Inc. as a participant in Google Summer of Code ‘18. It has been an amazing experience and I’ve learned a lot over the summer, spending countless days and nights to successfully take the project to the stage of completion. I decided to contribute to MetaBrainz in late December, then spent some time understanding the codebase of the project and then began creating pull requests and pushing commits for many features, tasks and fixing bugs since January 2018. This blog post consists of my GSoC experience as a student and the work I’ve done for the program so far.
Before starting the GSoC program, I started looking for some good-first-bugs initially and found some tickets to work on. Then I talked to the AcousticBrainz community members and started contributing. I created some big PRs mostly for adding new features to AcousticBrainz. I also worked on many bug fixes which are already merged into the AcousticBrainz codebase. New feature additions PRs include AB-21, AB-98 and AB-298. In mid‐February, I started looking for a suitable idea to work on for GSoC program and to create a proposal for the same. As the month of March was approaching, I did a lot of proposal discussion with MetaBrainz community members especially with Alastair, AcousticBrainz project lead who has helped me a lot in reviewing and guiding me to improve my proposal to a better extent. Later April, my proposal for a more detailed integration of AcousticBrainz with MusicBrainz got accepted. In the community bonding period, I mostly tried to continue my work which I was already doing for the past 3–4 months.
Getting entity information from the MusicBrainz database
The first thing I worked on when the official GSoC coding period began was adding a way to directly access MusicBrainz database for different entities to the MusicBrainz database module in BrainzUtils (a Python utility for all of our MetaBrainz projects). I worked on getting artist and release entity information from the MusicBrainz database via a direct connection. (See PRs BU-13 and BU-14.) Later, I worked on setting up the MusicBrainz server by adding a service in AcousticBrainz’s docker-compose files allowing us to easily read data directly from the MusicBrainz database in AcousticBrainz (PR AB-334). Our major aim of the project was to implement both the methods of MusicBrainz database access in AcousticBrainz especially importing the MusicBrainz database in AcousticBrainz from scratch and then to decide which methods works better while implementing a particular functionality in AcousticBrainz using MusicBrainz data.
Import the MusicBrainz data in AcousticBrainz database
MusicBrainz’s database contains a huge number of tables, but I analysed the use case of MB data in AB and made a list of those tables that we would actually require in our AcousticBrainz integrations. Then I made a PR (AB-338) for creating new tables in the AB database under the MusicBrainz schema. Later, I worked on a big PR (AB-340) which imports MB data corresponding to each and every recording present in AcousticBrainz’s database and writes the data into the tables of the MusicBrainz schema in AB. This PR was really huge and I had to take care of a lot of integrity constraints and foreign key dependencies.
Update MB data in AB for every new recording added to AB
Another feature I worked on after importing the MB data was updating the MB data present in AB whenever any new recording is added to the AcousticBrainz database (see PR AB-346) by importing the data from MB’s database via the direct connection. While working on a few bug fixes, I and my mentor, Param realized that the MB data import is taking a lot more time than expected when I applied the MusicBrainz importer script for full MB data dumps (of around 2.8 GB). So, I then worked on making the MusicBrainz importer more efficient and was able to import the data for few recordings within seconds (see PR AB-348). I had to figure out a lot for each table import and to detect the parts of the code which were making things slower.
To reduce the load on the processor, I included a sleep schedule of 5 seconds in the MusicBrainz importer module to wait before importing data for any new recording (see PR AB-354). During my GSoC period, I learned how important it is to write tests and make them run fast. I wrote tests for almost every script inside the db module. Later, I worked on writing tests for the MusicBrainz importer script (AB-352).
Apply replication packets to keep MB data in AB updated with the actual MusicBrainz database
Then came another tricky part of this project which was to update the MusicBrainz schema data in AB whenever there is any change in the actual MusicBrainz database whether it is an update or a deletion taking place. MusicBrainz provides hourly replication packets which describe the changes to the database in a specific period. Replication packets are .tar.bz2 archives with a collection of files in them which can be downloaded via the MetaBrainz API. Lukas Lalinsky, a long-time contributor to MetaBrainz projects, the founder of AcoustID and maintainer of the mbdata Python module, had worked on implementing replication packets on MB data. I did a lot of modifications in his script to apply replication packets to the MusicBrainz schema data till it’s recent update for the recordings data present in AcousticBrainz (see AB-350).
Integration with MB database: Use MBID redirect information to get original entity
After working on the direct connection and importing the MusicBrainz data, keeping it updated by all means, it was time to start working on writing evaluation scripts to decide the better method for any integration we apply in AcousticBrainz. I wrote a script to implement an integration in AB with MB database to use the redirect information of an entity and then returns the original entity corresponding to the MBID provided (see PR AB-356).
Evaluate both methods of MusicBrainz database access in AcousticBrainz
Now moving towards the last work of my GSoC period and the most important as well. After working on both the methods, we really needed to evaluate both in order to test which one is more efficient for any specific integration with the MB database. I first wrote an evaluation script which fetches the data from the recording and low-level tables. For this case, the difference between the time taken by both methods comes out to be really large (approx. 70 seconds for around 250+ recordings). So whenever we would have to get the data from local AB tables and MB tables as well, we would go for the import database method as this method turns out to be faster than the other one. Next I tested with the MBID redirect integration part in which I didn’t find much difference between both the methods (PR AB-357). But I ran these tests locally, the tests in production may yield different results.
All in all, it has been an exciting summer. By this time I am familiar with a very good part of the AcousticBrainz codebase. I really look forward to work on adding a lot more integrations with MB data in AcousticBrainz and plan to completely remove AB’s dependency over the web service to use the MusicBrainz database which would be very useful for the users.
Details of contributions made
By the end of the GSoC coding period, I have opened a total of 39 PRs of which 35 are pull requests to the AcousticBrainz server, 3 are pull requests to BrainzUtils and 1 pull request to the AcousticBrainz client and have made a total of 135 commits (109 in AB, 9 in BU, 3 in AC and 14 in AB master) and out of them, pull requests created and merged during the official GSoC coding period are PRs to AcousticBrainz server and PRs to Brainzutils.
These last three months were full of thrill, excitement and much frustration as well. And this doesn’t end here, I’d love to contribute in the future and act as a maintainer for the AcousticBrainz project. I believe people must try to contribute to open source organizations as it helps you learn and gain much experience in a short period of time especially when working for a great platform like Google Summer of Code.
I am really happy working with the awesome MetaBrainz community and the people here are fantastic. I’d love to stay being a part of MetaBrainz in future as well. So in the end a big thanks to my mentor Param Singh, without his help & support throughout the program, wouldn’t have been possible for me to reach the end phase of GSoC, and my organization admin Robert Kaye, AcousticBrainz project lead Alastair Porter and all of the MetaBrainz Foundation community members for choosing me as a GSoC student and thus providing me such a great opportunity and also for being very kind and helpful throughout the program. And I want to thank Google for making this all possible. Hope I get a chance to work with you all again!!
4 thoughts on “GSoC 2018: More detailed integration of AcousticBrainz with MusicBrainz”
Well written and well done Rashi.
One of my issues has related to the correct analysis of eg. video tracks that have been FLACked or MP3ed. As a user I wish to have these correctly tagged (and AcoustIDed).
There’s a danger when cleaning data for processing that valuable information on exceptional tracks may be lost.
All the best for your future anyway. WC