Moving AcousticBrainz to Hetzner

Hi, all. I worked on the recent migration of AcousticBrainz to the central Hetzner infrastructure that hosts all our other projects. It was a fun experience that I would like to share on this blog.

This was the first time I had worked with a production database of this scale, and it was a real learning experience. It felt like jumping in at the deep end, but it was a lot of fun!

For those who don’t know, AcousticBrainz is a music technology project that crowdsources acoustic information for music recordings. It is a collaboration between the Music Technology Group at Universitat Pompeu Fabra and MetaBrainz. AcousticBrainz has already collected information about 3.7 million unique recordings and has individual submissions from users for over 11 million recordings.

All the data is stored in a single PostgreSQL database for now. The server that AcousticBrainz used to run on (we called it spike, after the Tom and Jerry character) had gotten old and started spitting out hard disk failure warnings, so we decided to move it to the central Hetzner infrastructure where other MetaBrainz projects are hosted.

We use Docker for all services running in Hetzner and it has worked pretty well for us so far, so the first task was creating a production Docker environment for AcousticBrainz. Consul provides configuration values for the AcousticBrainz server, which meant writing some new code and Consul template files. This was relatively simple stuff that did not take too long. We also have a repository that stores all the configuration values and scripts that need to be run on each of our servers, so I also wrote code there to run the three different services that AcousticBrainz needs in separate Docker containers.

After that, I started work on creating data dumps of the AcousticBrainz data. There was already some code that dumped the entire database into an lzma-compressed file. However, it was old code that hadn’t been run in a long time, and the database had grown a lot since then. The way the code worked was that it dumped each table as a file into a directory and then added the entire directory at once into a tar file. This approach doesn’t work any more, because the table that stores the low-level JSON data that users submit to us has become too big to be stored uncompressed in a single text file. The lowlevel_json table currently has 11 million rows, each containing a relatively large JSON document stored in a column of Postgres’ cool JSONB type. The table takes around 357 GB when stored inside Postgres, and that balloons to well over the space we had on spike when dumped as plain text. So, I wrote some code that dumped 500,000 rows into a file and compressed it before dumping the next 500,000 rows.
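
A minimal sketch of that chunked approach, assuming psycopg2 and illustrative table, column, and file names (the real dump code differs in the details):

```python
import lzma
import psycopg2

CHUNK_SIZE = 500_000  # rows per compressed file, as described above

def dump_lowlevel_json(dsn, out_prefix="lowlevel-json"):
    """Dump the lowlevel_json table in fixed-size chunks, compressing each
    chunk before the next one is read, so the full uncompressed table never
    has to sit on disk at once."""
    conn = psycopg2.connect(dsn)
    # A named (server-side) cursor streams rows instead of fetching them all.
    with conn.cursor(name="lowlevel_json_dump") as cur:
        cur.itersize = 10_000
        cur.execute("SELECT id, data::text FROM lowlevel_json ORDER BY id")
        part, rows_in_part, out = 0, 0, None
        for row_id, data in cur:
            if out is None:
                out = lzma.open(f"{out_prefix}-{part:04d}.tsv.xz", "wt")
            out.write(f"{row_id}\t{data}\n")
            rows_in_part += 1
            if rows_in_part == CHUNK_SIZE:
                out.close()
                part, rows_in_part, out = part + 1, 0, None
        if out is not None:
            out.close()
    conn.close()
```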

The compressed AcousticBrainz data dump was around 169 GB in size, which seemed reasonable. Then I realized that the server we were planning to run the webserver on (called boingo, after Oingo Boingo) did not have enough storage space or computational power to hold and work with the database. This led to us getting a shiny new server called frank (after Frank Ocean!), which has a pretty big 7200 RPM hard disk and over 100 GB of RAM. We also decided to upgrade to PostgreSQL 10 during the migration, which meant creating a Docker image for PostgreSQL 10 that we could use in production.

After this, I imported the data into the empty Postgres server, which worked pretty well. Everything seemed set for a small downtime for the migration: we’d create a small incremental data dump, move it to frank and import it, bring spike down, bring the webserver up on frank, and be done with it. The steps were written up and we were ready to go.

Things started: I brought the site down on spike, created an incremental dump, and imported it to frank. Everything worked. We decided to do an integrity check of the new database before bringing the new site up. This is where the trouble started. The number of rows in one of the tables was 10 million when it should have been around 100 million. Yikes. We realized that there had been a bug in the original data dump code we’d written. It was a pretty small bug: the key we were using to dump the data was incorrect. A one-line fix. Note to self: we need more tests for our data dump code.

Well, at that point we decided to just go ahead and dump and import that table individually instead of stopping the whole process. The downtime was much longer than expected because of this: the table was pretty big, 100 million rows is no joke, and it took pg_dump hours to dump it. Then I dropped the table on frank and began an import of the dumped file. We had decided not to drop constraints before importing, for sanity reasons, but that turned out not to be such a good idea. The import had been running for 5–6 hours before it was even halfway done, and the time to import new rows kept increasing. We gave up, stopped the import, and dropped all constraints before starting a new clean import. This worked much, much faster and was done in around an hour. At that point, we did another sanity check of the database before bringing the site back up.
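
For illustration, this is roughly the pattern that ended up working: drop the foreign-key constraints, COPY the data in, then re-add the constraints so they are validated in one pass instead of row by row. The constraint and column names here are hypothetical, not the actual AcousticBrainz schema:

```python
import psycopg2

# Hypothetical constraint definition; the real schema has its own names.
CONSTRAINTS = [
    ("lowlevel_json", "lowlevel_json_fk_lowlevel",
     "FOREIGN KEY (id) REFERENCES lowlevel (id)"),
]

def bulk_import(dsn, dump_file):
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        # Drop constraints so COPY doesn't validate every row individually.
        for table, name, _definition in CONSTRAINTS:
            cur.execute(f"ALTER TABLE {table} DROP CONSTRAINT IF EXISTS {name}")
        with open(dump_file) as f:
            cur.copy_expert("COPY lowlevel_json (id, data) FROM STDIN", f)
        # Re-adding the constraint checks everything in a single pass, which is
        # far cheaper than the per-row checks that happen during the import.
        for table, name, definition in CONSTRAINTS:
            cur.execute(f"ALTER TABLE {table} ADD CONSTRAINT {name} {definition}")
    conn.close()
```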

Some static files, like binaries and old dumps we linked to, were still hosted on spike (another thing I had missed!), so I had to whip up a quick pull request temporarily changing the links. I was doing this at 3 in the morning, having started work at 11 the previous morning. It was the longest, most intense production deployment I have ever done. Pretty fun, now that I think about it, but I was tired then.

Later, I set up an FTP server on frank and moved the static files we were hosting there.

There were a lot of things that I learned in this entire process. The first was that we should sanity check literally everything before bringing any production service down. The second was that importing large amounts of data into a database with constraints enabled is not really feasible. The third is that this level of control is not something I would ever get as a new grad at any big company; being thrown in at the deep end here at MetaBrainz was really awesome. Another thing I forgot to mention is that the entire migration was done remotely over IRC, with me sitting in college in Hamirpur, India and my teammates in Barcelona. That really teaches efficient communication and teamwork.

In hindsight, there are a few things that I’d do differently given the chance again. I’d definitely have sanity checked the imported database before actually going through with the downtime; it would have saved a lot of pain and the downtime would have been much shorter. This is the biggest thing I learned from the migration process: sanity check as often as possible.

All in all, working with production grade big data projects has been pretty awesome, and I hope I continue to learn as much as possible as early as possible.

AcousticBrainz at the 2018 MetaBrainz Summit

We had an in-person meeting at the MTG during the MetaBrainz summit to discuss the status and future of AcousticBrainz. We came up with a rough outline of things that we want to work on over the next year or so. This is a small list of tasks that we think will have a good impact on the image of AcousticBrainz and encourage people to use our data more.

State of AcousticBrainz

AcousticBrainz has a huge database of submissions (over 10 million now, thanks everyone!), but we are currently not using the wealth of data to our advantage. For the last year we’ve not had a core developer from MetaBrainz or MTG working on existing or new features in AcousticBrainz. However, we now have:

  • Param, who is including AcousticBrainz in his role with MetaBrainz
  • Rashi, who worked on AcousticBrainz for GSoC and is going to continue working with us
  • Philip, who is starting a PhD at MTG, focused on some of the algorithms/data going into AcousticBrainz
  • Alastair, who now has more time to put towards management of the project

Because of this, we’re glad to present an outline of our next tasks for AcousticBrainz:

Short-term

Some small tasks that are quick to finish and that we can use to show off uses of the data in AcousticBrainz.

Merge Philip’s similarity, including an API endpoint

Philip’s master’s thesis project from last year uses PostgreSQL search to find recordings that are acoustically similar to a target recording, using the features in AcousticBrainz. We need to ensure that PostgreSQL can handle the scale of data that we have.

An extension of this work is to use the similarity to allow us to remove bad duplicate submissions: we can take all recordings with the same MBID and see if they are similar to each other; if one is not similar, we can assume that it’s not actually the same as the other duplicates and mark it as bad. We want to make these results available via an API too, so that others can check this information as well.
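
A toy sketch of the duplicate check, assuming each submission’s features have already been reduced to a numeric vector, and using plain Euclidean distance with a made-up threshold (the actual similarity work defines its own metric):

```python
import numpy as np

def flag_bad_duplicates(submissions, threshold=0.5):
    """Given {submission_id: feature_vector} for every submission sharing
    one MBID, flag submissions that are dissimilar to all the others."""
    bad = set()
    for sid, vec in submissions.items():
        others = [v for other_id, v in submissions.items() if other_id != sid]
        if others and all(np.linalg.norm(vec - other) > threshold for other in others):
            bad.add(sid)
    return bad

# Example: two near-identical submissions and one outlier for the same MBID.
subs = {
    "a": np.array([0.10, 0.20]),
    "b": np.array([0.11, 0.19]),
    "c": np.array([5.00, 9.00]),
}
print(flag_bad_duplicates(subs))  # {'c'}
```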

Merge Existing PRs

We have many great PRs from various people which Alastair didn’t merge over the last year. We’re going to spend some time getting these patches merged to show that we’re open to contributions!

Publish our Existing models

In research at the MTG we’ve come up with a few more detailed genre models, based on tag/genre data that we’ve collected from a number of sources. We believe that these models can be more useful than the current genre models that we have. The AcousticBrainz infrastructure supports adding new models easily, so we should spend some time integrating these. There are a few tasks that need to be done to make sure that these work:

  • Ensure that high-level dumps will dump this new data (If we have an existing high-level dump we need to make a new one including the new data)
  • Ensure that we compute high-level data for all old submissions (we currently don’t have a system to go back and compute high-level data for old submissions with a new model; the high-level extractor has to be improved to support this; see the sketch below)
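
A very rough sketch of the kind of backfill query this would need: find the submissions that have no high-level row for a given model, so the extractor can be pointed at them. The table and column names are guesses at the shape of the schema, not the real thing:

```python
import psycopg2

# Hypothetical schema: a lowlevel table of submissions and a highlevel_model
# table with one row per (submission, model) pair.
MISSING_FOR_MODEL = """
    SELECT ll.id
      FROM lowlevel ll
     WHERE NOT EXISTS (
           SELECT 1 FROM highlevel_model hlm
            WHERE hlm.highlevel = ll.id AND hlm.model = %s)
"""

def submissions_missing_model(dsn, model_id):
    """Return the ids of submissions that still need high-level data
    computed for the given model."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(MISSING_FOR_MODEL, (model_id,))
        return [row[0] for row in cur.fetchall()]
```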

Update/fix some pages

We have a number of issues reported about unclear text and grammar on some pages that we can improve. Especially important are:

  • API description (we should remove the documentation from the main website and just have a link to the ReadTheDocs page)
  • Front page (Show off what we have in the project in more detail, instead of just a wall of text)
  • Data page (instead of just showing tables of data, try and work out a better way of presenting the information that we have)

Fix Picard plugin

When AB was down during our migration, our API pages were serving HTML, which caused Picard to crash when the AB plugin tried to fetch AB data. This should be an easy fix in the Picard plugin.

High Impact

These are tasks that we want to complete first, that we know will have a high impact on the quality of the data that we produce.

Frame-level data

We want to extract and store more detailed information about our recordings. This relies on work being done at the MTG to develop a new extractor that will let us get more detailed information. It will also give us improvements to data that we have in AB that we know is bad. This data is much bigger than our current data when stored as JSON (hundreds of times larger), so we need to develop a more efficient way of storing submissions. This could involve storing the data in a well-known binary data exchange format. A bunch of subtasks for this project:

  • Finish the essentia extractor software
  • Decide on how to store items on the server (file format, store on disk instead of database)
  • Work out a way to deal with features from two versions of the extractor (do we keep accepting old data? What happens if someone requests data for a recording for which we have the old extractor data but not the new one?)
  • Upgrade clients to support this (Change to HTTPS, change to the new API URL structure, ensure that clients check before submission if they’re the latest version, work out how to compress data or perform a duplicate check before submission)
  • Deduplication (If we have much larger data files, don’t bother storing 200 copies for a single Beatles song if we find that we already have 5-10 submissions that are all the same)

MusicBrainz Metadata

Rashi’s GSoC project in 2018 helped us to replicate parts of the MusicBrainz database into AcousticBrainz. This allows us to do amazing things like keep up-to-date information about MBID redirects, and do search/browse/filtering of data based on relationships such as Artists just by making a simple database query. We want to merge this work and start using it.

Dumps

When we changed the database architecture of AcousticBrainz in 2015 we stopped making data dumps, making people rely on using the API to retrieve data. This is not scalable, and many people have asked for this data. We want to fix all of the outstanding issues that we’ve found in the current dumps system and start producing periodic dumps for people to download.

Build more models

In addition to the existing models that we’ve already built (see above, “Publish our Existing models”), we have been collecting a lot of metadata that we could use to make even more high-level models which we think will be of value to the community. We want to build these models and publicly release them, using our current machine learning framework.

Wishlist

These are tasks that we want to complete that will show off the data that we have in AcousticBrainz and allow us to do more things with the data, but should come after the high-impact tasks.

Expose AB data on MusicBrainz

As part of the process of cross-pollinating the Brainz projects, we want to be able to show a small, trusted subset of AB data on the MB website. This could include information such as BPM, key, and results from some of our high-level models.

Improve music playback

On the detail page for recordings we currently have a simple YouTube player which tries to find a recording by doing text search. We want to improve the reliability and functionality of this player to include other playback services and take advantage of metadata that we already have in the MusicBrainz database.

Scikit-learn models

The future of machine learning is moving towards deep learning, and our current high-level infrastructure, written with the MTG’s custom Gaia project, is preventing us from applying improved machine learning algorithms to the data that we have. We would like to rewrite the training/evaluation process using scikit-learn, a well-known Python library for general machine learning tasks. This will make it easier for us to take advantage of improvements in machine learning, and also make our environment more approachable to people outside the MusicBrainz community.
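
As a toy illustration of the direction (not a design), the SVM training and cross-validation step that Gaia handles today maps onto only a few lines of scikit-learn; X and y here stand in for a feature matrix and class labels built from one of our datasets:

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate_dataset(X, y):
    """Cross-validate an SVM classifier on a dataset, roughly the role the
    Gaia-based high-level training plays today."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    return cross_val_score(model, X, y, cv=5)
```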

Dataset editor improvements

Part of the high-level/machine learning process involves making datasets that can be used to train models. We have a basic tool for building datasets, however it is difficult to use for making large datasets. We should look into ways of making this tool more useful for people who want to contribute datasets to AcousticBrainz.

Search

With the integration of the MusicBrainz database into AcousticBrainz, we will be able to let people search for metadata related to items which we know only exist in AcousticBrainz. We think that this is a good way for people to explore the data, and also for people to make new datasets (see above). We also want to provide a way that lets people search for feature data in the database (e.g. “all recordings in the key of Am, between 100 and 110 BPM”).
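
As a rough illustration of the feature-search idea, and assuming purely for the example that low-level documents sit in a JSONB column with Essentia-style paths such as tonal.key_key and rhythm.bpm, the “Am, 100–110 BPM” query could look something like this:

```python
import psycopg2

# Assumed layout for the sake of the example: `lowlevel` holds the MBID and
# `lowlevel_json.data` holds the JSONB document submitted by the client.
QUERY = """
    SELECT ll.gid
      FROM lowlevel ll
      JOIN lowlevel_json llj ON llj.id = ll.id
     WHERE llj.data #>> '{tonal,key_key}' = %s
       AND llj.data #>> '{tonal,key_scale}' = %s
       AND (llj.data #>> '{rhythm,bpm}')::float BETWEEN %s AND %s
"""

def recordings_by_key_and_tempo(dsn, key="A", scale="minor", bpm_min=100, bpm_max=110):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY, (key, scale, bpm_min, bpm_max))
        return [row[0] for row in cur.fetchall()]
```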

API updates

As part of the 2018 MetaBrainz summit we decided to unify the structure of the APIs, including root path and versioning. We should make AcousticBrainz follow this common plan, while also supporting clients who still access the current API.

We should also fall more in line with the MetaBrainz policy on API access, including User-Agent reporting, rate limiting, and API key use.

Request specific data

Many services that use the API only need a very small piece of information about a specific recording, so it’s often not efficient to return the entire low-level or high-level JSON document. It would be nice for clients to be able to request specific field(s) for a recording. This ties in with the “Expose AB data on MusicBrainz” task above.

Everything else

Fix all our bugs and make AcousticBrainz an amazing open tool for MIR research.


Thanks for reading! If you have any ideas or requests for us to work on next please leave a comment here or on the forums.

MusicBrainz introducing: Genres!

One of the things various people have asked MusicBrainz for time and time again has been genres. However, genres are hard to do right, and they’re very much subjective, while MusicBrainz deals almost exclusively with objective data. It’s been a recurring discussion at almost all of our summits, but a couple of years ago (with some help from our friend Alastair Porter and his research colleagues at UPF) we finally came to a path forward, and recently Nicolás Tamargo (reosarevok) turned that path into code… which has now been released! You can see it in action on, e.g., Nine Inch Nails’ Year Zero release group.

How does it work?

For now, genres are exactly the same as (folksonomy) tags behind the scenes; some tags have simply become the chosen ones and are listed and presented as genres. The list of which tags are considered genres is currently hardcoded, and it is no doubt missing a lot of our users’ favourite genres. We plan to expand the genre list based on your requests, so if you find a genre missing from it, request it by adding a style ticket with the “Genres” component.

As we mentioned above, genres are very subjective, so just like with folksonomy tags, you can upvote and downvote genres you agree or disagree with on any given entity, and you can also submit genre(s) for the entity that no one has added yet.

What about the API?

A bunch of the people asking for genres in MusicBrainz have been application developers, who are usually more interested in how to actually extract the genres from our data.

The method to request genres mirrors that of tags: you can use inc=genres to get all the genres everyone has proposed for the entity, or inc=user-genres to get all the genres you have proposed yourself (or both!). For the same release group as before, you’d want https://musicbrainz.org/ws/2/release-group/3bd76d40-7f0e-36b7-9348-91a33afee20e?inc=genres+user-genres for the XML API and https://musicbrainz.org/ws/2/release-group/3bd76d40-7f0e-36b7-9348-91a33afee20e?inc=genres+user-genres&fmt=json for the JSON API.
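
For example, here’s a quick, unofficial Python sketch against the JSON API. The genre objects are assumed to mirror how tags are returned (a name plus a vote count), so check the actual response; and as always, send a descriptive User-Agent:

```python
import requests

MBID = "3bd76d40-7f0e-36b7-9348-91a33afee20e"  # the Year Zero release group from above
url = f"https://musicbrainz.org/ws/2/release-group/{MBID}"

resp = requests.get(
    url,
    params={"inc": "genres", "fmt": "json"},
    # Identify your application, per the MusicBrainz API guidelines.
    headers={"User-Agent": "genre-example/0.1 (you@example.com)"},
)
resp.raise_for_status()

for genre in resp.json().get("genres", []):
    print(f"{genre['name']}: {genre['count']} vote(s)")
```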

Since genres are tags, all the genres will continue to be served with inc=tags as before as well. As such, you can always use the tag endpoint if you would rather filter the tags by your own genre list rather than follow the MusicBrainz one, or if you want to also get other non-genre tags (maybe you want moods, or maybe you’re really interested in finding artists who perform hip hop music and were murdered – we won’t stop you!).

I use the database directly, not the API

You can parse the genre list from entities.json in the root of the “musicbrainz-server” repository, which will give you what we currently consider genres. Then you can simply compare the folksonomy tags from the %_tag tables against that list.
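
A rough sketch of that, using release_group_tag as a stand-in for whichever %_tag table you care about; how the genre list is laid out inside entities.json is an assumption you’d need to check yourself:

```python
import json
import psycopg2

# How the genre list is stored inside entities.json is an assumption here;
# adjust the extraction to match the actual file.
with open("entities.json") as f:
    GENRES = set(json.load(f)["genres"])

TAG_QUERY = """
    SELECT t.name, rgt.count
      FROM release_group_tag rgt
      JOIN tag t ON t.id = rgt.tag
     WHERE rgt.release_group = %s
"""

def genres_for_release_group(dsn, release_group_id):
    """Return only the folksonomy tags on a release group that appear in the genre list."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(TAG_QUERY, (release_group_id,))
        return [(name, count) for name, count in cur.fetchall() if name in GENRES]
```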

Note about licensing

One thing to keep in mind for any data consumers out there is that, as per our data licensing, tags—and thus also genres—are not part of our “core (CC0-licensed) data”, but rather part of our “supplementary data” which is available under a Creative Commons Attribution-ShareAlike-NonCommercial license. Thus, if you wish to use our genre data for something commercial, you should get a commercial use license from the MetaBrainz Foundation. (Of course, if you’re going to provide a commercial product using data from MusicBrainz, you should always sign up as a supporter regardless. :)).

The future?

We are hoping to get better coverage of genres (especially genres outside the Western tradition, of which we have very few right now) with your help! That applies both to expanding the genre list and to actually applying genres to entities. For the latter, remember that everyone can downvote your genre suggestion if they don’t agree, so don’t think too much about “what genres does the world think apply to this artist/release/whatever”. Just add what you feel is right; if everyone does that, we’ll get much better information. 🙂

In the near future we’re hoping to move the genre list from the code to the database (which shouldn’t mean too much for most of you, other than less waiting between a new genre being proposed for the list and it being added, but is much better for future development). Also planned is a way to indicate that several tags are the same genre (so that if you tag something as “hiphop”, “hip hop” or “hip-hop” the system will understand that’s really all the same). Further down the line, who knows! We might eventually make genres into limited entities of a sort, in order to allow linking to, say, the appropriate Wikidata/Wikipedia pages. We might do some fun stuff. Time will tell!

Server update, 2018-11-01

This release includes a first implementation of genres – expect more information as a blog post in the following days. The search results page has been converted to React for every type of search except the edit search, which is implemented separately. The homepage, the ISWC page and the sidebars have been converted too. Additionally, the password hashes have been strengthened, external URL handlers have been updated as usual, and ten bugs have been fixed. Thanks to issue reporters chirlu, darwinx0r, jesus2099, hibiscuskazeneko, ravenworks, spellew, yeeeargh, and zastai for their input.  The git tag is v-2018-11-01.

Sub-task

  • [MBS-9604] – Convert ISWC index page to React
  • [MBS-9813] – Convert the homepage to React
  • [MBS-9830] – Convert taglookup templates to React
  • [MBS-9850] – Convert sidebars to React
  • [MBS-9877] – Convert all search results pages to React

Bug

  • [MBS-8062] – Event indexed search doesn’t show artists/location/date
  • [MBS-9422] – Attach CD TOC page shows all tracks in tracklists as deleted
  • [MBS-9438] – Event dates and times display as “(Object object)” in drop-down search results
  • [MBS-9641] – Label-Series relationship does not show on series page
  • [MBS-9788] – Series edit history indicates wrong entity type
  • [MBS-9804] – Label dates do not appear on direct search
  • [MBS-9816] – Cannot change password if logged in with a different letter-cased username
  • [MBS-9817] – Regression: Client-side rendered components are not localized
  • [MBS-9846] – Beatport URL with numeric slug cannot be linked to
  • [MBS-9856] – Regression: Buggy formatting on some web search results

New Feature

  • [MBS-9492] – Add genres as a subset of tags

Task

  • [MBS-9208] – Increase bcrypt cost parameter
  • [MBS-9721] – Add GDPR links
  • [MBS-9763] – Normalize setlist.fm URLs to HTTPS
  • [MBS-9766] – Normalize worldcat.org URLs to HTTPS

Improvement

  • [MBS-9210] – Re-hash passwords on login
  • [MBS-9503] – Add a link to AcousticBrainz on the Details tab of recordings
  • [MBS-9764] – Allow adding events from places and artists