Moving AcousticBrainz to Hetzner

Hi, all. I worked on the recent migration of AcousticBrainz to the central Hetzner infrastructure that hosts all our other projects. It was a fun experience that I would like to share on this blog.

This was the first time I had worked with a production database of this scale, and it was a real learning experience. It felt like I had jumped in at the deep end, but it was really fun!

For those who don’t know, AcousticBrainz is a music technology project that crowdsources acoustic information for music recordings; it is a collaboration between the Music Technology Group at Universitat Pompeu Fabra and MetaBrainz. AcousticBrainz has already collected information about 3.7 million unique recordings and has individual submissions from users for over 11 million recordings.

All the data is stored in a single PostgreSQL database for now. The server that AcousticBrainz used to run on (we called it spike, after the Tom and Jerry character) had gotten old and started spitting out hard disk failure warnings, so we decided to move it to the central Hetzner infrastructure where other MetaBrainz projects are hosted.

We use Docker for all services running in Hetzner, and it has worked pretty well for us so far. So the first task was creating a production Docker environment for AcousticBrainz. We use Consul to provide configuration values to the AcousticBrainz server, which meant writing some new code and Consul template files. This was relatively simple stuff that did not take too long. We also have a repository that stores all the configuration values and scripts that need to be run on each of our servers, so I also wrote code there to run the three different services that AcousticBrainz needs in separate Docker containers.

After that, I started work on creating data dumps of the AcousticBrainz data. There was already some code that dumped the entire database into an lzma-compressed file. However, it was old code that hadn’t been run in a long time, and the database had gotten biiig since then. The old code dumped each table as a file into a directory and then added the entire directory at once into a tar file. That approach doesn’t work anymore, because the table that stores the low-level JSON data that users submit to us has become too big to be stored uncompressed in a single text file. The lowlevel_json table has 11 million rows right now, each containing a relatively large JSON document stored in a column of Postgres’ cool JSONB type. The table takes around 357 GB when stored inside Postgres, and that balloons to much more than the space we had on spike. So I wrote some code that dumped 500,000 rows into a file and compressed it before dumping the next 500,000 rows.
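
To give a rough idea of what the chunked dump looks like, here is a minimal sketch, assuming psycopg2 and illustrative column names; it is not the actual AcousticBrainz dump code:

```python
# Rough sketch of a chunked dump, not the actual AcousticBrainz code.
# Assumes psycopg2; column names are illustrative.
import lzma
import psycopg2

CHUNK_SIZE = 500000

def dump_lowlevel_json(dsn, dump_dir):
    conn = psycopg2.connect(dsn)
    # A named (server-side) cursor streams rows instead of pulling the
    # whole 357 GB table into memory at once.
    with conn, conn.cursor(name="lowlevel_json_dump") as cur:
        cur.execute("SELECT id, data::text FROM lowlevel_json ORDER BY id")
        chunk_no = 0
        while True:
            rows = cur.fetchmany(CHUNK_SIZE)
            if not rows:
                break
            # Compress each 500,000-row chunk before dumping the next one,
            # so a full uncompressed dump never has to sit on disk.
            path = "%s/lowlevel_json-%05d.json.xz" % (dump_dir, chunk_no)
            with lzma.open(path, "wt") as f:
                for row_id, data in rows:
                    f.write("%d\t%s\n" % (row_id, data))
            chunk_no += 1
    conn.close()
```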

The compressed AcousticBrainz data dump was around 169 GB in size, which seemed reasonable. Then I realized that the server we were planning to run the webserver on (called boingo, after Oingo Boingo) did not have enough storage space or computational power to hold and work with the database. This led to us getting a shiny new server called frank (after Frank Ocean!), which has a pretty big 7200 RPM hard disk and over 100 GB of RAM. We also decided to upgrade to PostgreSQL 10 during the migration, which meant creating a Docker image for PostgreSQL 10 that we could use in production.

After this, I imported the data into the empty Postgres server, which worked pretty well. Everything seemed set for a short downtime for the migration, where we’d just create a small incremental data dump, move it to frank and import it, bring spike down, bring the webserver up on frank, and be done with it. The steps were written up and we were ready to go.
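
The incremental dump itself is conceptually simple: export only the rows added since the previous dump was taken. A minimal sketch, again assuming psycopg2; the "submitted" column name is an assumption:

```python
# Sketch only: an incremental dump exports just the rows added since the
# last dump. Assumes psycopg2; the "submitted" column name is an assumption.
import lzma
import psycopg2

def dump_incremental(dsn, out_path, last_dump_time):
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor(name="incremental_dump") as cur:
        cur.execute(
            "SELECT id, data::text FROM lowlevel_json "
            "WHERE submitted > %s ORDER BY id",
            (last_dump_time,),
        )
        with lzma.open(out_path, "wt") as f:
            for row_id, data in cur:
                f.write("%d\t%s\n" % (row_id, data))
    conn.close()
```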

Things started: I brought the site down on spike, created an incremental dump, and imported it to frank. Everything worked. We decided to do an integrity check of the new database before bringing the new site up. This is where the trouble started. The number of rows in one of the tables was 10 million when it should have been around 100 million, yikes. We realized that there had been a bug in the original data dump code that we’d written. It was a pretty small bug: the key we were using to dump the data was incorrect, a one-line fix. It made me think that we need more tests for our data dump code.
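
As a sketch of the kind of sanity check that would have caught this earlier (table names and connection details here are assumptions), comparing per-table row counts between the old and new databases is cheap and quick:

```python
# Sketch of a quick sanity check: compare row counts for each dumped table
# between the old and new databases. Table names here are assumptions.
import psycopg2

TABLES = ["lowlevel", "lowlevel_json", "highlevel", "highlevel_model"]

def count_rows(dsn, table):
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM " + table)  # table comes from our fixed list
        (count,) = cur.fetchone()
    conn.close()
    return count

def check_counts(old_dsn, new_dsn):
    ok = True
    for table in TABLES:
        old, new = count_rows(old_dsn, table), count_rows(new_dsn, table)
        if old != new:
            ok = False
        print("%s: old=%d new=%d %s" % (table, old, new, "OK" if old == new else "MISMATCH"))
    return ok
```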

Well, at that point we decided to just go ahead and dump and import that table individually instead of stopping the whole process. The downtime was much longer than expected because of this; the table was pretty big, 100 million rows is no joke, and it took pg_dump hours to dump it. Then I dropped the table on frank and began an import of the dumped file. We had decided not to drop constraints before importing, for sanity reasons, but that turned out not to be such a good idea. The import was 5–6 hours in before it was even halfway done, and the time to import new rows kept increasing. We gave up, stopped the import, and dropped all constraints before starting a new clean import. This went much, much faster and was done in around an hour. At that point, we did another sanity check of the database before bringing the site back up.
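
The pattern that finally made the import feasible is the usual bulk-loading one: drop the constraints, COPY the data in, then re-add the constraints so the data gets validated once in bulk. A rough sketch, with made-up table and constraint names rather than the real AcousticBrainz schema:

```python
# Sketch of the "drop constraints, bulk load, re-add constraints" pattern.
# Table, column, and constraint names are placeholders, not the real schema.
import psycopg2

def bulk_import(dsn, dump_path):
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        # With the constraints gone, Postgres doesn't have to validate every
        # incoming row against other tables while COPY is running.
        cur.execute("ALTER TABLE big_table DROP CONSTRAINT IF EXISTS big_table_pkey")
        cur.execute("ALTER TABLE big_table DROP CONSTRAINT IF EXISTS big_table_fk_other")

        with open(dump_path) as f:
            cur.copy_expert("COPY big_table FROM STDIN", f)

        # Re-adding the constraints validates the data once, in bulk, which is
        # far cheaper than checking it row by row during the import.
        cur.execute("ALTER TABLE big_table ADD PRIMARY KEY (id)")
        cur.execute(
            "ALTER TABLE big_table ADD CONSTRAINT big_table_fk_other "
            "FOREIGN KEY (other_id) REFERENCES other_table (id)"
        )
    conn.close()
```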

Some static files like binaries and old dumps we linked to were still hosted on spike (another thing I missed!), so I had to whip up a quick pull request temporarily changing the links. I was doing this at 3 in the morning, having started work at 11 the previous morning. It was the longest, most intense production deployment I have ever done. Pretty fun now that I think about it, but I was tired then.

Later, I set up an FTP server on frank and moved the static files we were hosting there.

There were a lot of things that I learned in this entire process. The first was that we should sanity check literally everything before bringing any production service down. The second was that importing large amounts of data into a database with its constraints in place is not very feasible. The third is that this level of control is not something I would ever get as a new grad at any big company; being thrown in at the deep end here at MetaBrainz was really awesome. Another thing I forgot to mention is that the entire migration was done remotely over IRC, with me sitting in college in Hamirpur, India and my teammates in Barcelona. That really teaches efficient communication and teamwork.

In hindsight, there are a few things I’d do differently given the chance. I’d definitely have sanity checked the imported database before actually going through with the downtime; it would have saved a lot of pain and the downtime would have been much shorter. That is the biggest thing I learned from the migration process: sanity check as often as possible.

All in all, working with production-grade big data projects has been pretty awesome, and I hope I continue to learn as much as possible as early as possible.

Import your listens to ListenBrainz from Spotify!

Hullo!

We’ve been working on a system to import listens automatically to ListenBrainz from Spotify and we’ve recently deployed it to the ListenBrainz beta site. We would really appreciate it if you could help us test it out!

Please note that this is still beta software; there is a (very small) chance that we might miss a listen or two. So if you’re using this, please make sure that ListenBrainz is not the only service where you’re archiving your listens.

Another thing to note is that importing the same listens from two different sources such as Last.FM and Spotify may cause the creation of duplicates in your listen history. If you opt into our automatic Spotify import, please do not use the Last.FM import or submit listens from other ListenBrainz clients. This is a temporary limitation while we find better ways to deduplicate listens.
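
To illustrate why this happens (this is not the actual ListenBrainz deduplication logic, just a sketch): if the key used to detect duplicates includes metadata or timestamps that Last.FM and Spotify report slightly differently, the same play ends up with two different keys and both copies are kept.

```python
# Not the actual ListenBrainz deduplication logic, just an illustration.
import hashlib

def listen_key(user_name, listened_at, track_name, artist_name):
    raw = "%s|%d|%s|%s" % (user_name, listened_at, track_name.lower(), artist_name.lower())
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# The same play reported as "Track" at 14:03:07 by Last.FM and as
# "Track - Remastered" at 14:03:09 by Spotify produces two different keys,
# so both copies end up in the listen history.
print(listen_key("param", 1518000187, "Track", "Artist"))
print(listen_key("param", 1518000189, "Track - Remastered", "Artist"))
```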

That’s it for the caveats; please go ahead and use the shiny new Spotify Importer. And feel free to report bugs on tickets.metabrainz.org or on IRC in #metabrainz on Freenode.

Thanks!

GSoC 2017: Hacking on ListenBrainz

Namaste!

I am Param Singh, an undergraduate at the National Institute of Technology, Hamirpur, India, and I worked on ListenBrainz over the summer as part of the Google Summer of Code program. I started contributing code to ListenBrainz in January 2017 and have been working on new features and bug fixes since. I’ll be writing about the work I did and my experience working on LB in this blog post.

After a few of my patches had made it in and I was comfortable with the ListenBrainz codebase (which was a really nice example of software architecture for me), I talked with the LB team about what I could contribute over the summer, and we decided that a Google BigQuery based statistics system would be useful to have in ListenBrainz once we had released a beta and had listen data that is permanently archived. I made a proposal for adding statistics to ListenBrainz, which got accepted! During the community bonding period, we decided to try to get a solid and stable beta of ListenBrainz released before starting on the relatively large code additions that my project proposal would require. We tracked the issues we wanted fixed before a release in the MetaBrainz ticket tracker here. This work of fixing release-blocking issues continued into the coding period, and we decided to keep working on a solid beta instead of adding new features for the time being.

I started with fixing bugs and adding new features to get a beta released as soon as possible. Some cool stuff I worked on during this time was dockerizing MessyBrainz (see PR here), migrating the codebases of MessyBrainz and ListenBrainz to Python 3 (PRs here and here), and improving the startup resilience of various parts of ListenBrainz to make sure that the server can (partially) self-heal if some part of it, like RabbitMQ, goes down (ticket here).
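
The self-healing part is essentially a retry loop around connection setup. A minimal sketch of the idea, assuming the pika library; this is not the actual ListenBrainz code:

```python
# Sketch of the self-healing idea: instead of crashing when RabbitMQ is
# unreachable at startup, keep retrying until the connection succeeds.
import time
import pika
from pika.exceptions import AMQPConnectionError

def connect_to_rabbitmq(host, retry_delay=5):
    while True:
        try:
            return pika.BlockingConnection(pika.ConnectionParameters(host=host))
        except AMQPConnectionError:
            print("RabbitMQ not reachable, retrying in %d seconds..." % retry_delay)
            time.sleep(retry_delay)
```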

Later on, I did a big refactor of the LB code so that adding new modules would be easier in the future (PR here). I also spent a lot of time fixing bugs in our listen deduplication. Relevant pull requests for this are here and here.

Another feature I added to ListenBrainz while working on the beta was incremental imports. Earlier, LB didn’t keep track of previous imports of a user and did a full Last.FM import every time. However, now we keep track of the last time each user imported listens and only import new data since then. The PR adding incremental imports is here.
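
The idea behind incremental imports is just to remember a per-user high-water mark and only ask Last.FM for scrobbles newer than it. A sketch with hypothetical helper names, not the actual implementation:

```python
# Sketch of the incremental import idea; the helper objects and method names
# here (lastfm_client, listen_store) are hypothetical, not the real code.
import time

def import_listens(user_name, lastfm_client, listen_store):
    # Only ask Last.FM for scrobbles newer than the user's last import.
    last_import = listen_store.get_latest_import(user_name)  # 0 on first import
    listens = lastfm_client.get_scrobbles(user_name, since=last_import)
    listen_store.insert(user_name, listens)
    # Remember where we stopped so that the next run is incremental too.
    listen_store.set_latest_import(user_name, int(time.time()))
```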

My mentor, Robert Kaye (ruaok), set up a test instance of the ListenBrainz server that was used by the community, and as the community kept throwing their data at us, bugs kept popping up. A particularly weird bug caused LB to lose data for users with special characters in their usernames. The PR to fix this took a lot of time to create.
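
Illustrative of the class of bug rather than the actual fix: a username containing characters like spaces, ‘/’ or ‘+’ has to be percent-encoded before being placed into a URL, otherwise requests for that user can silently target the wrong resource.

```python
# Illustrative only, not the actual fix: percent-encode user names before
# putting them in a URL path, so names with spaces, '/' or '+' don't end up
# pointing at the wrong resource.
from urllib.parse import quote

def listens_url(base_url, user_name):
    return "%s/1/user/%s/listens" % (base_url, quote(user_name, safe=""))

print(listens_url("https://api.listenbrainz.org", "user name/with+specials"))
# https://api.listenbrainz.org/1/user/user%20name%2Fwith%2Bspecials/listens
```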

We kept on fixing bugs for a long time, and the biggest thing I took away from this period of GSoC was the Ninety-ninety rule: “The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.” This summer has drilled this into my mind.

As soon as the beta was released, I started writing code for statistics, making schema changes (PR here) and adding some user stats (PRs here and here). I’ll be continuing the stats work after Summer of Code. The basic foundation of the stats system is mostly done, and soon I’ll start showing statistics to users.
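
As an example of the kind of user statistic this enables, here is a sketch of computing a user’s most-listened artists with the BigQuery client library; the dataset, table, and column names are assumptions:

```python
# Sketch of the kind of user statistic computed with BigQuery: a user's most
# listened-to artists. Dataset, table, and column names are assumptions.
from google.cloud import bigquery

def top_artists(user_name):
    client = bigquery.Client()
    query = """
        SELECT artist_name, COUNT(*) AS listen_count
        FROM `listenbrainz.listens`
        WHERE user_name = @user_name
        GROUP BY artist_name
        ORDER BY listen_count DESC
        LIMIT 10
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("user_name", "STRING", user_name),
        ]
    )
    return [dict(row) for row in client.query(query, job_config=job_config)]
```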

By the end of the official GSoC coding period, I had made 266 commits to the ListenBrainz codebase and opened a total of 111 pull requests. The current production ListenBrainz running on https://listenbrainz.org has 253 commits by me, most of which were made during the GSoC period.

Over the summer, I have fallen in love with the MetaBrainz community and have learned a lot of stuff. I’m really looking forward to adding more features to ListenBrainz soon, so that the data that the community is contributing becomes useful to everyone. I loved working on a really cool open-source project like ListenBrainz this summer and am very thankful to Google for providing me this opportunity. I would encourage everyone reading this to give the ListenBrainz beta a try and contribute to ListenBrainz if possible.