MetaBrainz Summit 2022

The silliest, and thus best, group photo from the summit. Left to right: Aerozol, Monkey, Mayhem, Atj, lucifer (laptop), yvanzo, alastairp, Bitmap, Zas, akshaaatt

After a two-year break, in-person summits made their grand return in 2022! Contributors from all corners of the globe visited the Barcelona HQ to eat delicious local food, sample Monkey and alastairp’s beer, marvel at the architecture, try Mayhem’s cocktail robot, savour New Zealand and Irish chocolates, munch on delicious Indian snacks, and learn about the excellent Spanish culture of sleeping in. As well as, believe it or not, getting “work” done – recapping the last year, and planning, discussing, and getting excited about the future of MetaBrainz and its projects.

We also had some of the team join us via stream: Freso (who also coordinated all the streaming and recording), reosarevok, lucifer, rdswift, and many others who popped in. Thank you for patiently waiting while we ranted, and for the times we didn’t notice you had your hand up. lucifer – who wasn’t able to come in person because of bullshit visa rejections – we will definitely see you next year!

A summary of the topics covered follows. The more intrepid historians among you can see full event details on the wiki page, read the minutes, look at the photo gallery, and watch the summit recordings on YouTube: Day 1, Day 2, Day 3.

OAuth hack session

With everyone together, the days before the summit proper were used for some productive hack sessions. The largest, involving the whole team, was the planning and beginning of a single OAuth location – meaning that everyone will be sent to a single place to log in, from all of our projects.

A great warmup for the summit, this session also leapt the project forward, from identifying exactly how it would work to getting substantial amounts of code and frontend elements in place.
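For the protocol-curious: the idea maps onto the standard OAuth 2.0 authorization-code flow, with one central MetaBrainz login page acting as the identity provider for every project. Below is a hand-wavy Python sketch of what a client project’s side could look like – the endpoint URLs, scope, and parameter names are assumptions for illustration, not the final implementation.

```python
# A hand-wavy sketch of the OAuth 2.0 authorization-code flow implied by a
# single login location. Endpoints and parameters are illustrative assumptions.
import secrets
from urllib.parse import urlencode

import requests

AUTH_URL = "https://musicbrainz.org/oauth2/authorize"  # assumed central login page
TOKEN_URL = "https://musicbrainz.org/oauth2/token"     # assumed token endpoint

def login_redirect(client_id, redirect_uri):
    # Step 1: every MetaBrainz project sends the user to the same place to log in.
    state = secrets.token_urlsafe(16)  # anti-CSRF value, verified on return
    params = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "profile",
        "state": state,
    })
    return f"{AUTH_URL}?{params}", state

def exchange_code(client_id, client_secret, redirect_uri, code):
    # Step 2: the project trades the returned one-time code for an access token.
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "client_secret": client_secret,
        "redirect_uri": redirect_uri,
    })
    resp.raise_for_status()
    return resp.json()["access_token"]
```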

Project recaps

“I broke this many things this year”

To kick off the summit, after a heart-warming introduction by Mayhem, we were treated to the annual recap for each project. For the full experience, feast your eyeballs on the Day 1 summit video – or click the timestamps below. What follows is an eyeball-taster: some simplistic and soothing highlights.

State of MetaBrainz (Mayhem) (4:50)

  • Mayhem reminds the team that they’re kicking ass!
  • We’re witnessing people getting fed up with streaming and turning to a more engaged music experience – exactly the type of audience we wish to cater to – so this may work out well for us.
  • In 2023 we want to expand our offerings to grow our supporters (ListenBrainz)
  • Currently staying lean to prepare for incoming inflation/recession/depression

State of ListenBrainz (lucifer) (57:10)

  • 18.4 thousand all time users
  • 595 million all time listens
  • 92.3 million new listens submitted this year (so far)
  • Stacks of updates in the last year
  • Spotify metadata cache has been a game changer

State of Infrastructure (Zas) (1:14:40)

  • We are running 47 servers, up from 42 in 2019
  • 27 physical (Hetzner), 12 virtual (Hetzner), 8 active instances (Google)
  • 150 Terabytes served this year
  • 99.9% availability of core services
  • And lots of detailed server, Docker, and Ansible updates, and all the speed and response time stats you can shake a stick at.

State of MusicBrainz (Bitmap) (1:37:50)

  • React conversion coming along nicely
  • Documentation improved (auto-generated schema diagrams)
  • SIR crashes fixed, schema changes, stacks of updates (genres!)
  • 1,600 active weekly editors (stable from previous years)
  • 3,401,467 releases in the database
  • 391,536 releases added since 2021, ~1,099 per day
  • 29% of releases were added by the top 25 editors
  • 51% of releases were added with some kind of importer
  • 12,607,881 genre tag votes
  • 49% of release groups have at least one genre
  • Threefold increase in the ‘finnish tango’ genre (3, up from 1 in 2021)

State of AcousticBrainz (alastairp) (2:01:07)

  • R.I.P. (for more on the shut down of AB, see the blog post)
  • 29,460,584 submissions
  • 1.2 million hits per day still (noting that the level of trust/accuracy of this information is very low)
  • Data dumps, with tidying of duplicates, will be released when the site goes away

State of CritiqueBrainz (alastairp) (2:17:05)

  • 10,462 total reviews
  • 443 reviews in 2022
  • Book review support!
  • General bug squashing

State of BookBrainz (Monkey) (2:55:00)

  • A graph with an arrow going up is shown, everyone applauds #business #stonks
  • Twice the amount of monthly new users compared to 2021
  • 1/7th of all editions were added in the last year
  • Small team delivering lots of updates – author credits, book ratings/reviews, unified addition form
  • Import plans for the future (e.g. Library of Congress)

State of Community (Freso) (3:25:00)

  • Continuing discussion and developments re. how MetaBrainz affects LGBTQIA2+ folks
  • New spammer and sockpuppet countermeasures
  • Room to improve moderation and reports, particularly cross-project

Again, for delicious technical details, and to hear lots of lovely contributors get thanked, watch the full recording.

Discussions

“How will we fix all the things alastairp broke”

Next (not counting sleep, great meals, and some sneaky sightseeing) we moved to open discussion of various topics. These topics and questions, submitted by the team, were intended to guide our direction for the next year. Some of them were discussed in break-out groups. You can read the complete meeting minutes in the summit minutes doc.

Ratings

Ratings were added years ago, and remain prominent on MusicBrainz. The topic for discussion was: what is their future? Shall we keep them? This was one of the most popular debates at the summit, with input from the whole spectrum of rating lovers and haters. In the end it was decided to gather more input from the community before making any decisions. We invite you to regale us with tales of your usage, suggestions, and thoughts in the resulting forum thread. 5/5 discussion.

CritiqueBrainz

Similar to ratings, CritiqueBrainz has been around for a number of years now and hasn’t gained much traction. Another popular topic, with lots of discussion regarding how we could encourage community submissions, what improvements could be made, and how we could integrate it more closely with the other projects. Our most prolific CB contributor, sound.and.vision, gave some invaluable feedback via the stream. Ultimately it was decided that we are happy to sunset CB as a website (without hurry), but retain its API and integrate it into our other projects. Bug fixes and maintenance will continue, but new feature development will take place in other projects.

Integrating Aerozol (design)

Aerozol (the author of this blog post, in the flesh) kicked us off by introducing himself with a little TED talk about his history and his design strengths and weaknesses. He expressed interest in being part of the ‘complete user journey’, and in helping to pull MetaBrainz’ amazing work in front of the general public, while being quite polite about MeB’s current attempts in this regard. It was decided that Aerozol should focus on over-arching design roadmaps that can be used to guide project direction, and that it is the responsibility of the developers to make sure new features and updates have been reviewed by a designer (including fellow designer, Monkey).

MusicBrainz Nomenclature

Can MetaBrainz sometimes be overly fond of technical language? To answer that, ask yourself this: did we just use the word ‘nomenclature’ instead of something simpler, like ‘words’ or ‘terms’, in this section title? Exactly. With ListenBrainz aiming for a more general audience, who expect ‘album’ instead of ‘release group’ and ‘track’ instead of ‘recording’, this was predicted to become even more of an issue. Although it was acknowledged that it’s messy and generally unsatisfying to use different terms for the same things within the same ‘MetaBrainz universe’, we decided that it was fine for ListenBrainz to use more casual language for its user-facing bits, while retaining the technical language behind the scenes/in the API.

A related issue was also discussed: how we title and discuss groupings of MusicBrainz entities, which is currently inconsistent – “core entities”, “primary entities”, “basic entities”. No disagreements were raised with yvanzo’s suggestions, the details of which can be found in ticket MBS-12552.

ListenBrainz Roadmap

In another fun discussion (5/5 – who said ratings weren’t useful!), it was decided that for 2023 we should prioritize features that bring in new users. Suggestions revolved around integrating more features into ListenBrainz directly (for instance, MusicBrainz artist and album details, and CritiqueBrainz reviews and ratings), how to promote sharing (please share your thoughts and ideas in the resulting forum thread), making landing pages more inviting for new users, and how to handle notifications.

From Project Dev to Infrastructure Maintenance

MetaBrainz shares a common ‘tech org’ problem, stemming from working in niche areas that require high levels of expertise: we have many tasks that only one or a few people know how to do. It was agreed we should hold another doc sprint to spread that knowledge around, scheduled for the third week of January (16th–20th).

Security Management / Best Practices

Possible password and identity management solutions were discussed, and how we do, and should, deal with security advisories and alerts. It was agreed that there would be a communal security review the first week of each month. There is a note that “someone” should remember to add this to the meeting agenda at the right time. Let’s see how that pans out.

Search & SOLR

Did you know that running and calibrating search engines is a difficult Artform? Indeed, a capital-A Artform. Our search team discussed a future move from SOLR v7 to SOLR v9 (SOLR is MusicBrainz’ search engine). It was discussed how we could use BookBrainz as a guinea pig by moving it from ElasticSearch (the search engine BB currently runs on) to SOLR, and try to finally tackle multi-entity search while we are at it. If you really like reading about ‘cores’, ‘instances’, and whatever ‘zookeeper’ is, then these are your kind of meeting minutes.

Weblate

We currently use Transifex to translate MusicBrainz into other languages (sound interesting? Join the community translation effort!), but are planning to move to Weblate, an open-source alternative that we can self-host. Pros and cons were discussed, and it seems that Weblate can provide a number of advantages, including discussion of translation strings and ease of implementation across all our projects. Adjusting it to allow for single sign-on will involve some work. Video tutorials and introducing the new tool to the community were put on the to-do list.

ListenBrainz Roadmap and UI/UX

When a new user comes to ListenBrainz, where are they coming from, what do they see, and where are we encouraging them to click next? Can users share and invite their friends? Items discussed were specific UI improvements, how we can implement ‘calls to action’, and better sharing tools (please contribute to the community thread if you have ideas). It was acknowledged that we sometimes struggle with implementing sharing tools because the team is (largely) not made up of social media users, and that we should allow for direct sharing as well as downloading and then sharing. Spotify, Apple Music, and Last.fm users were identified as groups that we should or could focus on.

Messages and Notifications

We agreed that we should have a way of notifying users across our sites, for site-to-user as well as user-to-user interactions. There should be an ‘inbox-like’ centre for these, with adequate granular control over the notification options (send me emails, digests, no emails, etc.), and the notification UI should show notifications from all MeB projects, on every site. We discussed how a messaging system could hinder or help our anti-spam efforts, giving users a new conduit to message each other, but also giving us more control (as opposed to the current ‘invisible’ method of letting users email each other directly). It was decided to leave messaging for now (if we do it at all), and focus on notifications.

Year in Music

We discussed what we liked (saveable images, playlists) and what we thought could be improved (lists, design, sharing, streamlining) about last year’s Year in Music. We decided that this year each component needs to have a link so that it can be embedded, as well as sharing tools. We decided to publish our Year in Music in the new year, with the tentative date of Wednesday January 4th, and let Spotify go to heck with their ‘not really a year yet’ December release. We decided to use their December date to put up a blog post and remind people to get their listens added or imported in time for the real YIM!

Mobile Apps

The mobile app has been making great progress, with a number of substantial updates over the last year. However, it seems to be suffering an identity crisis, with people expecting it to be a tagger on the level of Picard (or not really knowing what to expect), and then leaving bad reviews. After a lot of discussion (another popular and polarising topic!) it was agreed to make a new slimmed-down ListenBrainz app to cater to the ListenBrainz audience, and leave the troubled MusicBrainz app history behind. An iOS app isn’t out of the question, but is something to be left for the future. akshaaatt has beaten me to the punch with his blog post on this topic.

MusicBrainz UI/UX Roadmap

The MusicBrainz dev and design team got together to discuss how they could integrate design and a broader roadmap into the workflow. It was agreed that designers would work in Figma (an online layout/mockup design tool), and developers should decide case-by-case whether an element should be standalone or shared among sites (using the design system). We will use React-Bootstrap for shared components. As the conversion to React continues, it may also be useful to pull in designers to look at UI improvements as we go. It was agreed to hold regular team meetings to make sure the roadmap gets and stays on track, and to get the redesign (!) rolling.

Thank you

Revealed! Left to right: Aerozol, Monkey, Mayhem, Atj, lucifer (laptop), yvanzo, alastairp, Bitmap, Zas, akshaaatt

On behalf of everyone who attended, a huge thanks to the wonderful denizens of Barcelona and OfficeBrainz for making us all feel so welcome, and MetaBrainz for making this trip possible. See you next year!

Acoustic similarity in AcousticBrainz

We’re pleased to announce that we have just released acoustic similarity in AcousticBrainz. Acoustic similarity is a technique to automatically identify which recordings sound similar to other recordings, using only the recordings themselves, and not any additional metadata. This feature is available via the AcousticBrainz API and the AcousticBrainz website, from any recording page. General documentation on acoustic similarity is available at https://acousticbrainz.readthedocs.io/similarity.html.
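If you want to try it from code rather than the website, a query can be as small as the sketch below. Note that the endpoint path, parameter names, and response shape here are assumptions for illustration – the documentation linked above is the authoritative reference.

```python
# A minimal sketch of querying the acoustic similarity API with the Python
# "requests" library. Endpoint path, parameter names, and response shape are
# assumptions; see the AcousticBrainz documentation for the real ones.
import requests

BASE = "https://acousticbrainz.org/api/v1"
mbid = "96685213-a25c-4678-9a13-abd9ec81cf35"  # hypothetical recording MBID

resp = requests.get(
    f"{BASE}/similarity/mfccs/",      # assumed: one endpoint per metric
    params={
        "recording_ids": mbid,        # assumed parameter name
        "remove_dups": "all",         # assumed: filter out duplicate MBIDs
    },
    headers={"User-Agent": "similarity-example/0.1"},
)
resp.raise_for_status()
for similar in resp.json().get(mbid, []):
    print(similar)
```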

This feature is based on work started by Philip Tovstogan at the Music Technology Group, the research group that provides the essentia feature extractor that powers AcousticBrainz. The work was continued by Aidan Lawford-Wickham during Summer of Code 2019. Thanks Philip and Aidan for your work!

From the recording view on AcousticBrainz, you can choose to see similar recordings and choose which similarity metric you want to use. Then, a list of recordings similar to the initial recording will be shown.

These metrics are based on different musical features that the AcousticBrainz feature extractor identifies in the audio file. Some of these features are timbral (generally, what something sounds like), some are rhythmic (related to tempo or perceived pulses), and some are AcousticBrainz’s high-level features (hybrid features that use our machine learning system to identify qualities such as genre, mood, or instrumentation).

One thing that we can immediately see in these results is that the same recording appears many times. This is because AcousticBrainz stores multiple different submissions for the same MBID, and will sometimes receive submissions for the same recording under different MBIDs if the recording is duplicated in MusicBrainz. This is actually really interesting! It shows us that we are successfully identifying two different submissions in AcousticBrainz as being the same, using only acoustic information and no metadata. Using the API you can ask to remove these duplicated MBIDs from the results, and we have future plans to use MusicBrainz metadata to filter more of these results when needed.

What’s next?

We haven’t yet performed a thorough evaluation of the quality of these similarity results. We’d like people to use them and give us feedback on what they think. In the future we may look at performing some user studies in order to see if some specific features tend to give results that people consider “more” similar than others. AcousticBrainz has a number of additional features in our database, and we’d like to experiment with these to see if they can be used as similarity metrics as well.

The fact that we can identify the same recording as being similar even when the MusicBrainz ID is different is interesting. It could be useful to use this similarity to identify when two recordings could be merged in MusicBrainz.

The data files used for this similarity are stand-alone, and can be used without additional data from AcousticBrainz or MusicBrainz. We’re looking at ways that we can make these data files downloadable so that developers can use them without having to query the AcousticBrainz API. If you think that you might be interested in this, let us know!

Kartik Ohri joins the MetaBrainz team!

I’m pleased to announce that Kartik Ohri, AKA Lucifer, a very active contributor since his Code-in days in 2018, has become the latest staff member of the MetaBrainz Foundation!

Kartik has been instrumental in rewriting our Android app, and more recently has been helping us with a number of tasks, including new features for ListenBrainz and AcousticBrainz, as well as breathing some much-needed life into the CritiqueBrainz project.

These three projects (CritiqueBrainz, ListenBrainz and AcousticBrainz) will be his main focus while working for MetaBrainz. None of these projects has had enough engineering time recently to move new features forward. We hope that with Kartik’s efforts we can deliver more features faster.

Welcome to the team, Kartik!

MetaBrainz Projects in the news

During the summit this past weekend we talked about posting more updates to our blog. In the spirit of that, I wanted to share two articles where MusicBrainz and AcousticBrainz were recently mentioned in the news. In July, the BBC wrote an article covering research from UC Irvine in California:

They found a significant downturn in the positivity of pop songs. Where 1985 saw upbeat tracks like Wham’s Freedom, 2015 favoured more sombre music by Sam Smith and Adele.

The UC Irvine research team analyzed the publicly available data from AcousticBrainz to arrive at this and several other conclusions.

This wasn’t the first time that our project was used to analyze music trends over time and we’re proud that researchers can carry out this kind of work on our public data. In October Insead Knowledge wrote about the gender gap in the music industry:

We also used song credit information from crowdsourced database MusicBrainz to determine how many women and men worked on the writing, production and performance of each song. . . . At first glance, our overall results appear quite simple. In line with past research on creativity, we find no baseline relationship between the novelty of the songs in our sample and the gender identity of the artists involved. Men and women appear to be equally capable in terms of creativity. But when we controlled for genre and, importantly, the gender composition of artists’ genres, the picture changed. Our methods were guided by an awareness that women in music work in a different context than men do: By a kind of gender-slanted gravitational pull, the music industry drives women into certain genres (e.g. pop) and collaborative networks.

We’ve long known about gender imbalances in the music industry, and while we’re happy that people are using our data to demonstrate this, we’re dismayed at most of the findings in this article. What is more concerning is that we have a general impression that our community has a slight bias towards adding more information about music created by women, which means that the overall situation may actually be worse than what one can deduce from our data!

As a reminder, all data in MusicBrainz is contributed by members of the community. If you see any situations where women or minorities are being mis- or underrepresented, we encourage you to add this content to MusicBrainz. And if you get stuck, don’t hesitate to ask for help on the forums.

State of the Brainz: 2019 MetaBrainz Summit highlights

The 2019 MetaBrainz Summit took place on 27th–29th of September 2019 in Barcelona, Spain at the MetaBrainz HQ. The Summit is a chance for MetaBrainz staff and the community to gather and plan ahead for the next year. This report is a recap of what was discussed and what lies ahead for the community.


GSoC 2019: Recording Similarity Indexing for AcousticBrainz

For Starters… Who Am I?

My name is Aidan Lawford-Wickham, better known as aidanlw17 on IRC, and I’m entering my second year of undergraduate study in Engineering Science at the University of Toronto. This summer, I had the opportunity to participate in my first Google Summer of Code with the MetaBrainz Foundation. Working on the AcousticBrainz project under the mentorship of Alastair Porter (alastairp), I used previous work on measuring track-to-track similarity as the basis for a similarity pipeline using the entire AB database.

How Did I Get Involved?

When I started applying for GSoC, I needed to find an organization that paired a challenging learning environment with a project of personal interest. Given my own passion for listening to music, playing music, and exploring its overlap with culture, MetaBrainz quickly became my top priority. I jumped on the #metabrainz IRC channel for the first time, and I’ve been active daily ever since!

From there, the whole community welcomed me with open arms and responded thoughtfully to my questions about setting up my local development environment. I made my first pull request for AcousticBrainz, AB-387, which added the ability to include dataset and class descriptions when importing datasets as CSV files. This allowed me to work alongside my soon-to-be mentor for the first time and further acquaint myself with the acousticbrainz-server source code.

I was excited about my first PR and wanted to contribute more. Not only was this a project related to my passions, but it had already begun to teach me about technologies that I hadn’t used before. I was struck by the possibility of contributing more, and of working with great people on a non-profit, open source project. I quickly decided that MetaBrainz was the only place I would apply for GSoC and began to think about proposals. I read through the previous work on recording similarity done by Philip Tovstogan, which was based upon a PostgreSQL solution with shortcomings in terms of speed. With a strong supporting background, high community interest, and my own dreams of the possibilities to come from predicting similar tracks, I created a proposal to build a similarity pipeline using Spotify’s nearest neighbours library, Annoy. The timeline and tasks shown on the full proposal were adjusted throughout the summer, but the general objectives were maintained. Looking back on the summer now, the basic requirements for the project were as follows:

  • Using the previous work, define metrics for measuring similarity that will translate recording features from the AB database into vectors. Compute and store these vectors for every recording in the database.
  • Create an Annoy index for each of these metrics, adding the metric’s vector for each recording to the index.
  • Develop methods of querying an index, such as outputting nearest neighbours (similar recordings) to a specific recording or many recordings, or finding the similarity between two recordings.
  • Allow users to query the indices via an API.
  • Create an evaluation that allows us to measure the success of our indices in the public eye, fine-tune our parameters, and display index queries via a graphical user interface.

Community Bonding Period

After losing sleep before the announcement, and a huge sigh of relief on May 6th, I was ecstatic to get started.

There was plenty of required reading, and I familiarized myself with the different elements of building similarity into AB. After discussing with Rob (ruaok) and Alastair and cementing our decision to use Annoy as the nearest neighbours algorithm of choice, I took to reading through the Annoy documentation and making a small implementation to grasp the concepts. Annoy is blazingly fast and uses small, static files – points that would prove advantageous for us in terms of querying indices many times, as quickly as possible. Static index files can be shared across processes and could potentially be simple to redistribute to others in the future – a major benefit for further similarity research.

I studied Philip’s previous work, gained an understanding of the metrics he used in his thesis, and reimplemented all of his code to better grasp the concepts and use them as a basis for the summer. Much of Philip’s work was built to be easily expandable, and flexible to different types of metrics. Notably, when integrating it with a full pipeline including Annoy, priorities like speed meant that we lost some of this flexibility. I found this to be an interesting contrast between the code structure for an ongoing research purpose, and the code ready to be deployed in production on a website.

All the while, I kept up a frequent dialogue with Alastair to gel as a team, clarify issues with the codebase, and further develop our plans for the pipeline. To build on my development skills, learn more about contributing guidelines and source control, and improve the site, I worked on some exciting PRs during the bonding period. Most notably, I completed AB-406 over a series of 3 PRs, which allowed us to introduce a submission offset column in the low-level table to handle multiple submissions of a single recording. This reduced the need for complexity in queries to the API, decreasing the load on the server. Additionally, I added some documentation related to contributions, and created an API endpoint that allows users to select only specific features rather than an entire low-level document for a recording – again aimed at reducing server load.

Last but not least, I got really involved with the weekly meetings at MB! We have meetings every Monday on #metabrainz to give reviews of the last week, and discuss any other important community topics. I love this aspect of the community. Working remotely, it creates a strong team atmosphere and brings us all a bit closer together – even if we’re living time zones apart. During one meeting, we discussed whether or not past GSoC proposals should be available to students. What do you think? This prompted me to share my own experience with the application process at MetaBrainz and look into if/how we could improve it.

… And so it began, we dove into the first coding period.

The Key Components, a Deeper Look

Computing Similarity Metrics

Having explored the previous similarity work from Philip, I used his definitions of metric classes and focused on developing a script to compute metrics for each recording in the database incrementally. Recognizing that we would also need a method of computing metrics for a single recording on submission, I made this script as open ended as possible. After successfully computing all metrics for the first time, we went through an iterative process of altering the logic and methodology to dramatically improve its speed. Ultimately, we used a query to get the batch of low-level recordings that haven’t had similarity computations, complete with their low-level data and all high-level models. Though we revised and found bugs in this script time and time again, I’m confident in saying that with perseverance we finally got it working.
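In rough outline, the incremental computation worked like the sketch below: fetch a batch of recordings that don’t yet have similarity vectors, turn each one’s features into a vector per metric, and write the results back. The table and column names here are hypothetical stand-ins, not the actual acousticbrainz-server schema.

```python
# A rough sketch of incremental metric computation. Table/column names and
# the metric objects are hypothetical; the real schema and classes differ.
import psycopg2
from psycopg2.extras import Json

BATCH_SIZE = 10_000

def compute_batch(conn, metrics):
    """Compute similarity vectors for one batch; returns rows processed."""
    with conn.cursor() as cur:
        # Grab a batch of low-level rows that have no similarity vectors yet.
        cur.execute("""
            SELECT ll.id, ll.data
              FROM lowlevel_json AS ll
         LEFT JOIN similarity AS s ON s.id = ll.id
             WHERE s.id IS NULL
             LIMIT %s
        """, (BATCH_SIZE,))
        rows = cur.fetchall()
        for row_id, data in rows:
            # One vector per metric, e.g. vectors["mfccs"] -> list of floats.
            vectors = {m.name: m.to_vector(data) for m in metrics}
            cur.execute(
                "INSERT INTO similarity (id, metrics) VALUES (%s, %s)",
                (row_id, Json(vectors)),
            )
    conn.commit()
    return len(rows)
```

A driver loop would call compute_batch repeatedly until it returns 0, which also makes the script easy to restart after a failure – the query naturally picks up wherever the last run stopped.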

Prior to the beginning of the project I had limited experience working with SQL databases, and this objective pushed me to develop new ways to approach problems, and gave me a much deeper understanding of PostgreSQL.

Building Annoy Indices

With all that vectorized recording data from the metrics computation, nothing sounds better than adding it to an ultra-fast index built for querying nearest neighbours! Feeding the data into an index and watching it output similar recordings in milliseconds became the most satisfying feeling. The Annoy library is a platform for nearest neighbours of all sorts, and it is generally simple: define the index, add items with an identifier and a vector, build the index, save it for later use, load it up, and then use its built-in methods to query for similar items. Easy, right? The added challenge was making this interface with recordings from our database as items, while meeting our needs for speed and for alterability when new items are added. Annoy is built without checks in many places, and we required a custom cycle of building, loading, and saving indices to ensure they were operable for our purposes (once an index is built, new items may not be added). At this point, the index model is open to saving new indices with different parameters, which allows us to tune as we further develop the pipeline.
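For anyone who hasn’t met Annoy before, the basic cycle described above fits in a few lines. This is a toy sketch with random vectors standing in for our metric vectors – not our production wrapper.

```python
# A toy sketch of the Annoy cycle: define, add, build, save, load, query.
import random
from annoy import AnnoyIndex

DIMS = 12                              # length of one metric's vector
index = AnnoyIndex(DIMS, "angular")    # define the index and its distance type

for item_id in range(1000):            # add items: an integer id plus a vector
    index.add_item(item_id, [random.gauss(0, 1) for _ in range(DIMS)])

index.build(10)                        # build with 10 trees; no more adds after this
index.save("metric.ann")               # a small, static file on disk

loaded = AnnoyIndex(DIMS, "angular")
loaded.load("metric.ann")              # loading is cheap: the file is memory-mapped
print(loaded.get_nns_by_item(0, 10))   # the 10 nearest neighbours of item 0
print(loaded.get_distance(0, 1))       # distance between two items
```

The “no more adds after build” line is exactly the limitation mentioned above that forced our custom build/load/save cycle.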

After wrapping the index in a class that interfaced with our needs, we added scripts to build all indices and save them, and scripts to remove indices if need be. Currently, the project has 12 indices, one for each metric in use:

  • MFCCs
  • Weighted MFCCs
  • GFCCs
  • Weighted GFCCs
  • BPM
  • Key
  • Onset Rate
  • Moods
  • Instruments
  • Dortmund
  • Rosamerica
  • Tzanetakis

API Endpoints

Making API endpoints available was a high-priority activity, and an exciting aspect of the project, since it would allow users to interact with the data provided by the similarity pipeline. Using the index model, I created three API endpoints:

  • Get the n most similar recordings to a recording specified by an (MBID, offset) combination.
  • Get the n most similar recordings to a number of recordings that are specified (bulk endpoint).
  • Get the distance between two recordings.

For each endpoint, a parameter indicates the metric in question, determining which index should be used. Currently, the endpoints also allow varying index parameters, such as the distance type (method of distance calculation) and number of trees used in building the index (precision increases with trees, while speed decreases).

A full explanation of the API endpoints is documented in the source code.

Baseline Evaluation

As I said, an index can be altered using multiple parameters that impact the build speed, query speed, and precision in finding nearest neighbours. Assessing the query results from our indices with public opinion is a top priority, since it gives us valuable data for understanding the quality of similarity predictions. With the evaluation we will be able to collect feedback from the community on a set of similar recordings – do they seem accurate, or should a recording have been more or less similar? What recording do you think is the most similar? With this sort of feedback, we can measure the success of different parameters for Annoy, eventually optimizing our results.

Moreover, this form of evaluation provides a graphical user interface to interact with similar recordings, as a user-friendly alternative to the API endpoints. Written using React, it feels snappy and fast, and I feel that it provides a pleasing display of similar recordings. At this point in the project I was glad to accept a frontend challenge which differed from the bulk of my work thus far.

Documentation and Project Links

Similarity pipeline related:

Additional work:

Going Forward

This summer allowed us to build on previous similarity work to the point of developing a fast, full pipeline. At this point, there is still a vast amount of work to be continued on the pipeline and I am eager to see it through. In the upcoming year I plan to continue contributing to AcousticBrainz and the MetaBrainz Foundation as a whole. These are areas that I’m interested in continuing to develop for the recording similarity pipeline:

  • Parameter tuning on Annoy indices
  • Adding more metrics to cover other recording features
  • Adding support for hybrid metrics that consider multiple features (this was started by Philip and should be integrated to provide more holistic similarity)
  • Making indices available for offline use
  • Creating statistics and visualizations of vectors for each metric

Wrapping Up

To say the least, this has been a highly rewarding experience. MetaBrainz is a community full of extraordinary, thoughtful, and friendly developers and enthusiasts. I will be forever thankful for this opportunity and the lessons that I gained this summer. I am so excited to meet everyone at the summit this September! I’d like to personally thank my mentor, Alastair Porter (alastairp), for his perceptive guidance, his support, his friendship, and his own contributions to the project. Thanks to Robert Kaye (ruaok) for his support, thoughts, and enthusiasm towards this project, as well as for his dedication to MetaBrainz. Thanks to Google for making this all possible – Summer of Code is a unique opportunity to learn about open source software and make new connections! Cheers.

Moving AcousticBrainz to Hetzner

Hi, all. I worked on the recent migration of AcousticBrainz to the central Hetzner infrastructure that hosts all our other projects. It was a fun experience that I would like to share on this blog.

This was the first time I had worked with a production database of this scale, and it was a real learning experience. It really felt like I had jumped in at the deep end, but it was really fun!

For those that don’t know, AcousticBrainz is a music technology project which crowd sources acoustic information for music recordings and is a collaboration between the Music Technology Group at Universitat Pompeu Fabra and MetaBrainz. AcousticBrainz has already collected information about 3.7 million unique recordings and has individual submissions from users for over 11 million recordings.

All the data is stored in a single PostgreSQL database for now. The server that AcousticBrainz used to run on (we called it spike, after the Tom and Jerry character) had gotten old and started spitting out hard disk failure warnings, so we decided to move it to the central Hetzner infrastructure where other MetaBrainz projects are hosted.

We use Docker for all services running in Hetzner and it has worked pretty well for us so far. So the first task was creating a production Docker environment for AcousticBrainz. Consul is used to provide configuration values for the AcousticBrainz server, which needed some new code and Consul template files to be written. This is relatively simple stuff that did not take too long. We also have a repository to store all configuration values and scripts that need to be run on each of our servers, so I also wrote code to run the three different services that AcousticBrainz needs in separate Docker containers.

After that, I started work on creating data dumps of the AcousticBrainz data. There was already some code that dumped the entire database into an lzma-compressed file. However, it was old code that hadn’t been run in a long time, and the database had gotten biiig since then. The way the code worked was that it dumped each table as a file into a directory and then added the entire directory at once into a tar file. This approach no longer works, because the table that stores the low-level JSON data that users submit to us has become too big to be stored uncompressed in a single text file. The lowlevel_json table has 11 million rows right now, with each row containing a relatively large JSON document stored in a column of Postgres’ cool JSONB type. The table takes around 357 GB when stored inside Postgres, and this ballooned to well over the space we had on spike. So, I wrote some code that dumped 500,000 rows into a file and compressed it before dumping the next 500,000 rows.
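The chunking idea itself is simple; something like this sketch (with illustrative table and column names – the real dump code lives in acousticbrainz-server):

```python
# A sketch of dumping a huge table in compressed 500,000-row chunks instead
# of one giant uncompressed file. Names are illustrative, not the real schema.
import lzma
import psycopg2

CHUNK_SIZE = 500_000

def dump_lowlevel(conn, out_prefix):
    # A named (server-side) cursor streams rows instead of loading 357 GB into RAM.
    with conn.cursor(name="dump_lowlevel") as cur:
        cur.itersize = CHUNK_SIZE
        cur.execute("SELECT id, data::text FROM lowlevel_json ORDER BY id")
        chunk_no = 0
        while True:
            rows = cur.fetchmany(CHUNK_SIZE)
            if not rows:
                break
            # Compress each chunk as it is written, so nothing huge sits on disk.
            with lzma.open(f"{out_prefix}-{chunk_no:05d}.xz", "wt") as out:
                for row_id, data in rows:
                    out.write(f"{row_id}\t{data}\n")
            chunk_no += 1

conn = psycopg2.connect("dbname=acousticbrainz")
dump_lowlevel(conn, "lowlevel-dump")
```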

The compressed AcousticBrainz data dump was around 169 GB in size, which seemed reasonable. Then, I realized that the server we were planning to run the webserver on (called boingo, after Oingo Boingo) did not have enough storage space or computational power to hold and work with the database. This led to us getting a shiny new server called frank (after Frank Ocean!), which has a pretty big 7200 RPM hard disk and over 100 GB of RAM. We also decided to upgrade to PostgreSQL 10 during the migration, which led me to create a Docker image for PostgreSQL 10 that we could use in production.

After this, I imported the data into the empty Postgres server, which worked pretty well. Everything seemed set for a small migration downtime, where we’d just create a small incremental data dump, move it to frank and import it, bring spike down, bring the webserver up on frank, and be done with it. The steps were written up and we were ready to go.

Things started well: I brought the site down on spike, created an incremental dump, and imported it to frank. Everything worked. We decided to do an integrity check of the new database before bringing the new site up. This is where the trouble started. The number of rows in one of the tables was 10 million when it should have been around 100 million. Yikes. We realized that there had been a bug in the original data dump code that we’d written. It was a pretty small bug – the key we were using to dump the data was incorrect. A one-line fix, but it convinced me that we need more tests for our data dump code.

Well, at that point we decided to just go ahead and dump and import the table individually, instead of stopping the whole process. The downtime was much longer than expected because of this; the table was pretty big, and 100 million rows is no joke – it took pg_dump hours to dump it. Then, I dropped the table on frank and began an import of the dumped file. We had decided not to drop constraints before importing, for sanity reasons, but that turned out not to be such a good idea. The import took 5–6 hours before it was even halfway done, and the time to import new rows kept increasing. We gave up, stopped the import, and dropped all constraints before starting a new clean import. This worked much, much faster and was done in around an hour. At that point, we did another sanity check of the database before bringing the site back up.
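The pattern that finally worked is the classic bulk-load recipe: drop the constraints, COPY the data in, then recreate the constraints once over the whole table. Roughly like this, with made-up constraint and file names for illustration:

```python
# The bulk-load pattern, sketched with psycopg2. Constraint, table, and file
# names are made up for illustration.
import psycopg2

conn = psycopg2.connect("dbname=acousticbrainz")
with conn, conn.cursor() as cur:
    # 1. Drop constraints so each inserted row doesn't pay for validation.
    cur.execute("ALTER TABLE lowlevel_json DROP CONSTRAINT lowlevel_json_pkey")

    # 2. Bulk load: COPY is far faster than row-by-row INSERTs.
    with open("lowlevel_json.dump") as f:
        cur.copy_expert("COPY lowlevel_json FROM STDIN", f)

    # 3. Recreate the constraint once, validating the whole table in one pass.
    cur.execute(
        "ALTER TABLE lowlevel_json "
        "ADD CONSTRAINT lowlevel_json_pkey PRIMARY KEY (id)"
    )
```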

Some static files, like binaries and old dumps we linked to, were still hosted on spike (another thing I missed!), so I had to whip up a quick pull request changing the links temporarily. I was doing this at 3 in the morning, having started work at 11 the previous morning. It was the longest, most intense production deployment I have ever done. Pretty fun – now that I think about it – but I was tired then.

Later, I set up an FTP server on frank and moved the static files we were hosting there.

There were a lot of things that I learned in this entire process. The first was that we should really sanity check literally everything before bringing any production service down. The second was that importing data into a database with constraints in place (especially large amounts of data) is not very feasible. The third is that this level of control is not something I would ever get as a new grad at any big company. Being thrown in at the deep end here at MetaBrainz was really awesome. Another thing I forgot to mention is that the entire migration process was done remotely over IRC, with me sitting in college in Hamirpur, India and my teammates in Barcelona. This really teaches efficient communication and teamwork.

In hindsight, there are a few things that I’d do differently given the chance again. I’d definitely have sanity checked the imported database before actually going through with the downtime. It would have saved a lot of pain and the downtime would have been much lower. This is the biggest thing I learned from the migration process. Sanity check as often as possible.

All in all, working with production grade big data projects has been pretty awesome, and I hope I continue to learn as much as possible as early as possible.

AcousticBrainz at the 2018 MetaBrainz Summit

We had an in-person meeting at the MTG during the MetaBrainz summit to discuss the status and future of AcousticBrainz. We came up with a rough outline of things that we want to work on over the next year or so. This is a small list of tasks that we think will have a good impact on the image of AcousticBrainz and encourage people to use our data more.

State of AcousticBrainz

AcousticBrainz has a huge database of submissions (over 10 million now – thanks everyone!), but we are currently not using this wealth of data to our advantage. For the last year we’ve not had a core developer from MetaBrainz or MTG working on existing or new features in AcousticBrainz. However, we now have:

  • Param, who is including AcousticBrainz in his role with MetaBrainz
  • Rashi, who worked on AcousticBrainz for GSoC and is going to continue working with us
  • Philip, who is starting a PhD at MTG, focused on some of the algorithms/data going into AcousticBrainz
  • Alastair, who now has more time to put towards management of the project

Because of this, we’re glad to present an outline of our next tasks for AcousticBrainz:

Short-term

Some small tasks that are quick to finish, and that we can use to show off uses of the data in AcousticBrainz.

Merge Philip’s similarity, including an API endpoint

Philip’s master’s thesis project from last year uses PostgreSQL search to find acoustically similar recordings to a target recording. This uses the features in AcousticBrainz. We need to ensure that PostgreSQL can handle the scale of data that we have.

An extension of this work is to use the similarity to allow us to remove bad duplicate submissions: we can take all recordings with the same MBID and see if they are similar to each other; if one is not similar, we can assume that it’s not actually the same as the other duplicates, and mark it as bad (see the sketch below). We want to make these results available via an API too, so that others can check this information as well.
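As a sketch, the duplicate check could look something like the following, assuming a hypothetical distance() function built on the similarity work:

```python
# A sketch of flagging bad duplicates among submissions that share one MBID.
# distance() is a hypothetical function built on the similarity metrics;
# the threshold would need tuning against real data.
def find_outliers(submissions, distance, threshold=0.5):
    """Return the submissions that are not similar to the other duplicates."""
    if len(submissions) < 2:
        return []  # nothing to compare against
    outliers = []
    for sub in submissions:
        others = [s for s in submissions if s is not sub]
        # Average distance from this submission to every other duplicate.
        avg = sum(distance(sub, other) for other in others) / len(others)
        if avg > threshold:  # far from the rest: probably not the same recording
            outliers.append(sub)
    return outliers
```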

Merge Existing PRs

We have many great PRs from various people which Alastair didn’t merge over the last year. We’re going to spend some time getting these patches merged to show that we’re open to contributions!

Publish our Existing models

In research at the MTG we’ve come up with a few more detailed genre models, based on tag/genre data that we’ve collected from a number of sources. We believe that these models can be more useful than the current genre models that we have. The AcousticBrainz infrastructure supports adding new models easily, so we should spend some time integrating these. There are a few tasks that need to be done to make sure that these work:

  • Ensure that high-level dumps will dump this new data (If we have an existing high-level dump we need to make a new one including the new data)
  • Ensure that we compute high-level data for all old submissions (we currently don’t have a system to go back and compute high-level data for old submissions with a new model; the high-level extractor has to be improved to support this)

Update/fix some pages

We have a number of issues reported about unclear text on some pages, and grammar that we can improve. Especially important are:

  • API description (we should remove the documentation from the main website and just have a link to the ReadTheDocs page)
  • Front page (Show off what we have in the project in more detail, instead of just a wall of text)
  • Data page (instead of just showing tables of data, try and work out a better way of presenting the information that we have)

Fix Picard plugin

When AB was down during our migration we were serving HTML from our API pages, which caused Picard to crash if the AB plugin was enabled while trying to get AB data. This should be an easy fix in the Picard plugin.

High Impact

These are tasks that we want to complete first, that we know will have a high impact on the quality of the data that we produce.

Frame-level data

We want to extract and store more detailed information about our recordings. This relies on work being done at the MTG to develop a new extractor that will let us get more detailed information. It will also fix other data in AB that we know is bad. This data is much bigger than our current data when stored as JSON (hundreds of times larger), so we need to develop a more efficient way of storing submissions. This could involve storing the data in a well-known binary data exchange format. A bunch of subtasks for this project:

  • Finish the essentia extractor software
  • Decide on how to store items on the server (file format, store on disk instead of database)
  • Work out a way to deal with features from two versions of the extractor (do we keep accepting old data? What happens if someone requests data for a recording for which we have the old extractor data but not the new one?)
  • Upgrade clients to support this (Change to HTTPS, change to the new API URL structure, ensure that clients check before submission if they’re the latest version, work out how to compress data or perform a duplicate check before submission)
  • Deduplication (If we have much larger data files, don’t bother storing 200 copies for a single Beatles song if we find that we already have 5-10 submissions that are all the same)

MusicBrainz Metadata

Rashi’s GSoC project in 2018 helped us to replicate parts of the MusicBrainz database into AcousticBrainz. This allows us to do amazing things like keep up-to-date information about MBID redirects, and do search/browse/filtering of data based on relationships such as Artists just by making a simple database query. We want to merge this work and start using it.

Dumps

When we changed the database architecture of AcousticBrainz in 2015 we stopped making data dumps, making people rely on using the API to retrieve data. This is not scalable, and many people have asked for this data. We want to fix all of the outstanding issues that we’ve found in the current dumps system and start producing periodic dumps for people to download.

Build more models

In addition to the existing models that we’ve already built (see above, “Publish our Existing models”), we have been collecting a lot of metadata that we could use to make even more high-level models which we think will have value in the community. We should build these models and publicly release them, using our current machine learning framework.

Wishlist

These are tasks that we want to complete that will show off the data that we have in AcousticBrainz and allow us to do more things with the data, but should come after the high-impact tasks.

Expose AB data on MusicBrainz

As part of the process to cross-pollinate the brainz’s, we want to be able to show a small subset of AB data that we trust on the MB website. This could include information such as BPM, Key, and results from some of our high-level models.

Improve music playback

On the detail page for recordings we currently have a simple YouTube player which tries to find a recording by doing text search. We want to improve the reliability and functionality of this player to include other playback services and take advantage of metadata that we already have in the MusicBrainz database.

Scikit-learn models

The future of machine learning is moving towards deep learning, and our current high-level infrastructure, written using MTG’s custom Gaia project, is preventing us from applying improved machine learning algorithms to the data that we have. We would like to rewrite the training/evaluation process using scikit-learn, a well-known Python library for general machine learning tasks. This will make it easier for us to take advantage of improvements in machine learning, and also make our environment more approachable to people outside the MusicBrainz community.

Dataset editor improvements

Part of the high-level/machine learning process involves making datasets that can be used to train models. We have a basic tool for building datasets, however it is difficult to use for making large datasets. We should look into ways of making this tool more useful for people who want to contribute datasets to AcousticBrainz.

Search

With the integration of the MusicBrainz database into AcousticBrainz, we will be able to let people search for metadata related to items which we know only exist in AcousticBrainz. We think that this is a good way for people to explore the data, and also for people to make new datasets (see above). We also want to provide a way that lets people search for feature data in the database (e.g. “all recordings in the key of Am, between 100 and 110 BPM”).
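To make that example concrete, such a feature search might boil down to a query along these lines. The schema and JSON paths shown here are illustrative assumptions about where key and BPM live in the low-level documents.

```python
# What a feature search might boil down to: a query over the JSONB low-level
# documents. Table names and JSON paths are illustrative assumptions.
QUERY = """
    SELECT ll.gid AS mbid
      FROM lowlevel AS ll
      JOIN lowlevel_json AS llj ON llj.id = ll.id
     WHERE llj.data -> 'tonal' ->> 'key_key' = 'A'
       AND llj.data -> 'tonal' ->> 'key_scale' = 'minor'
       AND (llj.data -> 'rhythm' ->> 'bpm')::float BETWEEN 100 AND 110
"""
```

In practice we would probably want these values extracted into indexed columns rather than scanning JSON documents, but the query shows the kind of question we want people to be able to ask.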

API updates

As part of the 2018 MetaBrainz summit we decided to unify the structure of the APIs, including root path and versioning. We should make AcousticBrainz follow this common plan, while also supporting clients who still access the current API.

We should also become more in line with the MetaBrainz policy on API access, including User-Agent reporting, rate limiting, and API key use.

Request specific data

Many services which use the API only need a very small bit of information from a specific recording, so it’s often not efficient to return the entire low-level or high-level JSON document. It would be nice for clients to be able to request specific fields for a recording. This ties in with the “Expose AB data on MusicBrainz” task above.

Everything else

Fix all our bugs and make AcousticBrainz an amazing open tool for MIR research.


Thanks for reading! If you have any ideas or requests for us to work on next please leave a comment here or on the forums.

AcousticBrainz migration: We’re on!

We had to postpone the migration of AcousticBrainz last week since we ran out of time (our database is getting to be sizable!). We’ve migrated the bulk of our data and are now ready to move the last bits and call the move complete.

Downtime will start very soon – follow us on Twitter for more detailed updates.

AcousticBrainz downtime: Migrating hosting to our other servers

Today we’re going to migrate the AcousticBrainz service from the standalone server that we’ve rented for the past few years to our shared infrastructure at Hetzner. We’ve been prepping for this move for a few weeks now, and the process we’ll follow has been used before, so we don’t expect the downtime to be more than 1 hour.

We’re sorry for the downtime that is coming – to keep up with what we’re doing, please follow our progress on Twitter. We hope to start the migration within an hour or two of this blog post going up.