Data – MetaBrainz Blog

The Strangest Releases in MusicBrainz: Weird and Wonderful

MusicBrainz is a treasure trove. Open the lid and you will find glittering piles of release metadata. Mountains of precious artist information. Gold nuggets of high resolution artwork. Everything you can imagine – provided you mainly imagine music data. And also, at the bottom, tucked into the corners, some really weird shit. And some MusicBrainz editors just having the time of their life, adding to the pile. One of those editors has agreed to have a chat with me.

A photo of a dungbeetle on top of a ball of dung. A lot of strange audio releases are sticking out of the dung ball. — One person’s treasure is another person’s…

Thank you for answering my questions, sound.and.vision! I am really regretting the poop analogy, my deepest apologies. But we are dung-beetle editing buddies, and I know that your additions to the database are a treasure to many. Shall we ‘roll’ with it?

New dataset: MusicBrainz Canonical Metadata

The MusicBrainz project is proud to announce the release of our latest dataset: MusicBrainz Canonical Metadata. This geeky-sounding dataset packs an intense punch! It solves a number of problems involving how to match a piece of music metadata to the correct entry in the massive MusicBrainz database.

The MusicBrainz database aims to collect metadata for all releases (albums) that have ever been published. For popular albums, there can be many different releases, which begs the question “which one is the main (canonical) release?”. If you want to identify a piece of metadata, and you only have an artist and recording (track) name, how do you choose the correct database release?

The ODI publishes two reports on Sustainable Data Institutions

The Open Data Institute has just published two reports: Designing Sustainable Data Institutions and Designing Trustworthy Data Institutions which include insights provided by us regarding our MusicBrainz project.

When I was starting MusicBrainz and trying to work out how to make the project sustainable, I would’ve given just about anything to have access to these reports. I am proud that, nearly 20 years later, I was able to contribute to these reports so that others may benefit from our hard work.

I find the section Suggestions for those scoping, designing and running data institutions on page 40 of the PDF version of Designing Sustainable Data Institutions quite enlightening:

Ensure your revenue model aligns with your organisational goals
Understand how your revenue sources will change during your institution’s lifecycle
Consider both financial and non-financial aspects of sustainability
Identify and mitigate future risks
Learn from others

Each of these points represents a whole collection of small lessons that I’ve learned through (often painful) experience over the past years. Also, I feel that these points are not strictly limited to Data Institutions; many also apply to making open source projects sustainable. If you’re in the business of running a data or open source organization, I would strongly encourage you to read this paper!

Also very interesting is the second report about Designing Trustworthy Data Institutions:

For example, the representative from MusicBrainz said, “[A culture of honesty] builds trust, and this trust builds sustainability”

Compared to sustainability, the concepts of trust were much more clear to me from the beginning. However, that doesn’t make this report any less relevant — especially in current times, I welcome an emphasis on trust!

Thank you to the ODI for including MusicBrainz and doing all of the hard work on these reports!

Please nominate us for the Open Publishing Awards!

We’ve recently found out about the Open Publishing Awards:

The goal of the inaugural Open Publishing Awards is to promote and celebrate a wide variety of open projects in Publishing.

…

All content types emanating from the Publishing sector are eligible including Open Access articles, open monographs, Open Educational Resource Materials, open data, open textbooks etc.

Open data? That’s us! We’ve got a pile of it and if you like the work we do, why not nominate us for an award?

Thanks!

AcousticBrainz at the 2018 MetaBrainz Summit

We had an in-person meeting at the MTG during the MetaBrainz summit to discuss the status and future of AcousticBrainz. We came up with a rough outline of things that we want to work on over the next year or so. This is a small list of tasks that we think will have a good impact on the image of AcousticBrainz and encourage people to use our data more.

State of AcousticBrainz

AcousticBrainz has a huge database of submissions (over 10 million now, thanks everyone!), but we are currently not using the wealth of data to our advantage. For the last year we’ve not had a core developer from MetaBrainz or MTG working on existing or new features in AcousticBrainz. However, we now have:

Param, who is including AcousticBrainz in his role with MetaBrainz
Rashi, who worked on AcousticBrainz for GSoC and is going to continue working with us
Philip, who is starting a PhD at MTG, focused on some of the algorithms/data going into AcousticBrainz
Alastair, who now has more time to put towards management of the project

Because of this, we’re glad to present an outline of our next tasks for AcousticBrainz:

Short-term

Some small tasks that are quick to finish and that we can use to show off uses of the data in AcousticBrainz.

Merge Philip’s similarity, including an API endpoint

Philip’s master’s thesis project from last year uses PostgreSQL search to find acoustically similar recordings to a target recording. This uses the features in AcousticBrainz. We need to ensure that PostgreSQL can handle the scale of data that we have.

An extension of this work is to use the similarity to allow us to remove bad duplicate submissions: we can take all recordings with the same MBID and see if they are similar to each other. If one is not similar, we can assume that it’s not actually the same as the other duplicates and mark it as bad. We want to make these results available via an API too, so that others can check this information as well.
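The duplicate check described above could look roughly like the following. This is only a sketch under assumptions, not the method from Philip’s thesis: each submission is reduced to a small numeric feature vector, similarity is plain Euclidean distance, and the function name `flag_outlier_submissions` and the threshold value are made up for illustration.

```python
import math
from statistics import median


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flag_outlier_submissions(vectors, threshold):
    """Return the indexes of submissions that look unlike the others.

    `vectors` holds one feature vector per submission for a single MBID.
    A submission is flagged when its median distance to all other
    submissions exceeds `threshold` (a hypothetical, hand-picked value).
    """
    flagged = []
    for i, vector in enumerate(vectors):
        distances = [
            euclidean(vector, other)
            for j, other in enumerate(vectors)
            if j != i
        ]
        if distances and median(distances) > threshold:
            flagged.append(i)
    return flagged


# Invented example data: (bpm, danceability) for four submissions of one MBID.
submissions = [
    [120.0, 0.50],
    [120.5, 0.52],
    [119.8, 0.49],
    [85.0, 0.90],  # probably a different recording submitted with the wrong MBID
]
print(flag_outlier_submissions(submissions, threshold=5.0))  # prints [3]
```

Using the median distance rather than the mean keeps a single bad submission from dragging the good ones over the threshold.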

Merge Existing PRs

We have many great PRs from various people which Alastair didn’t merge over the last year. We’re going to spend some time getting these patches merged to show that we’re open to contributions!

Publish our Existing models

In research at MTG we’ve come up with a few more detailed genre models based on tag/genre data that we’ve collected from a number of sources. We believe that these models can be more useful than the current genre models that we have. The AcousticBrainz infrastructure supports adding new models easily, so we should spend some time integrating these. There are a few tasks that need to be done to make sure that these work:

Ensure that high-level dumps will dump this new data (If we have an existing high-level dump we need to make a new one including the new data)
Ensure that we compute high-level data for all old submissions (we currently don’t have a system to go back and compute high-level data for old submissions with a new model, the high-level extractor has to be improved to support this)

Update/fix some pages

We have a number of issues reported about unclear text on some pages and grammar that we can improve. Especially important are:

API description (we should remove the documentation from the main website and just have a link to the ReadTheDocs page)
Front page (Show off what we have in the project in more detail, instead of just a wall of text)
Data page (instead of just showing tables of data, try and work out a better way of presenting the information that we have)

Fix Picard plugin

When AB was down during our migration we were serving HTML from our API pages, which caused Picard to crash if the AB plugin was enabled while trying to get AB data. This should be an easy fix in the Picard plugin.
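A fix along these lines could be as simple as checking the response before decoding it. The sketch below is not the actual Picard plugin code; `parse_api_response` is a hypothetical helper that shows the general idea of treating an HTML error page as "no data" instead of letting the JSON decoder raise.

```python
import json


def parse_api_response(content_type, body):
    """Return the decoded JSON document, or None if the response is unusable.

    `content_type` is the value of the Content-Type header (may be None)
    and `body` is the response text.
    """
    # A maintenance or error page is typically served as text/html.
    if "application/json" not in (content_type or "").lower():
        return None
    try:
        return json.loads(body)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return None


print(parse_api_response("text/html", "<html>Down for maintenance</html>"))  # prints None
print(parse_api_response("application/json", '{"bpm": 120}'))  # prints {'bpm': 120}
```

The caller can then skip the AB lookup for that recording when the result is None.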

High Impact

These are tasks that we want to complete first, that we know will have a high impact on the quality of the data that we produce.

Frame-level data

We want to extract and store more detailed information about our recordings. This relies on work being done at MTG to develop a new extractor to allow us to get more detailed information. It will also give us other improvements to data that we have in AB that we know is bad. This data is much bigger than our current data when stored in JSON (hundreds of times larger), so we need to develop a more efficient way of storing submissions. This could involve storing the data in a well-known binary data exchange format. A bunch of subtasks for this project:

Finish the Essentia extractor software
Decide on how to store items on the server (file format, store on disk instead of database)
Work out a way to deal with features from two versions of the extractor (do we keep accepting old data? What happens if someone requests data for a recording for which we have the old extractor data but not the new one?)
Upgrade clients to support this (Change to HTTPS, change to the new API URL structure, ensure that clients check before submission if they’re the latest version, work out how to compress data or perform a duplicate check before submission)
Deduplication (If we have much larger data files, don’t bother storing 200 copies for a single Beatles song if we find that we already have 5-10 submissions that are all the same)
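As an illustration of the deduplication subtask, here is a minimal sketch that caps how many identical submissions are kept per MBID. Everything in it is an assumption: the `SubmissionStore` class, the use of a SHA-256 hash over canonical JSON to decide that two submissions are "the same", and the cap of 5 are illustrative choices, not decisions the project has made.

```python
import hashlib
import json


def fingerprint(document):
    """Hash a submission so that identical documents get identical fingerprints."""
    # sort_keys makes the hash independent of key order in the document.
    canonical = json.dumps(document, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SubmissionStore:
    """In-memory stand-in for the server-side storage (hypothetical)."""

    def __init__(self, max_identical=5):
        self.max_identical = max_identical
        self._counts = {}  # (mbid, fingerprint) -> number of times seen
        self.stored = []   # (mbid, document) pairs that were actually kept

    def submit(self, mbid, document):
        """Store the submission unless enough identical copies already exist.

        Returns True if the document was stored, False if it was skipped.
        """
        key = (mbid, fingerprint(document))
        seen = self._counts.get(key, 0)
        self._counts[key] = seen + 1
        if seen >= self.max_identical:
            return False
        self.stored.append((mbid, document))
        return True


store = SubmissionStore(max_identical=2)
results = [store.submit("example-mbid", {"bpm": 120}) for _ in range(4)]
print(results)  # prints [True, True, False, False]
```

A client could run the same fingerprint calculation before uploading, which would cover the "duplicate check before submission" idea from the client upgrade subtask as well.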

MusicBrainz Metadata

Rashi’s GSoC project in 2018 helped us to replicate parts of the MusicBrainz database into AcousticBrainz. This allows us to do amazing things like keep up-to-date information about MBID redirects, and do search/browse/filtering of data based on relationships such as Artists just by making a simple database query. We want to merge this work and start using it.

Dumps

When we changed the database architecture of AcousticBrainz in 2015 we stopped making data dumps, making people rely on using the API to retrieve data. This is not scalable, and many people have asked for this data. We want to fix all of the outstanding issues that we’ve found in the current dumps system and start producing periodic dumps for people to download.

Build more models

In addition to the existing models that we’ve already built (see above, “Publish our Existing models”), we have been collecting a lot of metadata that we could use to make even more high-level models which we think will be of value to the community. Build these models and publicly release them, using our current machine learning framework.

Wishlist

These are tasks that we want to complete that will show off the data that we have in AcousticBrainz and allow us to do more things with the data, but should come after the high-impact tasks.

Expose AB data on MusicBrainz

As part of the process to cross-pollinate the brainz’s, we want to be able to show a small subset of AB data that we trust on the MB website. This could include information such as BPM, Key, and results from some of our high-level models.

Improve music playback

On the detail page for recordings we currently have a simple YouTube player which tries to find a recording by doing text search. We want to improve the reliability and functionality of this player to include other playback services and take advantage of metadata that we already have in the MusicBrainz database.

Scikit-learn models

The future of machine learning is moving towards deep learning, and our current high-level infrastructure written in the custom Gaia project by MTG is preventing us from applying improved machine learning algorithms to the data that we have. We would like to rewrite the training/evaluation process using scikit-learn, which is a well known Python library for general machine learning tasks. This will make it easier for us to take advantage of improvements in machine learning, and also make our environment more approachable to people outside the MusicBrainz community.

Dataset editor improvements

Part of the high-level/machine learning process involves making datasets that can be used to train models. We have a basic tool for building datasets; however, it is difficult to use for making large datasets. We should look into ways of making this tool more useful for people who want to contribute datasets to AcousticBrainz.

Search

With the integration of the MusicBrainz database into AcousticBrainz, we will be able to let people search for metadata related to items which we know only exist in AcousticBrainz. We think that this is a good way for people to explore the data, and also for people to make new datasets (see above). We also want to provide a way that lets people search for feature data in the database (e.g. “all recordings in the key of Am, between 100 and 110BPM”).
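To make the feature search example concrete, the query “all recordings in the key of Am, between 100 and 110BPM” boils down to a filter like the one below. This is a sketch over an in-memory list; the record layout (`mbid`, `key`, `scale`, `bpm`) and the function name are assumptions for illustration, and a real implementation would run as a database query instead.

```python
def search_recordings(recordings, key, scale, min_bpm, max_bpm):
    """Return the MBIDs of recordings matching a key, scale and BPM range."""
    return [
        recording["mbid"]
        for recording in recordings
        if recording["key"] == key
        and recording["scale"] == scale
        and min_bpm <= recording["bpm"] <= max_bpm
    ]


# Invented sample data with placeholder identifiers.
recordings = [
    {"mbid": "rec-1", "key": "A", "scale": "minor", "bpm": 104.2},
    {"mbid": "rec-2", "key": "A", "scale": "minor", "bpm": 128.0},
    {"mbid": "rec-3", "key": "A", "scale": "major", "bpm": 105.0},
    {"mbid": "rec-4", "key": "C", "scale": "minor", "bpm": 101.5},
]
print(search_recordings(recordings, "A", "minor", 100, 110))  # prints ['rec-1']
```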

API updates

As part of the 2018 MetaBrainz summit we decided to unify the structure of the APIs, including root path and versioning. We should make AcousticBrainz follow this common plan, while also supporting clients who still access the current API.

We should become more in line with the MetaBrainz policy of API access, including user-agent reporting, rate limiting, and API key use.

Request specific data

Many services that use the API only need a very small bit of information from a specific recording, and so it’s often not efficient to return the entire low-level or high-level JSON document. It would be nice for clients to be able to request specific fields for a recording. This ties in with the “Expose AB data on MusicBrainz” task above.
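One way such a field request could work is to let clients name the fields they want as dotted paths and return only those. The sketch below is an assumption about how this might be done, not a description of any existing AcousticBrainz endpoint; `select_fields`, the dotted-path syntax and the sample document are all illustrative.

```python
_MISSING = object()  # sentinel, so that a stored None still counts as a value


def select_fields(document, paths):
    """Return only the requested dotted-path fields from a nested document.

    Paths that do not exist in the document are silently left out.
    """
    result = {}
    for path in paths:
        value = document
        for part in path.split("."):
            if isinstance(value, dict) and part in value:
                value = value[part]
            else:
                value = _MISSING
                break
        if value is not _MISSING:
            result[path] = value
    return result


# Invented sample document loosely shaped like a low-level submission.
document = {
    "rhythm": {"bpm": 104.2, "beats_count": 300},
    "tonal": {"key_key": "A", "key_scale": "minor"},
}
print(select_fields(document, ["rhythm.bpm", "tonal.key_key", "tonal.missing"]))
# prints {'rhythm.bpm': 104.2, 'tonal.key_key': 'A'}
```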

Everything else

Fix all our bugs and make AcousticBrainz an amazing open tool for MIR research.

Thanks for reading! If you have any ideas or requests for us to work on next please leave a comment here or on the forums.

How five Queen songs went mainstream in totally different ways

Making graphs is easy. Making intuitive, easy-to-understand graphs? It’s harder than most people think. At the Rochester Institute of Technology, the ISTE-260 (Designing the User Experience) course teaches the language of design to IT students. For an introductory exercise in the class, students are tasked to visualize any set of data they desire. Students David Kim, Jathan Anandham, Justin W. Flory, and Scott Tinker used the MusicBrainz database to look at how five different Queen songs went mainstream in different ways.

The end of the replication nightmare!

I’m pleased to report that our nightmare of finding/reconstructing the missing replication packets is finally over!

Through many heroic hours of work, Bitmap and Chirlu have reconstructed the missing replication packets. All clients should now be on their way to being up to date. We’ve learned a number of lessons (some good, some bad — that’s life, right?) in this ordeal and we hope to avoid these issues in the future.

A number of people from our community were an integral part of this recovery process: Users mbcz, rembo10 and xeam sent us their complete DB dumps! Bitmap used these to sanity check and diff several other databases to finally extract the missing packets. Thank you for dropping what you were doing and sending us a few GB of data over blazingly fast connections. Without you this would not have been possible; and this is not an exaggeration. Thank you!

After some more rest we’re going to continue to put out smaller fires that remain from the move to NewHost, but for now, the big fires are put out. Just in time for the weekend!

In the 11 year history of the replication stream we’ve had to have users restart their stream about 3-4 times because of problems on our end. Zero would’ve been nicer, but I’m proud that we’ve been able to make this system work for so long. On a daily basis we seem to have about 400 replicated copies of MusicBrainz running all over the world. Clearly this part of our service is well used and I sleep a little better at night knowing that our most critical data is backed up across the globe.

Just for fun, here is a graph of the replication API usage over the last 6 months:

Towards the end, the graph shows the week-plus-long break, then a small blip as some of our replicas got unstuck yesterday, and then a much larger spike as the rest of the replicas got unstuck. Now, as to what caused the blip in mid-October, I have no idea.

Anyways, please accept my apologies for the replication stream outage and keep replicating!

Thanks!

New MusicBrainz server virtual machine available

Time to check the weather forecast for hell, because it appears to have frozen over! We have finally released a new Virtual Machine that contains all of the MusicBrainz server software and fixed all of the currently outstanding bugs (for the VM).

The new VM now uses a 64-bit architecture and has 80GB of disk space so it should be much easier to get along with. I tried to ship one VM that has the search indexes built in, but after 3 hours (and increasing time) of trying to export that VM I killed it. If someone has better luck exporting a VM after building search indexes, please let me know. Also, VirtualBox seems to have improved in stability on Mac OS, so we are not going to build a VMware version of the VM at this time.

All the details for the new VM are on our Server Setup page.

Remember to get your Live Data Feed access token here if you plan to use the replication.

Downstream Wikipedia link usage and migration to Wikidata

MusicBrainz has linked to Wikipedia for many years and we now have links to Wikidata as well. Wikidata, however, acts as a central repository for Wikipedia links, so it does not make sense for MusicBrainz to maintain its own separate set of Wikipedia links, especially since Wikipedia URLs are not very stable (because of page moves and deletions) and require a lot of maintenance. Most of our data with Wikipedia links is now also linked to Wikidata, so we plan to start removing Wikipedia links where we have a Wikidata link which has the same Wikipedia link.

What this means for downstream data users:

If you use Wikipedia links, we will provide Wikidata links but you will need to fetch the Wikipedia links you want from Wikidata separately. Wikidata has information on ways to access their data at https://www.wikidata.org/wiki/Wikidata:Data_access
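For downstream users, fetching a Wikipedia link from a Wikidata ID could look something like this. It is a sketch, not an official recipe: it uses the `wbgetentities` action of the Wikidata API with `props=sitelinks`, the helper names are made up, the sample response is hand-written in the shape that API returns, and the way the Wikipedia URL is built from the site ID assumes a simple language code such as `enwiki`.

```python
from urllib.parse import quote, urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_sitelink_query(wikidata_id, site="enwiki"):
    """Build the API URL that asks Wikidata for one item's sitelink."""
    params = {
        "action": "wbgetentities",
        "ids": wikidata_id,
        "props": "sitelinks",
        "sitefilter": site,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def wikipedia_url_from_response(response, wikidata_id, site="enwiki"):
    """Extract a Wikipedia URL from a decoded wbgetentities response.

    Returns None when the item has no page on the requested wiki.
    """
    entity = response.get("entities", {}).get(wikidata_id, {})
    sitelink = entity.get("sitelinks", {}).get(site)
    if sitelink is None:
        return None
    language = site[: -len("wiki")]  # assumption: "enwiki" -> "en"
    title = sitelink["title"].replace(" ", "_")
    return f"https://{language}.wikipedia.org/wiki/{quote(title)}"


# Hand-written sample response for Q42 (Douglas Adams).
sample_response = {
    "entities": {
        "Q42": {
            "sitelinks": {"enwiki": {"site": "enwiki", "title": "Douglas Adams"}}
        }
    }
}
print(build_sitelink_query("Q42"))
print(wikipedia_url_from_response(sample_response, "Q42"))
# second line prints https://en.wikipedia.org/wiki/Douglas_Adams
```

The request itself is left out so the example runs offline; any HTTP client can fetch the URL from `build_sitelink_query` and pass the decoded JSON to `wikipedia_url_from_response`.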

We plan to start removing the links after the schema change this month, starting with the less common languages and entity types. It will take a while to work through the existing links, so we don’t expect to start removing English links from artists until after the Autumn schema change.

We recognise that some people may have code which depends on these links – if you’re using these links and the above sounds problematic, please let us know how you’re using the data (which languages and entity types) and how much time you would need to support Wikidata.

Editing: Making MusicBrainz better

Over the past few weeks I’ve received a number of emails from people who are concerned about some editors who are losing sight of some basic principles behind editing data in MusicBrainz. I wanted to chime in and remind people of some of the principles that should guide how we all get along when we edit data in MusicBrainz.

First and foremost is:

Be polite and give people the benefit of the doubt that they are doing the right thing.

I don’t have to explain being polite. Yes, we all have our bad days — that is a given. But if you’re having a bad day, stop editing MusicBrainz and step away from your computer. Go outside! When you do edit, please be kind to your fellow editors.

Giving people the benefit of the doubt that they are doing the right thing is also important. The vast majority of people who edit MusicBrainz have good intentions and you should assume that to be the case.

Second, edit to make the database better. Vote yes if an edit makes the data better.

This one is a lot more vague, since “better” is a subjective term. We should accept edits that are “good enough” and avoid asking people to make “perfect” edits.

Edits fit into four categories:

Edits that make things better (perfect or not)
Edits that make things different (but neither version is better)
Edits that contain some correct things and some incorrect things
Edits that are outright wrong (existing data is better)

The first type should clearly get a yes vote. For the second, if it doesn’t make things worse, abstain and leave a comment. The third is a judgement call and I would suggest applying this heuristic:

Unless it takes more time to fix the edit than to make a new one, vote yes.

Clearly, the fourth type deserves a no vote.

That brings me to the final topic for now: No votes. A no vote is a very strong expression that has potentially chilling effects that may prevent people from editing again. A no vote should be considered the last resort. Use a no vote if you can’t find another way to resolve an edit.

Finally, some tips for auto editors: If you see an edit that is not perfect, approve it and fix it.

Auto editors are supposed to set the tone for the project and auto editors should practically never vote no on something. You have more powers than fellow editors, so please use your powers for good!

Thanks and happy (and polite) editing!