New dataset: MusicBrainz Canonical Metadata

The MusicBrainz project is proud to announce the release of our latest dataset: MusicBrainz Canonical Metadata. This geeky-sounding dataset packs an intense punch! It solves a number of problems involving how to match a piece of music metadata to the correct entry in the massive MusicBrainz database.

The MusicBrainz database aims to collect metadata for all releases (albums) that have ever been published. For popular albums, there can be many different releases, which raises the question “which one is the main (canonical) release?”. If you want to identify a piece of metadata, and you only have an artist and recording (track) name, how do you choose the correct database release?

This same problem exists on the recording level – many recordings (songs) exist on many releases – which one should be used?

The MusicBrainz Canonical Metadata dataset now solves this problem by allowing users to look up canonical releases and canonical recordings. Given any release MBID, the Canonical Release Mapping (canonical_release_redirect.csv) allows you to find the release that we consider “canonical”. The same is now true for recordings: given any recording MBID, the Canonical Recording Mapping (canonical_recording_redirect.csv) allows you to find the correct canonical recording MBID.
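As a rough illustration of such a lookup, here is a minimal Python sketch that loads one of the redirect files into a dictionary. The column names (recording_mbid, canonical_recording_mbid) are assumptions about the CSV header, the MBIDs in the sample are placeholders, and treating an MBID with no redirect row as already canonical is also an assumption; check the dataset documentation before relying on any of them.

```python
import csv
import io


def load_redirects(csv_file, source_col, canonical_col):
    """Build a dict mapping each source MBID to its canonical MBID.

    csv_file is any open text file (or file-like object) with a header row.
    The column names are passed in because they are assumptions here.
    """
    reader = csv.DictReader(csv_file)
    return {row[source_col]: row[canonical_col] for row in reader}


def canonical_mbid(redirects, mbid):
    """Return the canonical MBID for mbid.

    Assumption: an MBID without a redirect row is treated as canonical.
    """
    return redirects.get(mbid, mbid)


# Tiny in-memory stand-in for canonical_recording_redirect.csv (placeholder IDs).
sample = io.StringIO(
    "recording_mbid,canonical_recording_mbid\n"
    "rec-live-version,rec-album-version\n"
)
redirects = load_redirects(sample, "recording_mbid", "canonical_recording_mbid")
print(canonical_mbid(redirects, "rec-live-version"))
```

For the real file, you would pass `open("canonical_recording_redirect.csv", newline="", encoding="utf-8")` instead of the in-memory sample. The same function should work for canonical_release_redirect.csv with the matching column names.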

These datasets solve some tricky problems for our data users, but the last table gets really interesting: Canonical Metadata (canonical_musicbrainz_data.csv). This table contains all the string metadata necessary to make effective use of the two datasets above. Artist names, release names and recording names are all present in this table, indexed against artist_credit_id, release_mbid and recording_mbid.

The real power of this table comes from its ability to provide an easy and compact way to identify your own music files, or to match/correct music metadata. Looking up your music tracks on MusicBrainz can be a challenge, since you need to understand the intricate schema underneath. Not so with this dataset – import the dataset into your favorite datastore and start looking up tracks to clean up. (That is, if our Picard tagging application isn’t sufficient for your needs!)
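To make the "import and look up" idea concrete, here is a minimal sketch that loads the table into an in-memory SQLite database and searches it by artist and recording name. The columns artist_credit_id, release_mbid and recording_mbid are named above; the name columns (artist_credit_name, release_name, recording_name) are assumptions about the CSV header, and the sample row uses placeholder IDs. Exact case-insensitive matching is a simplification, not necessarily how you would want to match messy tags.

```python
import csv
import io
import sqlite3

# Assumed column names; only the three ID columns are confirmed by the post.
COLUMNS = [
    "artist_credit_id", "artist_credit_name", "release_mbid",
    "release_name", "recording_mbid", "recording_name",
]


def build_store(csv_file):
    """Load the metadata CSV into an in-memory SQLite table with a lookup index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE canonical ("
        "artist_credit_id INTEGER, artist_credit_name TEXT, release_mbid TEXT, "
        "release_name TEXT, recording_mbid TEXT, recording_name TEXT)"
    )
    reader = csv.DictReader(csv_file)
    conn.executemany(
        "INSERT INTO canonical VALUES (?, ?, ?, ?, ?, ?)",
        ([row[col] for col in COLUMNS] for row in reader),
    )
    # Index on lowercased names so the lookup below does not scan every row.
    conn.execute(
        "CREATE INDEX idx_lookup ON canonical "
        "(lower(artist_credit_name), lower(recording_name))"
    )
    return conn


def find_track(conn, artist, recording):
    """Return (recording_mbid, release_mbid, release_name) rows for an exact,
    case-insensitive artist/recording match."""
    cur = conn.execute(
        "SELECT recording_mbid, release_mbid, release_name FROM canonical "
        "WHERE lower(artist_credit_name) = lower(?) "
        "AND lower(recording_name) = lower(?)",
        (artist, recording),
    )
    return cur.fetchall()


# Placeholder row standing in for canonical_musicbrainz_data.csv.
sample = io.StringIO(
    "artist_credit_id,artist_credit_name,release_mbid,release_name,"
    "recording_mbid,recording_name\n"
    "1,Example Artist,rel-1,Example Album,rec-1,Example Song\n"
)
conn = build_store(sample)
print(find_track(conn, "example artist", "EXAMPLE SONG"))
```

Note that SQLite's built-in lower() only folds ASCII letters, so names with accented or non-Latin characters would need a different normalization step.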

In a follow-up blog post, I will walk you through the process of using this dataset to identify music tracks, using Python. Update: Learn how to build your own music tagger in this post!

Find out more about this dataset on the canonical dataset documentation page, and browse all of our datasets from our new datasets documentation page.

5 thoughts on “New dataset: MusicBrainz Canonical Metadata”

  1. Interesting. How was the canonical release determined for say a Beatles album? What were the rules used?

    I had always thought the earliest release in a release group was the ‘canonical’ release. What’s different?

  2. The process starts by ordering all releases by: release group type, release group secondary type, release format (preferring digital over analog releases), release date and finally release country.

    Then, processing releases in that order, each recording on a release is entered into an artist_name / recording_name index, where only the first instance of each unique pair is kept. Each of these pairs is considered canonical, and all others of the same spelling are considered redirects to this canonical recording.

    It isn’t perfect, but it is really good and fairly simple.

  3. Beggar’s Banquet by The Rolling Stones has release group 6e672bbd-7c7f-32f8-8335-c603be99d13b and canonical release reid:8833c2be-4717-3d58-96f2-33f592a3dc06 which is the 1986 CD release in Germany. In what way is that the main or correct release instead of the 1968 UK release? It could be because the primary use (or data source) of musicbrainz is assumed to be annotating tracks ripped from a CD? In which case, why feature any vinyl releases at all?

    In any case, thanks for maintaining the database and releasing potentially convenient tables like this one.

  4. Hi!

    MusicBrainz is built as a structured music encyclopedia and serves many purposes, including but far from limited to tagging. The canonical dataset is used to match incoming ListenBrainz data to MusicBrainz.

    And that mapping prefers CD releases over vinyl releases because ListenBrainz plays digital content. So far I’ve not heard about playlists being played out via vinyl, but I look forward to the day when that becomes possible. 🙂

    And on top of this, all of this is a bit inexact — picking a fitting track from 38M rows based solely on artist name and recording is not an exact science.
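
The ordering and first-instance-wins process described in comment 2 can be sketched roughly as follows. This is not the actual MusicBrainz implementation: the Track class, the numeric ranks inside sort_key and the placeholder MBIDs are all hypothetical, chosen only to show how preferring digital formats can put a later CD ahead of an earlier vinyl release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """One recording on one release (hypothetical structure for this sketch)."""
    artist_name: str
    recording_name: str
    recording_mbid: str
    # Hypothetical ordering tuple: (release group type rank, secondary type
    # rank, format rank with digital lowest, release date, country rank).
    sort_key: tuple


def build_canonical_index(tracks):
    """Walk tracks in release order and keep the first MBID seen per
    (artist_name, recording_name) pair; later MBIDs become redirects."""
    canonical = {}
    redirects = {}
    for track in sorted(tracks, key=lambda t: t.sort_key):
        pair = (track.artist_name, track.recording_name)
        first = canonical.setdefault(pair, track.recording_mbid)
        if first != track.recording_mbid:
            redirects[track.recording_mbid] = first
    return canonical, redirects


tracks = [
    # Analog format (rank 1), earlier date.
    Track("Example Artist", "Example Song", "rec-vinyl", (0, 0, 1, "1968-12-06", 0)),
    # Digital format (rank 0), later date: sorts first because format
    # is compared before release date.
    Track("Example Artist", "Example Song", "rec-cd", (0, 0, 0, "1986-01-01", 0)),
]
canonical, redirects = build_canonical_index(tracks)
print(canonical)
print(redirects)
```

Because the format rank comes before the release date in the tuple, the CD recording is picked as canonical here and the vinyl recording redirects to it, which matches the behaviour discussed in comments 3 and 4.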
