How to build your own music tagger, with MusicBrainz Canonical Metadata

In the blog post where we introduced the new Canonical Metadata dataset, we suggested that a user could now build their own custom music tagging application, without a lot of effort! In this blog post we will walk you through the process of doing just that, using Python.

Step 1: Download the data and load into your favorite data store

Here at MetaBrainz, we’re die-hard Postgres fans. But the best tool that we’ve found for metadata matching is the Typesense search engine, which supports typo-resistant search. This example will use the Typesense datastore, but you may use whatever datastore you prefer.

Download the dataset, and then import it into your datastore.
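If you just want to experiment before setting up a real datastore, a plain Python dict can stand in for one. The sketch below is only an illustration: load_canonical_index is a made-up helper name, the sample rows are invented, and it assumes the CSV has a header row that includes the combined_lookup column described in the next step.

```python
import csv
import io


def load_canonical_index(csv_file):
    """Build a dict mapping combined_lookup -> row from an open CSV file.

    Assumes a header row that includes a combined_lookup column.
    """
    index = {}
    for row in csv.DictReader(csv_file):
        # Keep the first row seen for each key; later duplicates are ignored.
        index.setdefault(row["combined_lookup"], row)
    return index


# Tiny made-up sample standing in for canonical_musicbrainz_data.csv.
SAMPLE_CSV = io.StringIO(
    "artist_credit_name,recording_name,combined_lookup\n"
    "Blümchen,Nur geträumt,blumchennurgetraumt\n"
)
index = load_canonical_index(SAMPLE_CSV)
print(index["blumchennurgetraumt"]["artist_credit_name"])
```

For the real file you would pass open(path, newline="", encoding="utf-8") instead of the sample. The full dataset may be too large to hold comfortably in memory, which is one reason to prefer a proper datastore for anything beyond a quick test.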

Step 2: Understand the powerful lookup mechanism

The magic of this dataset is hidden in the combined_lookup column of the canonical_musicbrainz_data.csv file. This field takes the artist name and recording name, removes punctuation and whitespace, lowercases the result, and converts the text to a simple ASCII representation. This makes the column perfect for performing lookups that avoid whitespace, punctuation and diacritic issues.

The genius behind this process is the Unidecode module for Python:

>>> from unidecode import unidecode
>>> unidecode("Blümchen")
'Blumchen'
>>> unidecode("モーニング娘。")
'moninguNiang . '
>>>

As you can see, the unidecode function replaces letters that carry diacritics (ü, in this example) with their base letter (u). Furthermore, it attempts to transliterate other scripts into an ASCII representation, which makes looking up data much simpler. It reduces all of the data to its simplest form, omitting noise that might trip up the matching.

The combined_lookup field contains the artist name and recording name for a given canonical recording, unidecoded, lowercased, and with all punctuation and whitespace removed. Let’s take “Blümchen – Nur Geträumt” as an example:

 “Blümchen”, “Nur Geträumt” -> blumchennurgetraumt

All of the useless guff that can trip up the matching process has been removed!
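To see where each piece of noise disappears, the normalisation can be broken into its individual stages. This is a sketch for illustration only: explain_combined_lookup is a made-up name, and if the unidecode package is not installed it falls back to a standard-library approximation that only strips diacritics and does not transliterate, so the fallback would not reproduce the dataset's keys for non-Latin scripts.

```python
import re
import unicodedata

try:
    from unidecode import unidecode
except ImportError:
    def unidecode(text):
        """Rough stand-in: strips diacritics only, no transliteration."""
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(c for c in decomposed if not unicodedata.combining(c))


def explain_combined_lookup(artist_name, recording_name):
    """Return every intermediate stage as a list of (label, value) pairs."""
    joined = artist_name + recording_name
    stripped = re.sub(r"[^\w]+", "", joined)  # drop punctuation and whitespace
    lowered = stripped.lower()                # fold case
    ascii_only = unidecode(lowered)           # reduce to plain ASCII
    return [
        ("joined", joined),
        ("stripped", stripped),
        ("lowered", lowered),
        ("ascii", ascii_only),
    ]


stages = explain_combined_lookup("Blümchen", "Nur Geträumt")
for label, value in stages:
    print(f"{label:>8}: {value}")
# The final "ascii" stage is: blumchennurgetraumt
```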

Step 3: Create a Python script to look up music metadata

First, define this function, which takes the artist_name and recording_name and returns the matching combined_lookup value:

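import re
from unidecode import unidecode

def make_combined_lookup(artist_name, recording_name):
    # Strip punctuation/whitespace, lowercase, then transliterate to ASCII.
    return unidecode(re.sub(r'[^\w]+', '', artist_name + recording_name).lower())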

Second, take the metadata you wish to look up and create a combined_lookup value from it. Search for that value in the canonical_musicbrainz_data table in your datastore:

def lookup_track(artist_name, recording_name):
    combined_lookup = make_combined_lookup(artist_name, recording_name)
    # datastore_lookup() stands for whatever query your datastore provides.
    match = datastore_lookup(combined_lookup)
    if match is None:
        print("track was not found")
        return

    print("track was matched:")
    print(f"  artist: {match['artist_credit_name']}")
    print(f"  artist_mbids: {match['artist_mbids']}")
    print(f"  release: {match['release_name']}")
    print(f"  release_mbid: {match['release_mbid']}")
    print(f"  recording: {match['recording_name']}")
    print(f"  recording_mbid: {match['recording_mbid']}")
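The datastore_lookup function called above is not defined in this post; it stands for whatever query your datastore offers. As a minimal sketch, assuming a small in-memory dict keyed on combined_lookup, it could look like the code below. The make_datastore_lookup name, the sample record and the 0.9 cutoff are illustrative choices of ours, and difflib from the standard library is only a rough stand-in for the typo-resistant search that Typesense provides.

```python
import difflib


def make_datastore_lookup(index, cutoff=0.9):
    """Return a datastore_lookup function backed by an in-memory dict."""
    def datastore_lookup(combined_lookup):
        # Exact hit first: cheapest and least ambiguous.
        if combined_lookup in index:
            return index[combined_lookup]
        # Otherwise accept the single closest key above the similarity cutoff.
        close = difflib.get_close_matches(
            combined_lookup, index.keys(), n=1, cutoff=cutoff
        )
        return index[close[0]] if close else None
    return datastore_lookup


# Made-up one-record index for demonstration.
sample_index = {
    "blumchennurgetraumt": {
        "artist_credit_name": "Blümchen",
        "recording_name": "Nur geträumt",
    }
}
datastore_lookup = make_datastore_lookup(sample_index)

print(datastore_lookup("blumchennurgetraumt"))  # exact key
print(datastore_lookup("blumchennurgetramt"))   # one letter missing, still found
print(datastore_lookup("zzz"))                  # nothing close enough: None
```

Note that difflib compares the query against every key in turn, so this only suits small test sets; a search engine with a real index is the better fit for the full dataset.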

Now that you’ve done the hard work, all that is left is to write the metadata to some tags or to your datastore, and that is it. However, we strongly urge you to have a human review the matches before accepting them: our process is good, but not good enough to be fully automated. We suggest using a method like this to do the bulk tagging of your collection (if you have decent metadata to start with) and then tagging the remaining, trickier files with MusicBrainz Picard.
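One simple way to keep a human in the loop is to write the proposed matches to a CSV report and only apply the rows that someone has checked. The following is a sketch under our own assumptions: write_review_report, the column layout and the placeholder MBID are all invented for illustration, and match is assumed to be a dict with the same keys printed in the lookup function above.

```python
import csv
import io


def write_review_report(candidates, out_file):
    """Write one row per file so a person can accept or reject each match.

    candidates: iterable of (file_path, match), where match is a dict or None.
    """
    writer = csv.writer(out_file)
    writer.writerow(["file", "artist", "recording", "recording_mbid", "status"])
    for file_path, match in candidates:
        if match is None:
            writer.writerow([file_path, "", "", "", "unmatched"])
        else:
            writer.writerow([
                file_path,
                match["artist_credit_name"],
                match["recording_name"],
                match["recording_mbid"],
                "needs review",
            ])


# Made-up candidates; the MBID is a placeholder, not a real identifier.
candidates = [
    ("track01.mp3", {
        "artist_credit_name": "Blümchen",
        "recording_name": "Nur geträumt",
        "recording_mbid": "00000000-0000-0000-0000-000000000000",
    }),
    ("track02.mp3", None),
]
report = io.StringIO()
write_review_report(candidates, report)
print(report.getvalue())
```

Unmatched files end up in the same report, which gives you a ready-made list of the tracks to hand over to MusicBrainz Picard.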

That is all you need to do to create the core of your very own tagger or metadata lookup engine! If you would like to run this example code yourself, check out the canonical_data_example repository on GitHub.

Have fun and leave a comment if you have questions!

2 thoughts on “How to build your own music tagger, with MusicBrainz Canonical Metadata”

  1. You write an important hint:
    “However, we strongly urge you to have a human review the matches before accepting them”.

    Your example with the artist “Blümchen” and her song “Nur geträumt” is a very good illustration of why there is much more to match than only the artist and a song title.
    As we can see in https://musicbrainz.org/search?query=%22Nur+Getr%C3%A4umt%22+Bl%C3%BCmchen&type=recording&limit=25&method=indexed
    there are about 15 different versions of this song, on different releases, at different track positions, in different years, and as a single, as a normal track on an album by Blümchen, or as part of compilations.

    Maybe you can further reduce the effort for users “that could now build their own custom music tagging application” if you publish some example code to pin down the track that really matches, provided the necessary information is available at all.

  2. The whole point of the canonical recordings is that if you want to get close, but don’t care about perfect, then the canonical recording that we picked is the right choice for you. However, if someone wanted to get *perfect* they can use the canonical recording to find other recordings that might be on other releases and show that as options for the user to choose from. Let me see about adding a comment to the demo project.
