Cleaning up the Music Listening Histories Dataset

Hi, this is Prathamesh Ghatole (IRC Nick: “Pratha-Fish”), and I am an aspiring Data Engineer from India, currently pursuing my bachelor’s in AI at GHRCEM Pune, and another bachelor’s in Data Science and Applications at IIT Madras. 

I had the pleasure of being mentored by alastairp and the rest of the incredible team at the MetaBrainz Foundation throughout this complicated but super fun project as a GSoC '22 contributor. This blog is all about my journey over the past 18 weeks.

In an era where music streaming is the norm, it is no secret that to create modern, more efficient, and personalized music information retrieval systems, the modelling of users is necessary because many features of multimedia content delivery are perceptual and user-dependent. As music information researchers, our community has to be able to observe, investigate, and gather insights from the listening behavior of people in order to develop better, personalized music retrieval systems. Yet, since most media streaming companies know that the data they collect from their customers is very valuable, they usually do not share their datasets. The Music Listening Histories Dataset (MLHD) is the largest-of-its-kind collection of 27 billion music listening events assembled from the listening histories of over 583k last.fm users, involving over 555k unique artists, 900k albums, and 7M tracks. The logs in the dataset are organized in the form of sanitized listening histories per user, where each user has one file, with one log per line. Each log is a quadruple of: 

<timestamp, artist-MBID, release-MBID, recording-MBID>

The full dataset contains 576 files of about 1GB each. These files are subsequently bundled in sets of 32 TAR files (summing up to ~611.39 GB in size) in order to facilitate convenient downloading.
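Purely as an illustration, here is how a single user file might be loaded with pandas, assuming gzip-compressed, tab-separated logs with no header row (the file name and column names below are placeholders, not official ones):

import pandas as pd

# Load one MLHD user file; the path and column names are illustrative.
columns = ["timestamp", "artist_MBID", "release_MBID", "recording_MBID"]
logs = pd.read_csv(
    "some_user.txt.gz",   # placeholder file name
    sep="\t",
    names=columns,
    compression="gzip",
)
print(logs.head())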

Some salient features of the MLHD:

  • Each entity in every log is linked to a MusicBrainz Identifier (MBID) for easy linkage to other existing sources.
  • All the logs are time-stamped, resulting in a timeline of listening events.
  • The dataset is freely available and is orders of magnitude larger than any other dataset of its kind.
  • All the data is scraped from last.fm, where users publicly self-declare their music listening histories.

What is our goal with this project?

The dataset would be useful for observing many interesting insights like:

  • How long people listen to music in a single session
  • The kinds of related music that people listen to in a single session
  • The relationship between artists and albums and songs
  • What artists do people tend to listen to together?

In its current form, the MLHD is a great dataset in itself, but for our particular use-case, we'd like to make some additions and fix a few issues caused by last.fm's out-of-date algorithms for matching against the MusicBrainz database. (All issues are discussed in detail in my original GSoC proposal.)

For example:

  1. The artist conflation issue: We found that the artist MBIDs for commonly used names were wrong in many logs, where the artist MBID pointed to an incorrect artist with the same name in the MusicBrainz database. E.g., for the song "Devil's Radio" by George Harrison (of the Beatles), the MLHD incorrectly points to an obscure Russian hardcore group also named "George Harrison".
  2. Multiple artist credits: The original MLHD provides only a single artist MBID, even for recordings with multiple artists involved. We aim to fix that by providing a complete artist credit list for every recording.
  3. Complete data for every valid recording MBID: We aim to use the MusicBrainz database to fetch accurate artist credit lists and release MBIDs for every valid recording MBID, hence improving the quality and reliability of the dataset.
  4. MBID redirects: 22.7% of the recording MBIDs (from a test set of 371k unique recording MBIDs) that we tested were not suitable for direct use. Of these, 98.66% were simply redirects to other (correct) MBIDs.
  5. Non-canonical MBIDs: A significant fraction of MBIDs were not canonical. For example, a release group might be represented by multiple valid MBIDs, but there is always a single MBID that is the "most representative" of the release group, known as the "canonical" MBID.

While the redirecting as well as non-canonical MBIDs are technically correct and identical when considered in aggregate, we think replacing these MBIDs with their canonical counterparts would be a nice addition to the existing dataset and aid in better processing. Overall, the goal of this project is to write high-performance Python code to resolve the dataset, as quickly as possible, into an updated version in the same format as the original, but with incorrect data rectified and invalid data removed.

Check out the complete codebase for this project at: https://github.com/Prathamesh-Ghatole/MLHD

The Execution

Personally, I’d classify this project as a Data Science or Data Engineering task involving lots of analytics, exploration, cleanup, and moving goals and paths as a result of back-and-forth feedback from stakeholders. For a novice like me, this project was made possible through many iterations involving trial and error, learning new things, and constantly evolving existing solutions to make them more viable, and in line with the goals. Communication was a critical factor throughout this project, and thanks to Alastair, we were able to communicate effectively on the #Metabrainz IRC channel and keep a log of changes in my task journal, along with weekly Monday meetings to keep up with the community.

Skills/Technologies used

  • Python3, Pandas, iPython Notebooks – For pretty much everything
  • NumPy, Apache Arrow – For optimizations
  • Matplotlib, Plotly – For visualizations
  • PostgreSQL, psycopg2 – For fetching MusicBrainz database tables, quick-and-dirty analytics, etc.
  • Linux – For working with a remote Linux server for processing data.
  • Git – For version control & code sharing.

Preliminary Analysis

1. Checking the demographics for MBIDs

We analyzed 100 random files from the MLHD, containing 3.6M rows, and found the following. Of the 381k unique recording MBIDs, ~22.7% were not readily usable, i.e. they had to be redirected or made canonical. However, ~98.66% of those MBIDs were correctly redirected to a valid recording MBID using the MusicBrainz database's "recording" table, implying that only ~0.301% of all unique recording MBIDs from the MLHD were completely unknown (i.e. they neither belonged to the "recording" table nor had a valid redirect). Similarly, about ~5.508% of all unique artist MBIDs were completely unknown (they belonged to neither the "artist" table nor the "artist_gid_redirect" table).

2. Checking for the artist conflation issue

There are many artists with exactly the same name, and we were unsure whether last.fm's algorithms matched the correct artist MBID to each recording MBID in such cases. To verify this, we fetched the artist MBID for each recording MBID in a test set and compared it to the artist MBID actually present in the dataset. Lo and behold, we discovered that ~9.13% of the 376,037 unique cases in our test set faced this issue.

SOLUTION 1

This is how we first tried dealing with the artist conflation issue:

  1. Take a random MLHD file
  2. “Clean up” the existing artist MBIDs and recording MBIDs, and find their canonical MBIDs. (Discussed in detail in the section “Checking for non-canonical & redirectable MBIDs”)
  3. Fetch the respective artist name and recording name for every artist MBID and recording MBID from the MusicBrainz database.
  4. For each row, pass <artist name, recording name> to either of the following MusicBrainz APIs:
    1. https://datasets.listenbrainz.org/mbc-lookup 
    2. https://labs.api.listenbrainz.org/mbid-mapping
  5. Compare artist MBIDs returned by the above API to the existing artist MBIDs in MLHD.
  6. If the existing MBID is different from the one returned by the API, replace it.
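For illustration, the per-row lookup in step 4 might have looked roughly like the sketch below. The query parameter names and the response layout are assumptions on my part (check the endpoint documentation before relying on them):

import requests

def lookup_artist_mbid(artist_name, recording_name):
    # Ask the mapper which artist MBID it matches for this
    # <artist name, recording name> pair. Parameter names and the response
    # structure are assumptions, not verified against the endpoint.
    resp = requests.get(
        "https://labs.api.listenbrainz.org/mbid-mapping",
        params={"artist_credit_name": artist_name,
                "recording_name": recording_name},
        timeout=10,
    )
    resp.raise_for_status()
    matches = resp.json()
    if matches:
        return matches[0].get("artist_mbid")
    return None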

However, this method meant making an API call for EACH of the 27bn rows of the dataset. That is 27 billion API calls, where each call would've taken at least 0.5s, i.e. roughly 156,250 days just to solve the artist conflation issue. This was in no way feasible, and would've taken ages to complete even if we had parallelized the whole process with Apache Spark. And even after all this, the output generated by the API would've been little more than a fuzzy solution prone to errors.

SOLUTION 2

Finally, we tackled the artist conflation issue by using the MusicBrainz database to fetch artist credit lists for each valid recording MBID. This enabled us to perform in-memory computations and completely eliminated the need to make API calls, saving us a lot of processing time. This not only made sure that every artist MBID corresponded to its correct recording MBID 100% of the time, but also:

  • Improved the quality of the provided artist MBIDs by providing a full list of artist MBIDs in the case of recordings with multiple artists.
  • Increased the count of release MBIDs in the dataset by 10.19%!
    (Test performed on the first 15 files from the MLHD, summing up to 952,229 rows of data)

3. A new contender appears! (Fixing the MBID mapper)

While working out "SOLUTION 1" as discussed in the previous section, we processed thousands of rows of data and compared the outputs of the mbc-lookup and mbid-mapping APIs, and discovered that these APIs sometimes returned different outputs when they should have returned the same ones. This uncovered a fundamental issue in the mbid-mapping API that was actively being used by ListenBrainz to link listens streamed by users to their respective entities in the MusicBrainz database. We spent a while analyzing the depth of this issue by generating test logs and reports for both mapping endpoints, and discovered patterns that pointed to some bugs in the matching algorithms written for the API. This discovery helped lucifer debug the mapper, resulting in the following pull request: Fix invalid entries in canonical_recording_redirect table by amCap1712 · Pull Request #2133 · metabrainz/listenbrainz-server (github.com)

4. Checking for non-canonical & redirectable MBIDs

To use the MusicBrainz database to fetch artist names and recording names w.r.t. their MBIDs, we first had to make sure the MBIDs we used to look up the names were valid, consistent, and "clean". This was done by:

  1. Checking if an MBID was redirected to some other MBID, and replacing the existing MBID with the MBID it redirected to.
  2. Finding a Canonical MBID for all the recording MBIDs.

We used the MusicBrainz database's "mapping.canonical_recording_redirect" table to fetch canonical recording MBIDs, and the "recording_gid_redirect" table to check and fetch redirects for all the recording MBIDs. We first tried mapping an SQL query over every row to fetch results, but soon realized it would've slowed the whole process down to unbearable levels. Since we were running the processes on "Wolf" (a server at MetaBrainz Inc.), we had access to 128GB of RAM, enabling us to load all the required SQL tables into memory using Pandas and eliminating the need to query SQL tables stored on disk.
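As a simplified sketch, the two clean-up steps above boil down to something like this, assuming the redirect and canonical tables have already been loaded into pandas DataFrames indexed by the "old"/non-canonical MBID (the column names here are illustrative):

import pandas as pd

def clean_recording_mbid(mbid, gid_redirects, canonical_redirects):
    # gid_redirects: DataFrame indexed by old recording MBID, with an
    # (illustrative) "new_mbid" column.
    # canonical_redirects: DataFrame indexed by non-canonical recording MBID,
    # with an (illustrative) "canonical_mbid" column.

    # 1. Follow the redirect if this MBID redirects elsewhere.
    if mbid in gid_redirects.index:
        mbid = gid_redirects.at[mbid, "new_mbid"]
    # 2. Swap in the canonical MBID if this one is non-canonical.
    if mbid in canonical_redirects.index:
        mbid = canonical_redirects.at[mbid, "canonical_mbid"]
    return mbid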

5. Checking for track MBIDs being mistaken for recording MBIDs

We suspected that some of the unknown recording MBIDs in the dataset could actually be track MBIDs disguised as recording MBIDs due to errors in mapping. While exploring the demographics on a test sample set of 381k unique recording MBIDs, we found no unknown recording MBID that confirmed this suspicion. To verify it across the whole dataset, we ran the tests on ALL recording MBIDs in the MLHD. To hit two birds in one iteration, we also re-compressed every file in the MLHD from GZIP to the more modern ZStandard compression, since GZIP read/write times were a huge bottleneck while the files cost us 671GB in storage space. This process resulted in:

  • The conversion of all 594,410 MLHD files from GZIP compression to ZSTD compression in 83.1 hours.
  • The dataset being reduced from 571 GB -> 268 GB in size. (53.75% Improvement!)
  • File Write Speed: 17.46% improvement.
  • File Read Speed: 39.25% deterioration.
  • Confirmed the fact that no track MBID existed in the recording MBID column of the MLHD.

Optimizations

1. Dumping essential lookup tables from the MusicBrainz database to parquet.

We used the following tables from the MusicBrainz database in the main cleanup script to query MBIDs:

  1. recording: Look up recording names using recording MBIDs, and get a list of canonical recording MBIDs for lookups.
  2. recording_gid_redirect: Look up valid redirects for recording MBIDs, using redirectable recording MBIDs as the index.
  3. mapping.canonical_recording_redirect: Look up canonical recording MBIDs, using non-canonical recording MBIDs as the index.
  4. mapping.canonical_musicbrainz_data: Look up artist MBIDs and release MBIDs, using recording MBIDs as the index.

In our earlier test scripts, we mapped SQL queries over the recording MBID column to fetch outputs. This resulted in ridiculously slow lookups where a lot of time was wasted in I/O. We decided to pre-load the tables into memory using pandas.read_sql(), which added some constant time at the beginning of the script but reduced the lookup timings from dozens of seconds to milliseconds. The pandas documentation recommends using a SQLAlchemy connectable to fetch SQL tables into pandas; however, we noticed that pandas.read_sql() with a psycopg2 connection was 80% faster than with a SQLAlchemy connection, even though pandas officially doesn't recommend using psycopg2 at all. Fetching the same tables from the database again and again was still slow, so we decided to dump all the required SQL tables to parquet, yielding a further 33% improvement in loading time.
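Here's a rough sketch of the pre-loading and parquet-dumping idea; the connection details and file paths are placeholders:

import pandas as pd
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect(dbname="musicbrainz_db", user="musicbrainz",
                        host="localhost")

# Load a whole lookup table into memory once, instead of querying per row.
redirects = pd.read_sql("SELECT * FROM recording_gid_redirect", conn)

# Dump it to parquet so later runs can skip the database entirely.
redirects.to_parquet("recording_gid_redirect.parquet")

# ...and on subsequent runs:
redirects = pd.read_parquet("recording_gid_redirect.parquet")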

2. Migrating CSV reading/writing from pandas (pandas.read_csv() / pandas.to_csv()) to pyarrow.csv

We started off by using custom functions based on pandas.read_csv() to read CSV files and preprocess them (rename columns, drop rows as required, concatenate multiple files if specified, etc.). Similarly, we used pandas.to_csv() to write the files. However, we soon discovered that these functions were unnecessarily slow, and a HUGE bottleneck for processing the dataset. We were able to optimize the custom functions by leveraging pandas’ built-in vectorized functions instead of relying on for loops to pre-process dataframes once loaded. This brought down the time required to load test dataframes significantly.

pandas.read_csv() and pandas.to_csv() on their own are super convenient but aren't particularly performant, especially when they also need to compress/decompress files while reading/writing, since pandas' reading/writing functions come with a ton of extra bells and whistles. So we started writing our own barebones CSV reader/writer with NumPy. It turns out this method was far slower than the built-in pandas methods! We then tried vectorizing our custom barebones CSV reader using Numba, an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code. However, this method too failed, for various reasons (mostly my own inexperience with Numba). Finally, we tried pyarrow, a library that provides Python APIs for the functionality provided by the Arrow C++ libraries, including but not limited to reading, writing, and compressing CSV files. This was a MASSIVE success, giving an 86.11% improvement in writing speeds and a 30.61% improvement in reading speeds, even while writing DataFrames back as CSV with ZSTD level 10 compression!
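Here is a minimal sketch of the pyarrow-based reading/writing path, assuming ZSTD-compressed, tab-separated files; the file names and column names are illustrative:

import pyarrow as pa
from pyarrow import csv

cols = ["timestamp", "artist_MBID", "release_MBID", "recording_MBID"]

# Read a ZSTD-compressed, tab-separated file into an Arrow table.
with pa.input_stream("input.txt.zst", compression="zstd") as source:
    table = csv.read_csv(
        source,
        read_options=csv.ReadOptions(column_names=cols),
        parse_options=csv.ParseOptions(delimiter="\t"),
    )

df = table.to_pandas()   # process with pandas as usual
# ... cleanup steps go here ...

# Write the result back out, again with ZSTD compression.
with pa.output_stream("output.txt.zst", compression="zstd") as sink:
    csv.write_csv(pa.Table.from_pandas(df), sink)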

3. Pandas optimizations

In pandas, there are often multiple ways to do the same thing, and some of them are much faster than others due to their implementation. We realized a bit too late that pandas isn't that good for processing big data in the first place! But I think we were pretty successful with our optimizations and made the best of pandas anyway. Here are some neat pandas optimizations that we made along the way.

pd.DataFrame.loc[] returns a whole row (a vector of values), but pd.DataFrame.at[] only returns a single value (a scalar).

Intuitively, pd.DataFrame.loc[] should be faster at searching and returning a tuple of values than pd.DataFrame.at[], since the latter requires multiple calls per iteration to fetch multiple values for a single query, whereas the former doesn't. However, for our use case, running pd.DataFrame.at[] twice per iteration to fetch two values was still ~55x faster than running pd.DataFrame.loc[] once to fetch the complete row!
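A tiny self-contained benchmark along these lines, on made-up data (exact timings will of course differ from the MLHD numbers above):

import time
import uuid
import pandas as pd

# Build a made-up lookup table of 100k rows indexed by random keys.
n = 100_000
keys = [str(uuid.uuid4()) for _ in range(n)]
lookup = pd.DataFrame(
    {"artist_mbid": [str(uuid.uuid4()) for _ in range(n)],
     "release_mbid": [str(uuid.uuid4()) for _ in range(n)]},
    index=keys,
)
sample = keys[:10_000]

start = time.perf_counter()
rows = [lookup.loc[k] for k in sample]            # whole row per lookup
print("loc:", time.perf_counter() - start)

start = time.perf_counter()
pairs = [(lookup.at[k, "artist_mbid"], lookup.at[k, "release_mbid"])
         for k in sample]                          # two scalar lookups per key
print("at: ", time.perf_counter() - start)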

Some of the most crucial features that pandas offers are vectorized functions.
Vectorization in our context refers to the ability to work on a set of values in parallel. Vectorized functions simply do a LOT more work in a single loop, enabling them to produce results way faster than typical for-loops that operate on a single value per iteration. In pandas, these vectorized functions can speed up operations by as much as 1000x! For the MLHD, we fetched artist MBIDs and release MBIDs (based on a recording MBID) as a tuple representing a pair of MBIDs. This meant one tuple per recording MBID, leaving us with a series of tuples that we needed to split into two different series. The simplest solution would be to use tuple unpacking with Python's built-in zip function, as follows:

artist_MBIDs, release_MBIDs = zip(*series_of_tuples)

For our particular case, we also had to add an extra step of mapping a "clean up" function over the whole series before unzipping it. The mapping step was a serious bottleneck, so we had to find something better. In the end, we were able to significantly speed up the above process by avoiding apply/map functions completely and cleverly utilizing existing vectorized functions instead. The details of the solution can be found at: quickHow: how to split a column of tuples into a pandas dataframe (github.com)
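For illustration, one common vectorized pattern for splitting such a series without map/apply looks like this (not necessarily the exact code from the linked gist):

import pandas as pd

series_of_tuples = pd.Series([("artist-1", "release-1"),
                              ("artist-2", "release-2"),
                              ("artist-3", "release-3")])

# Build a two-column frame straight from the tuples, then pull the columns out.
split = pd.DataFrame(series_of_tuples.tolist(),
                     index=series_of_tuples.index,
                     columns=["artist_MBID", "release_MBID"])
artist_MBIDs = split["artist_MBID"]
release_MBIDs = split["release_MBID"]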

In our first few iterations, we used pandas.Series.isin() to check if a series of recording MBIDs existed in the “recording” table of the MusicBrainz database or not. Pandas functions in general are very optimized and occasionally written in C/C++/Fortran instead of Python, making them super fast. I assumed the same would be the case with pandas.Series.isin(). My mentor suggested that we use built-in Python sets and plain for-loops for this particular case. My first reaction was “there’s no way our 20 lines of Python code are gonna hold up against pandas. Everyone knows running plain for-loops against vectorized functions is a huge no-no”. But, as we kept on trying, the results completely blew me away! Here’s what we tried:

  1. Convert the index (a list) of the “recording” table from the MusicBrainz database to a Python Set.
  2. Iterate over all the recording MBIDs in the dataset using a for-loop, and check if MBIDs exist in the set using Python’s “in” keyword.

For 4,738,139 rows of data, pandas.Series.isin() took 13.1s to process. The sets + for-loop method took 1.03s! (with an additional, one-time cost of 6s to convert the index list into a Set). The magic here was done by converting the index of the “recording” table into a Python Set, which essentially puts all the values in a hashmap (which only took a constant 6 seconds at the start of the script).

A hashmap meant reducing the time complexity of each membership check to O(1). On the other hand, pandas.Series.isin() was struggling with at least O(n) time complexity, given that it's essentially a list-search algorithm working on unordered items. This arrangement meant only a one-time cost of converting the index to a Python Set at the start of the script, and constant O(1) time per lookup while looping through and searching for items.
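The shape of that comparison, on stand-in data (the timings quoted above are from the real MLHD run):

import pandas as pd

# Stand-ins for the real data: the MusicBrainz "recording" table index and a
# column of recording MBIDs from the MLHD.
recording_index = pd.Index([f"mbid-{i}" for i in range(1_000_000)])
mlhd_recordings = pd.Series([f"mbid-{i}" for i in range(0, 2_000_000, 2)])

# One-time cost: turn the index into a set (a hash table).
known = set(recording_index)

# Set + for-loop: O(1) per membership check.
mask_set = [mbid in known for mbid in mlhd_recordings]

# The pandas call we moved away from.
mask_isin = mlhd_recordings.isin(recording_index)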

Final Run

As of October 20, 2022, we've finally started the run over all 594,410 MLHD files to process and re-write ~27 billion rows of data. The output of a test performed on the first 15 files from the MLHD, summing up to 952,229 rows of data, is as follows:

Here, the cleanup process involves: Fetching redirects for recording MBIDs; Fetching canonical MBIDs for recording MBIDs; Fetching artist credit lists and release MBIDs based on recording MBIDs; and Mapping the newly fetched artist credit lists and release MBIDs to their respective recording MBIDs.

The above process is completely recording-MBID-oriented in order to maintain quality and consistency. This means completely wiping the data in the artist_MBID and release_MBID columns and replacing it with reliable data fetched from the MusicBrainz database. It also means that the process will bring a significant change in the demographics of the various entities (artist MBIDs, release MBIDs, and recording MBIDs) in the final version of the dataset.

Even though the impact of changing demographics varies from file to file (depending on the user’s tendency to listen to certain recordings repeatedly), here are some statistics based on the first 15 files in the MLHD, before and after processing:

For a complete test set with 952,229 input rows, the shrinkage is as follows:
Given an input of 952,229 rows, the original MLHD shrinks to 789,788 rows after dropping rows with empty recording MBID values (17.06% shrinkage). After processing, the same input shrinks to 787,690 rows after dropping rows with empty recording MBID values (17.28% shrinkage). For a fair comparison, let's drop all rows with empty recording MBID values from both the original and the processed dataset. This gives us 789,788 rows in the original dataset and 787,690 rows in the processed dataset. The absolute shrinkage between the original and processed dataset is:

Abs_shrinkage = ((789788 - 787690) / 789788) * 100 = 0.27%

Therefore, the cleaning process only resulted in a shrinkage of 0.27% of the existing recording MBIDs in the MLHD! Note that this is also in line with our earlier estimate that ~0.301% of all recording MBIDs were completely unknown. As per the original MLHD research paper, about ~65% of the MLHD contains at least a recording MBID. We might have the option to drop the remaining 35% of the dataset or keep the data without recording MBIDs as-is. Out of the 65% of the MLHD with recording MBIDs, ~0.301% of the recording MBIDs would have to be dropped (since they're unknown). This leaves us with: 27bn – (35% of 27bn) – (0.3% of 65% of 27bn) = 12.285bn rows of super high quality data!

Now similarly, let’s compare the row count shrinkages for different columns.

  1. The count of non-empty recording MBIDs SHRANK by 0.27%.
  2. The count of non-empty release MBIDs EXPANDED by 14.08%.
  3. The count of non-empty artist MBIDs SHRANK by 13.36%.

Given an average processing time of 0.2168s per 10,000 rows, we estimate the time needed to process the entire dataset at 27,000,000,000 / 10,000 * 0.2168 / 3600 / 24 = 6.775 days, or about 162 hours.

Primary Outcomes

  1. The MLHD is currently set to be processed with an ETA of ~7 days of processing time.
  2. I was able to generate various reports to explore the impact of the "artist conflation issue" in the MLHD. These extra insights and reports uncovered a few issues within the MusicBrainz ID Mapping lookup algorithm, which lucifer fixed in: Fix invalid entries in canonical_recording_redirect table by amCap1712 · Pull Request #2133 · metabrainz/listenbrainz-server (github.com)

Miscellaneous Outcome

How I got picked as a GSoC candidate without a single OSS PR to my name beforehand is beyond me, but with the help of alastairp and lucifer, I was able to solve and merge PRs for 2 issues in the listenbrainz-server as an exercise to get to know the ListenBrainz codebase a little better.

My Experience

This journey has been nothing but amazing! The sheer amount of things that I learned during these past 18 weeks is ridiculous. I really appreciate the fun people and work culture here, which was even more apparent during the MetaBrainz Summit 2022 where I had the pleasure to see the whole team in action on a live stream.

Coming from a Music Tech background and having extensively used MetaBrainz products in the past, it was a surreal experience to work with these super smart engineers who have built technologies I could only dream of making. I often found myself admiring my seniors as well as my peers for their ability to come up with pragmatic solutions with veritable speed and accuracy, especially lucifer, whose work ethic inspired me the most! I hope some of these qualities eventually rub off on me too 🙂

I'd really like to take a moment to appreciate my mentor, alastairp, for always being super supportive and precise in his advice, and for helping me push the boundaries whenever possible. I'd also like to thank him for being very considerate, even through times when I've been unpredictable and unreliable, and, not to mention, for giving me the opportunity to work with him in the first place!

Suggestions for aspiring GSoC candidates

  • Be early.
  • Ask a lot of questions, but do your due diligence by exploring as much as possible on your own as well.
  • OSS codebases might seem ginormous to most students with limited work experience. Ingest the problem statement bit by bit, and slowly work your way toward potential solutions with your mentor.
  • Believe in yourself! It’s not a mission impossible. You always miss the shots that you don’t take.

You can contact me at:
IRC: “Pratha-Fish” on #Metabrainz IRC channel
LinkedIn: https://www.linkedin.com/in/prathamesh-ghatole
Twitter: https://twitter.com/PrathameshG69 
GitHub: https://github.com/Prathamesh-Ghatole

AcousticBrainz: Making a hard decision to end the project

We created AcousticBrainz 7 years ago and started to collect data, with the goal of using that data down the road once we had collected enough. We finally got around to doing this recently, and realised that the data simply isn't of high enough quality to be useful for much at all.

We spent quite a bit of time trying to brainstorm on how to remedy this, but all of the solutions we found require a significant amount of money for both new developers and new hardware. We lack the resources to commit to properly rebooting AcousticBrainz, so we’ve taken the hard decision to end the project.

Read on for an explanation of why we decided to do this, how we will do it, and what we’re planning to do in the future.

Why we’re doing this

When we launched AcousticBrainz, we had a few goals which we wanted to achieve with the project and the collected data:

  • Generate a list of musical characteristics of audio recordings, such as musical key and tempo (BPM).
  • Use the extracted data to automatically predict other musical characteristics such as instrumentation, genre, or mood of the music based on the current state of the art algorithms and models for music classification.
  • Provide a source of mathematical features extracted from audio which other people could use to build their own models to predict other musical characteristics

Unfortunately, a number of things happened with the data that we collected which made us decide that the data isn't as useful as we had hoped:

  • The musical key data that we were generating was accurate on some styles of music, but not on the full range of music that we collected in AcousticBrainz. The BPM tools work well on a wide range of music, but there are many recordings for which the predicted value is incorrect. The data that is generated by these algorithms is unable to indicate a confidence level of the predicted value, and so we are unable to determine which data we can trust.
  • Early on in the release of the AcousticBrainz data we determined that the existing models that we had for categories such as genre didn't work very well; however, further experiments that we performed to build new models showed that it was difficult to get good results that covered the full range of content in the database.
  • Right about the time that we released the AcousticBrainz data extractor, Deep Learning techniques for performing this kind of prediction started to become more prevalent. Unfortunately, the resolution of the data that we collect in AcousticBrainz is not enough to be used in this type of machine learning, and so we were unable to try these new techniques using the data that we had available in the database. The type of data that we made available meant that researchers and others who were working on this kind of task were not as interested in the data as we had hoped.
  • We spent some time introducing content-based similarity to AcousticBrainz, but when we used this data ourselves for generating similar / recommended recordings, it didn’t give good results.

Unfortunately, within the MetaBrainz team we don’t have the resources and developer availability to perform this kind of research ourselves, and so we rely on the assistance of other researchers and volunteers to help us integrate new tools into AcousticBrainz, which is a relationship that we haven’t managed to build over the last few years.

What we’re going to do next

Based on the current state of the data in AcousticBrainz, we don't want to keep promoting it as an accurate representation of the music that has been analysed; therefore, we have decided to stop collecting data.

In the next month or so we will stop accepting new data submissions to AcousticBrainz. We’ll remove downloads for the submission tools, and modify the AcousticBrainz API to stop accepting new submissions. The rest of the API and other tools in the site will continue to work as before.

We’ll make a full dump of all data available in AcousticBrainz, so that if anyone wants to download and use it themselves, they will be able to do so. In early 2023 we will shut down the AcousticBrainz site.

What we’re planning to do in the future

Part of the initial goal of AcousticBrainz was to provide a way to characterise and organise the recordings that are in the MusicBrainz database. This is still something that we’re interested in collecting, and we have some ideas about how to integrate this into other MetaBrainz projects. We have a few current ideas about how we want to go about this:

  • Focus on user-provided tagging for music characteristics such as genre and mood/emotion. We have a good base for storing this in MusicBrainz, and plan to integrate new functionality into ListenBrainz to encourage the MetaBrainz community to help add more data. This data will be used in the new recommendation systems that we are starting to build into ListenBrainz.
  • Use some improved tools to compute specific musical characteristics. We have been reviewing some of the recent work in tempo estimation and are looking to see how we can integrate it with tools such as Picard so that we can allow people to compute these features if they need them, and help us confirm that the computed data is correct.

Importantly, this doesn’t mean that we are not interested in generating tools for music recommendation. On the contrary, our recent work has shown us that the data that we already have in ListenBrainz (user listening history), and data in MusicBrainz (metadata, relationships, links, and tags) give great results for the recommendations that we have started to build, and so we want to focus on improving and using this data going forward. Also, focusing only on one project, rather than two will actually allow us to reach these goals sooner.


Please leave a comment if you have any questions!

Acoustic similarity in AcousticBrainz

We’re pleased to announce that we have just released acoustic similarity in AcousticBrainz. Acoustic similarity is a technique to automatically identify which recordings sound similar to other recordings, using only the recordings themselves, and not any additional metadata. This feature is available via the AcousticBrainz API and the AcousticBrainz website, from any recording page. General documentation on acoustic similarity is available at https://acousticbrainz.readthedocs.io/similarity.html.

This feature is based on work started by Philip Tovstogan at the Music Technology Group, the research group that provides the essentia feature extractor that powers AcousticBrainz. The work was continued by Aidan Lawford-Wickham during Summer of Code 2019. Thanks Philip and Aidan for your work!

From the recording view on AcousticBrainz, you can choose to see similar recordings and choose which similarity metric you want to use. Then, a list of recordings similar to the initial recording will be shown.

These metrics are based on different musical features that the AcousticBrainz feature extractor identifies in the audio file. Some of these features are related to timbral characteristics (generally, what something sounds like), rhythmic characteristics (related to tempo or perceived pulses), or AcousticBrainz's high-level features (hybrid features that use our machine learning system to identify characteristics such as genre, mood, or instrumentation).

One thing that we can immediately see in these results is that the same recording appears many times. This is because AcousticBrainz stores multiple different submissions for the same MBID, and will sometimes get submissions for the same recording with different MBIDs if the data in MusicBrainz is like this. This is actually really interesting! It shows us that we are successfully identifying two different submissions in AcousticBrainz as being the same using only acoustic information and no metadata. Using the API you can ask to remove these duplicated MBIDs from the results, and we have some future plans to use MusicBrainz metadata to filter more of these results when needed.

What’s next?

We haven’t yet performed a thorough evaluation of the quality of these similarity results. We’d like people to use them and give us feedback on what they think. In the future we may look at performing some user studies in order to see if some specific features tend to give results that people consider “more” similar than others. AcousticBrainz has a number of additional features in our database, and we’d like to experiment with these to see if they can be used as similarity metrics as well.

The fact that we can identify the same recording as being similar even when the MusicBrainz ID is different is interesting. It could be useful to use this similarity to identify when two recordings could be merged in MusicBrainz.

The data files used for this similarity are stand-alone, and can be used without additional data from AcousticBrainz or MusicBrainz. We’re looking at ways that we can make these data files downloadable so that developers can use them without having to query the AcousticBrainz API. If you think that you might be interested in this, let us know!

Playlists and personalised recommendations in ListenBrainz

Just in time for Christmas we are pleased to announce a new feature in our most recent release of ListenBrainz, the ability to create and share your own playlists! We created two playlists for each user who used ListenBrainz containing music that you listened to in 2020. Check out your lists at https://listenbrainz.org/my/recommendations. Read on for more details…

With our continuing work on using data in ListenBrainz to generate recommendations, we realised that we needed a place to store lists of music. That sounded like playlists to us, so we added them to ListenBrainz. As always, we did this work in the public ListenBrainz repository. You can now create your own playlists with the web interface or by using the API. Recordings in playlists map to MusicBrainz identifiers. If you’re trying to add something and can’t find it, make sure that it’s in MusicBrainz first.

Once you have a playlist, you can listen to it using our built-in BrainzPlayer, or export it to Spotify if you have an account there. If you have already linked your Spotify account to ListenBrainz you may have to re-authenticate and give us permission to create playlists on your behalf. Playlists can also be exported in the open JSPF format using the ListenBrainz API.

Over the last year we've started thinking about how to use data in MetaBrainz projects to generate recommendations of new music for people to listen to. For this reason, we started the Troi recommendation framework. This Python package allows developers to build pipelines that take data from different sources and combine it in order to generate recommendations of music to listen to. We have already developed data sources using MusicBrainz, ListenBrainz, and AcousticBrainz. If you are a developer interested in working on recommendations in the context of ListenBrainz we encourage you to check it out.

Now that we can store playlists we needed some content to fill them with. Luckily we have some great projects worked on by students over the last few years as part of MetaBrainz’ participation in the Google Summer of Code project, including this year’s work on statistics and summary information by Ishaan. Using Troi and ListenBrainz statistics, we got to work. Every user who has been contributing data to ListenBrainz recently now has two brand new 2020 playlists based on the top recordings that you listened to in 2020 and the recordings that you first listened to in 2020. If you’re interested in the code behind these playlists, you can see the code for each (top tracks, first tracks) in the troi repository.

If you're a long-time user of ListenBrainz you may be familiar with the problem of matching your listens to content in MusicBrainz to be able to do things with it. We've been working hard on a solution to this problem and have built a new tool using typesense to provide a quick and easy way to search for items in the MusicBrainz database. You are using this tool when you create a playlist using the web interface and search for a recording to add. This is still a tech preview, but in our experience it works really well. Thanks to the team at typesense for helping us with our questions over the last few weeks!

This work is still in its early days. We thought that this was such a great feature that we wanted to get it out in front of you now. We’re happy to take your feedback, or hear if you are having any problems. Open a ticket on our bug tracker, come and talk to us on IRC, or @ us. Did we give you a bad jam? Sorry about that! We’d love to have a conversation about what went well and what didn’t in order to improve our systems. In 2021 we will start generating weekly and daily playlists for users based on your recent listens using our collaborative filtering recommendations system.

Merry Christmas from the whole MetaBrainz team!

AcousticBrainz at the 2018 MetaBrainz Summit

We had an in-person meeting at the MTG during the MetaBrainz summit to discuss the status and future of AcousticBrainz. We came up with a rough outline of things that we want to work on over the next year or so. This is a small list of tasks that we think will have a good impact on the image of AcousticBrainz and encourage people to use our data more.

State of AcousticBrainz

AcousticBrainz has a huge database of submissions (over 10 million now, thanks everyone!), but we are currently not using the wealth of data to our advantage. For the last year we’ve not had a core developer from MetaBrainz or MTG working on existing or new features in AcousticBrainz. However, we now have:

  • Param, who is including AcousticBrainz in his role with MetaBrainz
  • Rashi, who worked on AcousticBrainz for GSoC and is going to continue working with us
  • Philip, who is starting a PhD at MTG, focused on some of the algorithms/data going into AcousticBrainz
  • Alastair, who now has more time to put towards management of the project

Because of this, we’re glad to present an outline of our next tasks for AcousticBrainz:

Short-term

Some small tasks that are quick to finish and we can use to show off uses of the data in AcousticBrainz

Merge Philip’s similarity, including an API endpoint

Philip's master's thesis project from last year uses PostgreSQL search to find acoustically similar recordings to a target recording. This uses the features in AcousticBrainz. We need to ensure that PostgreSQL can handle the scale of data that we have.

An extension of this work is to use the similarity to allow us to remove bad duplicate submissions (we can take all recordings with the same MBID and see if they are similar to each other; if one is not similar, we can assume that it's not actually the same as the other duplicates, and mark it as bad). We want to make these results available via an API too, so that others can check this information as well.

Merge Existing PRs

We have many great PRs from various people which Alastair didn’t merge over the last year. We’re going to spend some time getting these patches merged to show that we’re open to contributions!

Publish our Existing models

In research at the MTG we've come up with a few more detailed genre models based on tag/genre data that we've collected from a number of sources. We believe that these models can be more useful than the current genre models that we have. The AcousticBrainz infrastructure supports adding new models easily, so we should spend some time integrating these. There are a few tasks that need to be done to make sure that these work:

  • Ensure that high-level dumps will dump this new data (If we have an existing high-level dump we need to make a new one including the new data)
  • Ensure that we compute high-level data for all old submissions (we currently don't have a system to go back and compute high-level data for old submissions with a new model; the high-level extractor has to be improved to support this)

Update/fix some pages

We have a number of issues reported about unclear text on some pages and grammar that we can improve. Especially important are:

  • API description (we should remove the documentation from the main website and just have a link to the ReadTheDocs page)
  • Front page (Show off what we have in the project in more detail, instead of just a wall of text)
  • Data page (instead of just showing tables of data, try and work out a better way of presenting the information that we have)

Fix Picard plugin

When AB was down during our migration we were serving HTML from our API pages, which caused Picard to crash if the AB plugin was enabled while trying to get AB data. This should be an easy fix in the Picard plugin.

High Impact

These are tasks that we want to complete first, that we know will have a high impact on the quality of the data that we produce.

Frame-level data

We want to extract and store more detailed information about our recordings. This relies on work being done at the MTG to develop a new extractor that will allow us to get more detailed information. It will also give us other improvements to data that we have in AB that we know is bad. This data is much bigger than our current data when stored in JSON (hundreds of times larger), so we need to develop a more efficient way of storing submissions. This could involve storing the data in a well-known binary data exchange format. A bunch of subtasks for this project:

  • Finish the essentia extractor software
  • Decide on how to store items on the server (file format, store on disk instead of database)
  • Work out a way to deal with features from two versions of the extractor (do we keep accepting old data? What happens if someone requests data for a recording for which we have the old extractor data but not the new one?)
  • Upgrade clients to support this (Change to HTTPS, change to the new API URL structure, ensure that clients check before submission if they’re the latest version, work out how to compress data or perform a duplicate check before submission)
  • Deduplication (If we have much larger data files, don’t bother storing 200 copies for a single Beatles song if we find that we already have 5-10 submissions that are all the same)

MusicBrainz Metadata

Rashi’s GSoC project in 2018 helped us to replicate parts of the MusicBrainz database into AcousticBrainz. This allows us to do amazing things like keep up-to-date information about MBID redirects, and do search/browse/filtering of data based on relationships such as Artists just by making a simple database query. We want to merge this work and start using it.

Dumps

When we changed the database architecture of AcousticBrainz in 2015 we stopped making data dumps, making people rely on using the API to retrieve data. This is not scalable, and many people have asked for this data. We want to fix all of the outstanding issues that we’ve found in the current dumps system and start producing periodic dumps for people to download.

Build more models

In addition to the existing models that we’ve already built (see above, “Publish our Existing models”), we have been collecting a lot of metadata that we could use to make even more high-level models which we think will have a value in the community. Build these models and publicly release them, using our current machine learning framework.

Wishlist

These are tasks that we want to complete that will show off the data that we have in AcousticBrainz and allow us to do more things with the data, but should come after the high-impact tasks.

Expose AB data on MusicBrainz

As part of the process to cross-pollinate the brainz’s, we want to be able to show a small subset of AB data that we trust on the MB website. This could include information such as BPM, Key, and results from some of our high-level models.

Improve music playback

On the detail page for recordings we currently have a simple YouTube player which tries to find a recording by doing text search. We want to improve the reliability and functionality of this player to include other playback services and take advantage of metadata that we already have in the MusicBrainz database.

Scikit-learn models

The future of machine learning is moving towards deep learning, and our current high-level infrastructure written in the custom Gaia project by MTG is preventing us from integrating improved machine learning algorithms to the data that we have. We would like to rewrite the training/evaluation process using scikit-learn, which is a well known Python library for general machine learning tasks. This will make it easier for us to take advantage of improvements in machine learning, and also make our environment more approachable to people outside the MusicBrainz community.

Dataset editor improvements

Part of the high-level/machine learning process involves making datasets that can be used to train models. We have a basic tool for building datasets, however it is difficult to use for making large datasets. We should look into ways of making this tool more useful for people who want to contribute datasets to AcousticBrainz.

Search

With the integration of the MusicBrainz database into AcousticBrainz, we will be able to let people search for metadata related to items which we know only exist in AcousticBrainz. We think that this is a good way for people to explore the data, and also for people to make new datasets (see above). We also want to provide a way that lets people search for feature data in the database (e.g. “all recordings in the key of Am, between 100 and 110BPM”).

API updates

As part of the 2018 MetaBrainz summit we decided to unify the structure of the APIs, including root path and versioning. We should make AcousticBrainz follow this common plan, while also supporting clients who still access the current API.

We should become more in-line with the MetaBrainz policy of API access, including user-agent reporting, rate limiting, and API key use.

Request specific data

Many services who use the API only need a very small bit of information from a specific recording, and so it’s often not efficient to return the entire low-level or high-level JSON document. It would be nice for clients to be able to request a specific field(s) for a recording. This ties in with the “Expose AcousticBrainz data on MusicBrainz” task above.

Everything else

Fix all our bugs and make AcousticBrainz an amazing open tool for MIR research.


Thanks for reading! If you have any ideas or requests for us to work on next please leave a comment here or on the forums.

Announcing python-musicbrainzngs, release 0.6

From the better late than never department…

After more than 2 years we've finally released version 0.6 of python-musicbrainzngs, a library for accessing the MusicBrainz web service from Python.

After such a long time we have perhaps too many new changes to describe. Some major changes include:

  • Better handling of authentication for private user collections
  • Support for loading all types of user collections (artist, event, place, recording, release, work)
  • Work attributes
  • Support for the Cover Art Archive
  • Support for Events, Instruments, Places, and Series

And numerous other bug fixes and small changes. See the CHANGES file for more information.

This release contains contributions by Alastair Porter, Corey Farwell, Ian McEwen, Jérémie Detrey, Johannes Dewender, Pavan Chander, Rui Gonçalves, Ryan Helinski, Shadab Zafar, and Wieland Hoffmann. Thank you everyone!

 

The new version can be downloaded from GitHub or PyPI, or installed with pip.

AcousticBrainz Update

It’s been over a year since we last posted about AcousticBrainz, but a lot of work has been going on in the background. This post will give an overview about some of the things that we’ve achieved in the last year.

Data contributions

Our last blog post was neatly titled “What do 650,000 audio files look like, anyway?” Back then, we thought that this was a lot of submissions. Little did we know… I’m glad to report that we now have over 3.5 million submissions, of which almost 2 million are for unique MBIDs. This is a great contribution and we’d like to thank everyone who submitted data to us.

Dataset and model building

MusicBrainz coder Gentlecat returned to participate in Google Summer of Code last year and developed a new tool to let us create datasets and create new computational models. We’re really excited about how this can allow community members to help us increase the quality of the semantic information we provide in AcousticBrainz. We will make another blog post soon explaining how it works.

We presented an academic overview of AcousticBrainz (PDF) at the 16th International Society for Music Information Retrieval (ISMIR) conference in Malaga, Spain. The feedback from the academic community was very encouraging. Many people were interested in the data and wanted to know what they could do with it. We hope that there will be some new projects announced using the data at this year’s conference.

Integration with other data sources

MusicBrainz and AcousticBrainz don’t exist in a vacuum. One important thing that we need to make sure we do is interact with other researchers and products in the same field. To that end, we started AcousticBrainz Labs, a showcase of some of the experiments that we’re working on in AcousticBrainz. The first thing we have published is a mapping between AcousticBrainz and the Million Song Dataset, that we hope people will use to compare these two datasets.

Database upgrades and Data format changes

We’ve just upgraded to PostgreSQL 9.5 (from 9.3), which allows us to use the new jsonb datatype introduced in PostgreSQL 9.4. This change lets us store feature data more efficiently. We also made some changes to the database schema to let us start creating new data from datasets and computation models.

One result of this is that we are creating a new complete data dump, and stopping the old incremental dumps. We are also taking the opportunity to automate this incremental dump process, which is something that a number of people have asked for.

Another change is that the format of the high level JSON data is changing. This is to better reflect some of the complexities that exist in hosting such a large and varied dataset.

Contribute to AcousticBrainz development

We’re always interested in help from other people to contribute data, code, and ideas to AcousticBrainz. Once again, MetaBrainz is participating in Google’s Summer of Code, and AcousticBrainz is a possible project to work on. If you’re not a student you’re still welcome to work with us.

Write to us in a comment, in IRC, or in our new Discourse category and say hi.

What do 650,000 audio files look like, anyway?

Hot on the heels of our release of the first 650,000 feature files as part of the first release of AcousticBrainz, we are presenting some initial findings based on this dataset.

We thank Emilia Gómez (@emiliagogu), an Associate Professor and Senior Researcher at the Music Technology Group at Universitat Pompeu Fabra for doing this analysis and sharing her results with us. All of these results are based on data automatically computed by our Essentia audio analysis system. Nothing was decided by people. Isn’t that cool?

The MTG recently started the AcousticBrainz (http://acousticbrainz.org/) project, in collaboration with MusicBrainz. Data collection started on September 10th, 2014, and since then a total of 656,471 tracks (488,658 unique ones) have been described with Essentia. I have been working with audio descriptors for a while, and I followed the porting of some of my algorithms to Essentia, especially chroma features and key estimation. For that reason, I was curious to have a look at this data. I present here some basic statistics, which I computed with the SPSS statistical software.

WHICH KIND OF MUSICAL GENRES DO WE HAVE IN THE COLLECTION?

In order to characterize this dataset, I first thought about genre. In Essentia, there are four different genre models: one trained on the data by Tzanetakis (2001), another compiled at the MTG (Rosamerica), one from Dortmund, and one trained on a database of electronic music. Far from providing clear information on the kinds of musical genres present, these models seem to be contradictory! For example, with the Tzanetakis model "jazz" seems to be the most estimated genre, while the proportion of jazz excerpts is very small according to the other models.

Genre estimations using the Tzanetakis dataset

Genre estimations using the Rosamerica dataset

Genre estimations using the Dortmund dataset

Genre estimations using the Electronic dataset

So in conclusion, we have a lot of jazz (according to the Tzanetakis dataset), electronic music (according to the Dortmund dataset), ambient (according to the Electronic dataset), and an equal distribution of all genres according to the Rosamerica dataset (which does not include a category for electronic music)… Not very clarifying, then! This is definitely something that we will be looking at in more depth.

WHAT ABOUT MOOD THEN?

For mood characterization, 5 different binary models were trained and computed on the dataset. We observe that there is a larger proportion of non-acoustic, non-aggressive, and electronic music. It is nice to see that most of the music is not happy and not sad! From this and the previous study, I would conclude that there is a tendency towards electronic music in the AcousticBrainz dataset.

Distribution of acoustic and non-acoustic (e.g. electronic) music

How aggressive our dataset is

The amount of electronic music (compare with the acoustic graph above)

…and if the music is happy or not

If we check for genre vs mood interactions, there are some interesting findings. We find that Classical is the most acoustic genre and rock is the least acoustic genre (due to its inclusion of electronic instruments):

How much music in each genre is acoustic or not

HOW IS KEY ESTIMATION WORKING?

From a global statistical analysis, we observe that major and minor modes are both represented, and that the most frequent key is F minor / Ab Major or F# minor / A Major. This seems a little strange; A major and E major are very frequent keys in rock music. Maybe there are some issues with this data that need to be looked at.

The keys and modes of the tracks in the database

IS THERE A LINK BETWEEN FEATURES AND GENRE?

I wanted to make some plots of acoustic features vs. genres. For example, we observe a low loudness level for classical (cla) music and jazz (jaz), and a high one for dance (dan), hip hop (hip), pop, and rock (roc).

The loudness of songs by genre

Finally, it is nice to see the relation between equal-tempered deviation and musical genre. This descriptor measures the deviation of spectral peaks with respect to equal-tempered tuning. It's a very low-level feature but it seems to be related to genre. It is lower for classical music than for other musical genres.

Variation from equal-tempered tuning per genre

We also observe that for electronic music, equal-tempered deviation is higher than for non-electronic/acoustic music. What does this mean? In simple terms, it seems that electronic music tends to ignore the rules of what it means to be "in tune" more than what we might term "more traditional" music.

Variation from equal-tempered tuning for songs reported as electronic/non-electronic

IS THERE A LINK BETWEEN FEATURES AND YEAR?

I was curious to check for historical evolution in some acoustic features. Here are some nice plots of the number of pieces per year, and of some of the most relevant acoustic features. We first observe that most of the pieces belong to the period from the 1990s to the present. This may be an artifact of the people who have submitted data to AcousticBrainz, and also of the data that we find in MusicBrainz. We hope that this distribution will spread out as we get more and more tracks.

Distribution of release year for the dataset. 0 represents an unknown year

There does not seem to be a large change of acoustic features as year changes. This is definitely something to look into further to see if any of the changes are statistically significant.

Are the loudness wars true? Can you see a trend?

Is music getting faster? It doesn't look like it

Songs aren't getting more complex


We have many more ideas of ways to look at this data, and hope that it will show us some interesting things that we may not have guessed from just listening to it. If you would like to see any other statistics, please let us know! You can download the whole dataset to perform your own analysis at http://acousticbrainz.org/download

Announcing the AcousticBrainz project

MetaBrainz and the Music Technology Group at Universitat Pompeu Fabra are pleased to announce the first public release of the AcousticBrainz project.

http://acousticbrainz.org/

What is AcousticBrainz?
The AcousticBrainz project aims to crowdsource acoustic information for all of the music in the world and make it available to the public. The goal of AcousticBrainz is to provide music technology researchers and open source hackers with a massive database of information about music.

AcousticBrainz uses a state of the art research project called Essentia (http://essentia.upf.edu/), developed over the last 10 years at the Music Technology Group.

Data generated from processing audio files with Essentia is collected by the AcousticBrainz project and made available to the public under the CC0 license (public domain). In the 6 weeks since its inception, AcousticBrainz contributors have already submitted data for 650,000 audio tracks using pre-release software.

Today we are releasing client programs to submit data to the AcousticBrainz server and our first public release containing audio features for over 650,000 audio files.

What data does it have?
AcousticBrainz contains information called audio features. This acoustic information describes the acoustic characteristics of music and includes low-level spectral information such as tempo, and additional high level descriptors for genres, moods, keys, scales and much more. These features are explained in more detail at http://acousticbrainz.org/sample-data

How can I get it?
You can access AcousticBrainz data via our API. See details at http://acousticbrainz.org/api
We also provide downloadable dumps of the whole dataset. You can download it (all 13 gigabytes!) at http://acousticbrainz.org/download
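As an illustration, fetching the data for a single recording with Python might look like the sketch below; the /api/v1/<mbid>/low-level path and the JSON keys are assumptions based on the API documentation, so double-check them there:

import requests

mbid = "some-recording-mbid"   # placeholder; use a real MusicBrainz recording MBID
resp = requests.get(
    f"https://acousticbrainz.org/api/v1/{mbid}/low-level", timeout=10)
resp.raise_for_status()
features = resp.json()
print(features["rhythm"]["bpm"])   # key names assumed from the sample data page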

What can I do with it?
We hope that this database will spur the development of new music technology research and allow music hackers to create new and interesting recommendation and music discovery engines. Here are some ideas of things we would like to see:

  • Music discovery
  • Playlist generation
  • Improving the state of the art in genre recognition
  • Analytics on the musical structure of popular music
  • and more!

This is one of the largest datasets of this kind available for research, and the only one of this size that we know of which contains both freely available data as well as the reference source code used to compute the data.

How can I contribute?
If you are a music researcher, you can help us by contributing to the essentia project. Go to the essentia homepage to see how you can do this. If you do something cool with the data let us know. We’d like to start a “made with AcousticBrainz” page where we will showcase interesting projects.

If you have any audio files, we would love for you to contribute audio features to our project. You can do this by downloading our submission clients from http://acousticbrainz.org/download. We provide clients for Windows, Mac, and Linux.

If you find any bugs or errors in the AcousticBrainz stack please let us know! Report issues to http://tickets.musicbrainz.org/browse/AB.

We can’t wait to see what kind of things you will make with our data.

The AcousticBrainz team.