Preparing for Year In Music Report

The ListenBrainz Year In Music Report is approaching, and in order to make the most out of it, we recommend that users who utilize various import methods complete their listen imports by January 2nd.

To provide the most accurate Year In Music reports it is important that we identify the recordings to which you’ve listened. If we can’t identify your listens correctly, your Year in Music reports will also be incorrect. We aim to automatically identify all the listens that come in, but this challenging task isn’t always carried out with 100% success.

In order to give users more control over the linking of their listens, we have introduced a new feature allowing users to directly link their listens to a MusicBrainz recording from the ListenBrainz website. To correct an incorrectly linked listen, navigate to the listens page and select the Link with MusicBrainz option from the dropdown menu next to a listen.


A dialog box will appear, allowing you to paste a link to the desired MusicBrainz recording. Click the Add mapping button to establish a connection between the recording and all listens with the same track and artist name.


Please note that if the recording was recently added to MusicBrainz, it may take up to 4 hours for the link to start working. While this is an initial version of the feature, we have plans to make the process more user-friendly in the future.

Happy Linking!

Cleaning up the Music Listening Histories Dataset

Hi, this is Prathamesh Ghatole (IRC Nick: “Pratha-Fish”), and I am an aspiring Data Engineer from India, currently pursuing my bachelor’s in AI at GHRCEM Pune, and another bachelor’s in Data Science and Applications at IIT Madras. 

I had the pleasure of being mentored by alastairp and the rest of the incredible team at the MetaBrainz Foundation throughout this complicated but super fun project as a GSoC ‘22 contributor! This blog is all about my journey over the past 18 weeks.

In an era where music streaming is the norm, it is no secret that to create modern, more efficient, and personalized music information retrieval systems, the modelling of users is necessary because many features of multimedia content delivery are perceptual and user-dependent. As music information researchers, our community has to be able to observe, investigate, and gather insights from the listening behavior of people in order to develop better, personalized music retrieval systems. Yet, since most media streaming companies know that the data they collect from their customers is very valuable, they usually do not share their datasets. The Music Listening Histories Dataset (MLHD) is the largest-of-its-kind collection of 27 billion music listening events assembled from the listening histories of over 583k last.fm users, involving over 555k unique artists, 900k albums, and 7M tracks. The logs in the dataset are organized in the form of sanitized listening histories per user, where each user has one file, with one log per line. Each log is a quadruple of: 

<timestamp, artist MBID, release-MBID, recording MBID>

The full dataset contains 576 files of about 1GB each. These files are subsequently bundled in sets of 32 TAR files (summing up to ~611.39 GB in size) in order to facilitate convenient downloading.
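To make the log format described above concrete, here is a minimal parsing sketch. It assumes one tab-separated line per listen, a Unix-epoch timestamp, and empty strings for missing MBIDs. Those details, along with the names `ListenLog` and `parse_log_line`, are illustrative assumptions rather than something taken from the MLHD specification.

```python
from typing import NamedTuple, Optional


class ListenLog(NamedTuple):
    """One listening event: <timestamp, artist MBID, release MBID, recording MBID>."""
    timestamp: int
    artist_mbid: Optional[str]
    release_mbid: Optional[str]
    recording_mbid: Optional[str]


def parse_log_line(line: str, sep: str = "\t") -> ListenLog:
    """Parse one log line; empty MBID fields become None."""
    timestamp, artist, release, recording = line.rstrip("\n").split(sep)
    return ListenLog(int(timestamp), artist or None, release or None, recording or None)


# Hypothetical line: only the timestamp and the recording MBID are present.
example = "1325376000\t\t\t681a8006-d912-4867-9077-ca29921f5f7f\n"
log = parse_log_line(example)
print(log)
```

Keeping missing MBIDs as `None` makes it easy to drop or count incomplete rows later on.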

Some salient features of the MLHD:

  • Each entity in every log is linked to a MusicBrainz Identifier (MBID) for easy linkage to other existing sources.
  • All the logs are time-stamped, resulting in a timeline of listening events.
  • The dataset is freely available and is orders of magnitude larger than any other dataset of its kind.
  • All the data is scraped from last.fm, where users publicly self-declare their music listening histories.

What is our goal with this project?

The dataset would be useful for observing many interesting insights like:

  • How long people listen to music in a single session
  • The kinds of related music that people listen to in a single session
  • The relationship between artists and albums and songs
  • What artists do people tend to listen to together?

In its current form, the MLHD is a great dataset in itself, but for our particular use-case, we’d like to make some additions and fix a few issues caused by last.fm’s out-of-date algorithms for matching against the MusicBrainz database. (All issues are discussed in detail in my original GSoC proposal.)

For example:

  1. The artist conflation issue: We found that the artist MBIDs for commonly used names were wrong for many logs, where the artist MBID pointed to incorrect artists with the same name in the MusicBrainz database. E.g. for the song “Devil’s Radio” by “George Harrison” (from the Beatles), the MLHD incorrectly points to an obscure Russian hardcore group named “George Harrison”.
  2. Multiple artist credits: The original MLHD provides only 1 single artist-MBID, even in case of recordings with multiple artists involved. We aim to fix that by providing a complete artist credit list for every recording.
  3. Complete data for every valid recording MBID: We aim to use the MusicBrainz database to fetch accurate artist credit lists and release MBIDs for every valid recording MBID, hence improving the quality and reliability of the dataset.
  4. MBID redirects: 22.7% of the recording MBIDs that we tested (from a test set of 371k unique recording MBIDs) were not suitable for direct use. Of those 22.7%, 98.66% simply redirected to other MBIDs (which were correct too).
  5. Non-Canonical MBIDs: A significant fraction of MBIDs were not canonical MBIDs. In the case of recording MBIDs, a release-group might use multiple valid MBIDs to represent the release, but there’s always a single MBID that is the “most representative” of the release group, known as a “canonical” MBID.

While the existing redirecting and non-canonical MBIDs are technically correct and identical when considered in aggregate, we think replacing these MBIDs with their canonical counterparts would be a nice addition to the existing dataset and aid in better processing. Overall, the goal of this project is to write high-performance Python code that resolves the dataset to an updated version as quickly as possible, in the same format as the original, but with incorrect data rectified & invalid data removed.

Check out the complete codebase for this project at: https://github.com/Prathamesh-Ghatole/MLHD 

The Execution

Personally, I’d classify this project as a Data Science or Data Engineering task involving lots of analytics, exploration, cleanup, and moving goals and paths as a result of back-and-forth feedback from stakeholders. For a novice like me, this project was made possible through many iterations involving trial and error, learning new things, and constantly evolving existing solutions to make them more viable, and in line with the goals. Communication was a critical factor throughout this project, and thanks to Alastair, we were able to communicate effectively on the #Metabrainz IRC channel and keep a log of changes in my task journal, along with weekly Monday meetings to keep up with the community.

Skills/Technologies used

  • Python3, Pandas, iPython Notebooks – For pretty much everything
  • NumPy, Apache Arrow – For optimizations
  • Matplotlib, Plotly – For visualizations
  • PostgreSQL, psycopg2 – For fetching MusicBrainz database tables, quick-and-dirty analytics, etc.
  • Linux – For working with a remote Linux server for processing data.
  • Git – For version control & code sharing.

Preliminary Analysis

1. Checking the demographics for MBIDs

We analyzed 100 random files from the MLHD with 3.6M rows and found the following results. In the 381k unique recording MBIDs, ~22.7% were not readily usable, i.e. they had to be redirected or made canonical. However, of these ~22.7% MBIDs, ~98.66% were correctly redirected to a valid recording MBID using the MusicBrainz database’s “recording” table, implying that only ~0.301% of all UNIQUE recording MBIDs from MLHD were completely unknown (i.e. they didn’t belong to the “recording” table OR have a valid redirect). Similarly, ~5.508% of all UNIQUE artist MBIDs were completely unknown (i.e. they didn’t belong to the “artist” table OR the “artist_gid_redirect” table).

2. Checking for the artist conflation Issue:

There are many artists with exactly the same name. But we were unsure if, for such cases, last.fm’s algorithms matched the correct artist MBID for a recording MBID every time. To verify this, we fetched artist MBIDs for each recording MBID in a test set and compared them to the actual artist MBIDs present in the dataset. Lo and behold, we discovered that ~9.13% of the cases faced this issue in our test set with 376,037 unique cases.

SOLUTION 1

This is how we first tried dealing with the artist conflation issue:

  1. Take a random MLHD file
  2. “Clean up” the existing artist MBIDs and recording MBIDs, and find their canonical MBIDs. (Discussed in detail in the section “Checking for non-canonical & redirectable MBIDs”)
  3. Fetch the respective artist name and recording name for every artist MBID and recording MBID from the MusicBrainz database.
  4. For each row, pass <artist name, recording name> to either of the following MusicBrainz APIs:
    1. https://datasets.listenbrainz.org/mbc-lookup 
    2. https://labs.api.listenbrainz.org/mbid-mapping
  5. Compare artist MBIDs returned by the above API to the existing artist MBIDs in MLHD.
  6. If the existing MBID is different from the one returned by the API, replace it.

However, this method meant making an API call for EACH of the 27bn rows of the dataset. That is 27 billion API calls, where each call would’ve taken at least 0.5s, i.e. 156,250 days just to solve the artist conflation issue. This was in no way feasible, and would’ve taken ages to complete even if we parallelized the complete process with Apache Spark. Even after all this, the output generated by these APIs would’ve been a fuzzy solution at best, and prone to errors.
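The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the 0.5s per call is the optimistic lower bound quoted in the text, and the calls are assumed to run strictly one after another.

```python
ROWS = 27_000_000_000         # ~27 billion listening events in the MLHD
SECONDS_PER_CALL = 0.5        # lower bound per API call, as quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60


def sequential_days(rows: int, seconds_per_call: float) -> float:
    """Days needed if every row triggers one API call, run back to back."""
    return rows * seconds_per_call / SECONDS_PER_DAY


days = sequential_days(ROWS, SECONDS_PER_CALL)
print(f"{days:,.0f} days (~{days / 365:.0f} years)")  # 156,250 days (~428 years)
```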

SOLUTION 2

Finally, we tackled the artist conflation issue by fetching artist credit lists for each valid recording MBID from the MusicBrainz database. This enabled us to perform in-memory computations and completely eliminated the need to make API calls, saving us a lot of processing time. This not only ensured that every artist MBID corresponded to its correct recording MBID 100% of the time, but also:

  • Improved the quality of the provided artist MBIDs by providing a list of artist MBIDs in case of recordings with multiple artists.
  • Increased the count of release MBIDs in the dataset by 10.19%!
    (Test performed on the first 15 files from the MLHD, summing up to 952229 rows of data)

3. A new contender appears! (Fixing the MBID mapper)

While working out “SOLUTION 1” as discussed in the previous section, we processed thousands of rows of data, and compared the outputs by the mbc-lookup API and mbid-mapping API, and discovered that these APIs sometimes returned different outputs when they should have returned the same outputs. This uncovered a fundamental issue in the mbid-mapping API that was actively being used by listenbrainz to link music logs streamed by users to their respective entities in the MusicBrainz database. We spent a while trying to analyze the depth of this issue by generating test logs and reports for both the mapping endpoints and discovered patterns that helped point to some bugs in the matching algorithms written for the API. This new discovery helped lucifer debug the mapper, resulting in the following pull request: Fix invalid entries in canonical_recording_redirect table by amCap1712 · Pull Request #2133 · metabrainz/listenbrainz-server (github.com)

4. Checking for non-canonical & redirectable MBIDs

To use the MusicBrainz database to fetch artist names and recording names w.r.t. their MBIDs, we first had to make sure MBIDs we used to lookup the names were valid, consistent, and “clean”. This was done by:

  1. Checking if an MBID was redirected to some other MBID, and replacing the existing MBID with the MBID it redirected to.
  2. Finding a Canonical MBID for all the recording MBIDs.

We used the MusicBrainz database’s “mapping.canonical_recording_redirect” table to fetch canonical recording MBIDs, and the “recording_gid_redirect” table to check and fetch redirects for all the recording MBIDs. We first tried mapping an SQL query over every row to fetch results, but soon realized it would’ve slowed the complete process down to unbearable levels. Since we were running the processes on “Wolf” (a server at MetaBrainz Inc.), we had access to 128GB of RAM, enabling us to load all the required SQL tables in memory using Pandas, eliminating the need to query SQL tables stored on disk.
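The two clean-up steps above boil down to a pair of in-memory lookups. The sketch below uses plain dictionaries and a set as stand-ins for the tables loaded into RAM. The function name, the argument names, and the placeholder IDs are hypothetical, and the real script worked on pandas data rather than on dicts.

```python
def resolve_recording_mbid(mbid, redirects, canonical, known):
    """Return the canonical recording MBID, or None if the MBID is unknown.

    redirects: maps a redirected MBID to the MBID it now points to
    canonical: maps a non-canonical MBID to its canonical MBID
    known:     set of MBIDs present in the "recording" table
    """
    mbid = redirects.get(mbid, mbid)     # step 1: follow the redirect, if any
    if mbid not in known:
        return None                      # neither a known recording nor a valid redirect
    return canonical.get(mbid, mbid)     # step 2: swap in the canonical MBID


# Toy stand-ins for the in-memory tables (placeholder IDs, not real MBIDs).
known = {"rec-a", "rec-b", "rec-c"}
redirects = {"rec-old": "rec-a"}
canonical = {"rec-a": "rec-b"}

for mbid in ["rec-old", "rec-a", "rec-c", "rec-missing"]:
    print(mbid, "->", resolve_recording_mbid(mbid, redirects, canonical, known))
```

Because every lookup is a hash-table access, the cost per row stays roughly constant no matter how large the tables are, which is what makes the in-memory approach practical.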

5. Checking for track MBIDs being mistaken for recording MBIDs

We suspected that some of the unknown recording MBIDs in the dataset could actually be track MBIDs disguised as recording MBIDs due to some errors in mapping. While exploring the demographics on a test sample set of 381k unique recording MBIDs, we discovered that none of the unknown recording MBIDs confirmed this case. To further verify this problem, we ran the tests on ALL recording MBIDs in the MLHD. To hit 2 birds in one iteration, we also re-compressed every file in the MLHD from GZIP compression to a more modern, ZStandard compression, since GZIP read/write times were a huge bottleneck while costing us 671GB in storage space. This process resulted in:

  • The conversion of all 594410 MLHD files from GZIP compression to ZSTD compression in 83.1 hours.
  • The dataset being reduced from 571 GB -> 268 GB in size. (53.75% Improvement!)
  • File Write Speed: 17.46% improvement.
  • File Read Speed: 39.25% deterioration.
  • Confirmed the fact that no track MBID existed in the recording MBID column of the MLHD.

Optimizations

1. Dumping essential lookup tables from the MusicBrainz database to parquet.

We used the following tables from the MusicBrainz database in the main cleanup script to query MBIDs:

  1. recording: Lookup recording names using recording MBIDs, Get a list of canonical recording MBIDs for lookups.
  2. recording_gid_redirect: Lookup valid redirects for recording MBIDs using redirectable recording MBIDs as index.
  3. mapping.canonical_recording_redirect: Lookup canonical recording MBIDs using non-canonical recording MBIDs as index.
  4. mapping.canonical_musicbrainz_data: Lookup artist MBIDs and release MBIDs using recording MBIDs as index.

In our earlier test scripts, we mapped SQL queries over the recording MBID column to fetch outputs. This resulted in ridiculously slow lookups where a lot of time was being wasted in I/O. We decided to pre-load the tables into memory using pandas.read_sql(), which added some constant time delay at the beginning of the script, but reduced the lookup timings from dozens of seconds to milliseconds. The pandas documentation recommends using an SQLAlchemy connectable to fetch SQL tables into pandas. However, we noticed that pandas.read_sql() with a psycopg2 connection was 80% faster than pandas.read_sql() with an SQLAlchemy connection, even though pandas doesn’t officially recommend using psycopg2 at all. Fetching the same tables from the database again and again was still slow, so we decided to dump all the required SQL tables to parquet, causing a further 33% improvement in loading time.

2. Migrating the CSV reading and writing functions from pandas.read_csv() and pandas.DataFrame.to_csv() to pyarrow.csv.read_csv() and pyarrow.csv.write_csv():

We started off by using custom functions based on pandas.read_csv() to read CSV files and preprocess them (rename columns, drop rows as required, concatenate multiple files if specified, etc.). Similarly, we used pandas.DataFrame.to_csv() to write the files. However, we soon discovered that these functions were unnecessarily slow, and a HUGE bottleneck for processing the dataset. We were able to optimize the custom functions by leveraging pandas’ built-in vectorized functions instead of relying on for loops to pre-process dataframes once loaded. This brought down the time required to load test dataframes significantly.

pandas.read_csv() and pandas.DataFrame.to_csv() on their own are super convenient, but aren’t super performant, especially when you need them to compress/decompress files before reading/writing. Pandas’ reading/writing functions come with a ton of extra bells and whistles, so intuitively, we started writing our own barebones CSV reader/writer with NumPy. Turns out this method was far slower than the built-in pandas methods! We then tried speeding up our custom barebones CSV reader using Numba, an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code. However, this method failed too, for various reasons (mostly due to my own inexperience with Numba). Finally, we tried pyarrow, a library that provides Python APIs for the functionality provided by the Arrow C++ libraries, including but not limited to reading, writing, and compressing CSV files. This was a MASSIVE success, resulting in an 86.11% improvement in writing speed and a 30.61% improvement in reading speed, even while writing back DataFrames as CSV with ZSTD level 10 compression!

3. Pandas optimizations

In pandas, there are often multiple ways to do the same thing, and some of them are much faster than others due to their implementation. We realized a bit too late, that pandas isn’t that good for processing big data in the first place! But I think we were pretty successful with our optimizations and made the best out of pandas too. Here are some neat optimizations that we did along the way in Pandas.

pd.DataFrame.loc[] returns the whole row (a vector of values), but pd.DataFrame.at[] only returns a single value (a scalar).

Intuitively, pd.DataFrame.loc[] should be faster to search and return a tuple of values as compared to pd.DataFrame.at[], since the latter requires multiple lookups per iteration to fetch multiple values for a single query, whereas the former doesn’t. However, for our use case, running pd.DataFrame.at[] 2x per iteration for fetching multiple values was still ~55x faster than running pd.DataFrame.loc[] once for fetching the complete row at once!
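The two access patterns compared above look roughly like this. The table contents and column names are made up for illustration, the index is assumed to be unique, and no timings are claimed here since they depend on the data and the pandas version.

```python
import pandas as pd

# Hypothetical lookup table indexed by recording MBID (placeholder IDs).
lookup = pd.DataFrame(
    {"artist_mbid": ["artist-1", "artist-2"], "release_mbid": ["release-1", "release-2"]},
    index=["rec-a", "rec-b"],
)

# One .loc lookup builds a whole row (a Series) before the values are read.
row = lookup.loc["rec-a"]
via_loc = (row["artist_mbid"], row["release_mbid"])

# Two .at lookups fetch one scalar each, without building the intermediate row.
via_at = (lookup.at["rec-a", "artist_mbid"], lookup.at["rec-a", "release_mbid"])

print(via_loc == via_at)
```

Both variants return the same pair of values; the difference is only in how much intermediate work pandas does per lookup.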

Some of the most crucial features that pandas offers are vectorized functions.
Vectorization in our context refers to the ability to work on a set of values in parallel. Vectorized functions just do a LOT more work in a single loop, enabling them to produce results way faster than typical for-loops that operate on a single value per iteration. In pandas, these vectorized functions can mean speeding up operations by as much as 1000x! For MLHD, we fetched artist MBIDs and release MBIDs (based on a recording MBID) as a tuple representing a pair of MBIDs. This meant a tuple for each recording MBID, leaving us with a series of tuples that we needed to split into two different series. The simplest solution to this issue would be to just use tuple unpacking with Python’s built-in zip function as follows:

artist_MBIDs, release_MBIDs = zip(*series_of_tuples)

For our particular case, we also had to add in an extra step of mapping a “clean up” function to the whole series before unzipping it. The mapping process in the above case was a serious bottleneck, so we had to find something better. In the end, we were able to significantly speed up the above process by avoiding apply/map functions completely, and cleverly utilizing existing vectorized functions instead. The details for the solution can be found at: quickHow: how to split a column of tuples into a pandas dataframe (github.com)
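One common way to split a series of tuples without apply/map is to hand the list of tuples straight to the DataFrame constructor. This is a sketch of that general idea, not necessarily the exact approach from the linked write-up; the data and the column names are placeholders.

```python
import pandas as pd

# Hypothetical series of (artist MBID, release MBID) pairs, indexed by recording MBID.
series_of_tuples = pd.Series(
    [("artist-1", "release-1"), ("artist-2", "release-2")],
    index=["rec-a", "rec-b"],
)

# Build both columns in one constructor call instead of mapping over each row.
split = pd.DataFrame(
    series_of_tuples.tolist(),
    index=series_of_tuples.index,
    columns=["artist_MBID", "release_MBID"],
)
artist_MBIDs, release_MBIDs = split["artist_MBID"], split["release_MBID"]
print(split)
```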

In our first few iterations, we used pandas.Series.isin() to check if a series of recording MBIDs existed in the “recording” table of the MusicBrainz database or not. Pandas functions in general are very optimized and occasionally written in C/C++/Fortran instead of Python, making them super fast. I assumed the same would be the case with pandas.Series.isin(). My mentor suggested that we use built-in Python sets and plain for-loops for this particular case. My first reaction was “there’s no way our 20 lines of Python code are gonna hold up against pandas. Everyone knows running plain for-loops against vectorized functions is a huge no-no”. But, as we kept on trying, the results completely blew me away! Here’s what we tried:

  1. Convert the index (a list) of the “recording” table from the MusicBrainz database to a Python Set.
  2. Iterate over all the recording MBIDs in the dataset using a for-loop, and check if MBIDs exist in the set using Python’s “in” keyword.

For 4,738,139 rows of data, pandas.Series.isin() took 13.1s to process. The sets + for-loop method took 1.03s! (with an additional, one-time cost of 6s to convert the index list into a Set). The magic here was done by converting the index of the “recording” table into a Python Set, which essentially puts all the values in a hashmap (which only took a constant 6 seconds at the start of the script).

A hashmap meant reducing the time complexity for search values to O(1). On the other hand, pandas.Series.isin() was struggling with at least O(n) time complexity, given that it’s essentially a list search algorithm working on unordered items. This arrangement meant only a one-time cost of converting the index to a Python Set at the start of the script, and a constant O(1) time complexity to loop through and search for items.
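The set-based membership check described above can be sketched as follows. The names and the placeholder IDs are illustrative, and the timings quoted in the text are not reproduced here.

```python
def known_mask(mbids, known_mbids):
    """Return one boolean per MBID: True if it exists in the set of known MBIDs."""
    return [mbid in known_mbids for mbid in mbids]


# One-time cost: turn the index of the "recording" table into a set (a hash table).
recording_index = ["rec-a", "rec-b", "rec-c"]
known_mbids = set(recording_index)

# Each membership test is then an average-case O(1) hash lookup.
listens = ["rec-b", "rec-x", "rec-a", "rec-b"]
mask = known_mask(listens, known_mbids)
print(mask)  # [True, False, True, True]
```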

Final Run

As of October 20, 2022 – We’ve finally started testing for all 594410 MLHD files to process and re-write ~27 billion rows of data. The output for a test performed on the first 15 files from the MLHD, summing up to 952229 rows of data is as follows:

Here, the cleanup process involves: Fetching redirects for recording MBIDs; Fetching canonical MBIDs for recording MBIDs; Fetching artist credit lists and release MBIDs based on recording MBIDs; and Mapping the newly fetched artist credit lists and release MBIDs to their respective recording MBIDs.

The above process is completely recording MBID oriented in order to maintain quality and consistency. This means completely wiping off the data in the artist_MBID and release_MBID columns in order to replace them with reliable data fetched from the MusicBrainz database. This also means that the above process will bring a significant change in the demographics of various entities (artist MBIDs, release MBIDs, and recording MBIDs) in the final version of the dataset.

Even though the impact of changing demographics varies from file to file (depending on the user’s tendency to listen to certain recordings repeatedly), here are some statistics based on the first 15 files in the MLHD, before and after processing:

For a complete test set with 952,229 input rows, the shrinkage is as follows:
Given an input of 952,229 rows, the row count of the original MLHD shrinks to 789,788 rows after dropping rows with empty recording MBID values. (17.06% Shrinkage). After processing, given the same input, the row count of the processed MLHD shrinks to 787,690 rows after dropping rows with empty recording MBID values. (17.28% Shrinkage). Now for a fair comparison, let’s first drop all rows with empty recording MBID values from the original, as well as the processed dataset. This gives us 787,690 in the processed dataset and 789,788 in the original dataset. The absolute shrinkage between the original and processed dataset is as follows:

Abs_shrinkage = ((789788 - 787690) / 789788) * 100 = 0.27%

Therefore, the cleaning process only resulted in a shrinkage of 0.27% of the existing recording MBIDs in the MLHD! Note that this stat is also in line with our previous estimate that ~0.301% of all recording MBIDs were completely unknown. As per the original MLHD research paper, ~65% of the MLHD contains at least the recording MBID. We might have the option to drop the remaining 35% of the dataset or keep the data without recording MBIDs as it is. Out of the 65% of the MLHD with recording MBIDs, ~0.301% of the recording MBIDs would have to be dropped (since they’re unknown). This leaves us with: 27bn – (35% of 27bn) – (0.301% of 65% of 27bn) ≈ 17.5bn rows of super high quality data!
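The shrinkage percentages quoted for the 15-file test set can be double-checked with a small helper. The row counts are the ones given in the text; the function name is just for illustration.

```python
def shrinkage_pct(before: int, after: int) -> float:
    """Percentage of rows lost when going from `before` rows to `after` rows."""
    return (before - after) / before * 100


INPUT_ROWS = 952_229           # rows in the first 15 MLHD files
ORIGINAL_NON_EMPTY = 789_788   # original rows with a recording MBID
PROCESSED_NON_EMPTY = 787_690  # processed rows with a recording MBID

print(f"original:  {shrinkage_pct(INPUT_ROWS, ORIGINAL_NON_EMPTY):.2f}%")          # 17.06%
print(f"processed: {shrinkage_pct(INPUT_ROWS, PROCESSED_NON_EMPTY):.2f}%")         # 17.28%
print(f"absolute:  {shrinkage_pct(ORIGINAL_NON_EMPTY, PROCESSED_NON_EMPTY):.2f}%")  # 0.27%
```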

Now similarly, let’s compare the row count shrinkages for different columns.

  1. The count of non-empty recording MBIDs SHRANK by 0.27%.
  2. The count of non-empty release MBIDs EXPANDED by 14.08%.
  3. The count of non-empty artist MBIDs SHRANK by 13.36%.

Given an average processing time of 0.2168s per 10,000 rows, we estimate the time taken to process the entire dataset will be 27,000,000,000 / 10,000 * 0.2168 / 3600 / 24 = 6.775 days, or about 162 hours.
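The same estimate, written out as a short script. It assumes the measured average of 0.2168s per 10,000 rows holds across the whole dataset and that files are processed one after another.

```python
TOTAL_ROWS = 27_000_000_000      # ~27 billion rows in the MLHD
SECONDS_PER_10K_ROWS = 0.2168    # average measured on the test files

total_seconds = TOTAL_ROWS / 10_000 * SECONDS_PER_10K_ROWS
hours = total_seconds / 3600
days = hours / 24
print(f"{hours:.1f} hours, {days:.3f} days")  # 162.6 hours, 6.775 days
```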

Primary Outcomes

  1. The MLHD is currently set to be processed with an ETA of ~7 days of processing time.
  2. I was able to generate various reports to explore the impact of the “artist conflation issue” in the MLHD. These extra insights and reports uncovered a few issues within the MusicBrainz ID Mapping lookup algorithm, which lucifer then fixed in: Fix invalid entries in canonical_recording_redirect table by amCap1712 · Pull Request #2133 · metabrainz/listenbrainz-server (github.com)

Miscellaneous Outcome

How I got picked as a GSoC candidate without a single OSS PR to my name beforehand is beyond me, but with the help of alastairp and lucifer, I was able to solve and merge PRs for 2 issues in the listenbrainz-server as an exercise to get to know the ListenBrainz codebase a little better.

My Experience

This journey has been nothing but amazing! The sheer amount of things that I learned during these past 18 weeks is ridiculous. I really appreciate the fun people and work culture here, which was even more apparent during the MetaBrainz Summit 2022 where I had the pleasure to see the whole team in action on a live stream.

Coming from a Music Tech background and having extensively used MetaBrainz products in the past, it was a surreal experience being able to work with these supersmart engineers who have worked on technologies I could only dream of making. I often found myself admiring my seniors as well as peers for their ability to come up with pragmatic solutions with veritable speed and accuracy, especially lucifer, whose work ethic inspired me the most! I hope some of these qualities eventually rub off on me too 🙂

I’d really like to take time to appreciate my mentor, alastairp for always being super supportive, and precise in his advice, and helping me push the boundaries whenever possible. I’d also like to thank him for being very considerate, even through times when I’ve been super unpredictable and unreliable, and not to mention, giving me an opportunity to work with him in the first place!

Suggestions for aspiring GSoC candidates

  • Be early.
  • Ask a lot of questions, but do your due diligence by exploring as much as possible on your own as well.
  • OSS codebases might seem ginormous to most students with limited work experience. Ingest the problem statement bit by bit, and slowly work your way toward potential solutions with your mentor.
  • Believe in yourself! It’s not a mission impossible. You always miss the shots that you don’t take.

You can contact me at:
IRC: “Pratha-Fish” on #Metabrainz IRC channel
Linkedin: https://www.linkedin.com/in/prathamesh-ghatole
Twitter: https://twitter.com/PrathameshG69 
GitHub: https://github.com/Prathamesh-Ghatole

GSoC’22: Personal Recommendation of a track

Hi Everyone!

I am Shatabarto Bhattacharya (also known as riksucks on IRC and hrik2001 on GitHub). I am an undergraduate student pursuing Information Technology at UIET, Panjab University, Chandigarh, India. This year I participated in Google Summer of Code under MetaBrainz and worked on implementing a feature to send a personal recommendation of a track to multiple followers, along with a write-up. My mentors for this project were Kartik Ohri (lucifer on IRC) and Nicolas Pelletier (monkey on IRC).

Proposal

For readers who are not acquainted with ListenBrainz, it is a website where you can integrate multiple listening services and view and manage all your listens on a unified platform. One can not only peruse their own listening habits, but also find other interesting listeners and interact with them. Moreover, one can even interact with their followers (by sending recommendations or pinning recordings). My GSoC proposal pertained to creating one more such interaction: sending personalized recommendations to your followers.

Community Bonding Period

During the community bonding period, I was finishing up work on a feature to hide events in the user’s feed and correcting the response of the missing MusicBrainz data API. I had also worked on ListenBrainz before GSoC, on small tickets as well as on implementing deletion of events from the feed and displaying missing MusicBrainz data in ListenBrainz.

Coding Period (Before midterm)

While coding, I realized that the schema and the paradigm of storing data in the user_timeline_event that I had suggested in the proposal won’t be feasible. Hence after discussion with lucifer in the IRC, we decided to store recommendations as JSONB metadata with an array of user IDs representing the recommendees. I had to scratch my brain a bit to polish my SQL skills to craft queries, with help and supervision from lucifer. There was also a time where the codebase for the backend that accepted requests from client had a very poorly written validation system, and pydantic wouldn’t work properly when it came to assignment after a pydantic data-structure had been declared. But after planning the whole thing out, the backend and the SQL code came out eloquent and efficient. The PR for that is here.

{
    "metadata": {
        "track_name": "Natkhat",
        "artist_name": "Seedhe Maut",
        "release_name": "न",
        "recording_mbid": "681a8006-d912-4867-9077-ca29921f5f7f",
        "recording_msid": "124a8006-d904-4823-9355-ca235235537e",
        "users": ["lilriksucks", "lilhrik2001", "hrik2001"],
        "blurb_content": "Try out these new people in Indian Hip-Hop! Yay!"
    }
}

Example POST request to the server, for personal recommendation.
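As a rough illustration, a client could assemble a request body of this shape as sketched below. The helper name `build_recommendation` and its validation rules are assumptions made for this example; they are not the checks the ListenBrainz server actually performs, and no endpoint is shown because the post does not name one.

```python
import json


def build_recommendation(track_name, artist_name, users, blurb_content="",
                         recording_mbid=None, recording_msid=None, release_name=None):
    """Assemble a request body shaped like the example above.

    The checks here are illustrative assumptions, not the server's actual rules.
    """
    if not users:
        raise ValueError("at least one recommendee is required")
    if recording_mbid is None and recording_msid is None:
        raise ValueError("a recording MBID or MSID is required")
    metadata = {
        "track_name": track_name,
        "artist_name": artist_name,
        "release_name": release_name,
        "recording_mbid": recording_mbid,
        "recording_msid": recording_msid,
        "users": list(users),
        "blurb_content": blurb_content,
    }
    # Drop optional fields that were not supplied.
    return {"metadata": {k: v for k, v in metadata.items() if v is not None}}


body = build_recommendation(
    track_name="Natkhat",
    artist_name="Seedhe Maut",
    users=["lilriksucks", "hrik2001"],
    blurb_content="Try out these new people in Indian Hip-Hop! Yay!",
    recording_mbid="681a8006-d912-4867-9077-ca29921f5f7f",
)
print(json.dumps(body, ensure_ascii=False, indent=4))
```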

Coding Period (After midterm)

After the midterm, I started working on creating the modal. At first my aim was to create a dropdown menu for search bar using Bootstrap (as most of the code had bootstrap rather than being coded from scratch). But after a while I consulted with lucifer and monkey and went for coding it from scratch. I had also planned to implement traversing search results using arrow keys, but the feature couldn’t be implemented in due time. Here are screenshots of what was created in this PR.

Accessing menu button for personal recommendation

A modal will appear with a dropdown of usernames to choose for personal recommendation

Modal with usernames chosen and a small note written for recommendation

If one has grokked my proposal, they might already notice that the UI/UX of the coded modal is different from the one proposed. This is because while coding it, I realized that the modal needs to not only look pretty but also go well with the design system. Hence the pills were made blue in color (proposed color was green). While I was finishing up coding the view for seeing recommendations in the feed, I realized that the recommender might want to see the people they had recommended. So, I asked lucifer and monkey, if they would like such feature, and both agreed, hence this UI/UX was born:

Peek into the feed page of recommender

What a recommendee sees

Special thanks to CatQuest and aerozol for their feedback over IRC regarding the UI/UX!

Experience

I am really glad that I got mentorship from lucifer and monkey. Both are people whom I look up to, as they both are people who are not only good at their crafts but are also very good mentors. I really enjoy contributing to ListenBrainz because almost every discussion is a round table discussion. Anyone can chime in and suggest and have very interesting discussions on the IRC. I am very glad that my open source journey started with MetaBrainz and its wholesome community. It is almost every programmer’s dream to work on projects that actually matter. I am so glad that I had the good luck to work on a project that is actually being used by a lot of people and also had the opportunity to work on a large codebase where a lot of people have already contributed. It really made me push my boundaries and made me more confident about being good at open source!

Summer of Code: But wait, we have another participant!

This year’s Google Summer of Code participant selection process created a situation that we’ve never encountered before: two participants put in excellent proposals for the same project, and both did a very good job of engaging with the community. But there was one difference between the two: one participant had engaged with us months earlier, written a whole new feature, seen it through the release process and gotten the feature into production.

This compelled us to accept the participant with whom we had already built a rapport. But collectively we felt really really bad about the fact that the other participant, Chinmay Kunkikar, would be rejected from Summer of Code and not work with us.

Fortunately we had recently earned 15,000 GBP from our participation in the ODI Peer Learning Network 2, which we decided to spend on contributions to Open Source and musicians that our team loves. When the suggestion came up that we could create an internship on the spot that more or less follows the concept of Summer of Code, and that we could take on Chinmay and knock out yet another project during the summer, we jumped on the idea.

And with that I am pleased to announce that Chinmay will take on the “Upcoming and new releases page” project for ListenBrainz. This project will show a timeline of upcoming and recently released music, complete with the ability to play these releases on the page.

Our team feels strongly about Chinmay as well as this new feature, so we’re excited that we’re taking on this 8th participant for this summer.

Welcome Chinmay!

Welcoming GSoC 2022 students!

Thank you to everyone who submitted a proposal to MetaBrainz for this year’s Summer of Code!

This year, we have selected seven projects. The chosen students and projects are:

Ansh Goyal – BookBrainz and CritiqueBrainz: CritiqueBrainz reviews for BookBrainz entities

Ashutosh Aswal – MusicBrainz Android App: Adding BrainzPlayer in Android App

Prathamesh Ghatole – ListenBrainz: Clean Up The Music Listening Histories Dataset

Shatabarto – ListenBrainz: Send a track to another user as a personal recommendation

Shubh – BookBrainz: Unified Creation Form

skelly37 – Picard: Make Picard work in single instance mode, then improve existing error handling and crash info.

Yuzie – ListenBrainz: Add Timezone support to ListenBrainz

Welcome aboard and congratulations!
Your contributions to MetaBrainz projects and the community are impressive and admirable. We look forward to working with you over the summer and seeing your work come to fruition 🙂

Communication is the key to success in a closely knit community such as ours. Always feel free to reach out to your mentors and other members of the community if you face any issues (stuck in your code, health or family emergencies, etc.). We, the mentors, are here to support and help you.

For all the students that applied but did not get accepted: we appreciate your applications, and even if you did not make the cut this year, we hope that you will stick around and apply with us again next year when we know you better – and you know us better.

akshaaatt, alastair, lucifer, mayhem, monkey, outsidecontext, zas

MusicBrainz App 2021 Updates

Greetings, Everyone!

2021 has been a great year for the MusicBrainz Android App. The app has received updates regularly throughout the year!

Now that we are very close to 10,000 users on the Play Store, it is evident that the app caters to the needs of a good number of users, which is wonderful!

We have plans to introduce new features, involving those of ListenBrainz and CritiqueBrainz in the app. We are confident that the app serves its purpose of introducing everyone to the MetaBrainz world very soundly.

The app now features both a light and dark mode for the users!

Notable feature updates made this year can be found at https://blog.metabrainz.org/2021/07/30/musicbrainz-app/

Toward the end of the year, we made some remarkable technical updates to the codebase by introducing Fastlane to the app. This eases the process for the developers and allows us to make a release with the click of a button. This means we can now have a production release every month, day, or hour.

Although going strong and steady, the MusicBrainz developers would love more contributors to join in and share their knowledge with us, while we dive deep into the world of music.

Play Store: MusicBrainz – Apps on Google Play

F-Droid: MusicBrainz | F-Droid – Free and Open Source Android App Repository

GitHub: metabrainz/musicbrainz-android

Thank you!

Acoustic similarity in AcousticBrainz

We’re pleased to announce that we have just released acoustic similarity in AcousticBrainz. Acoustic similarity is a technique to automatically identify which recordings sound similar to other recordings, using only the recordings themselves, and not any additional metadata. This feature is available via the AcousticBrainz API and the AcousticBrainz website, from any recording page. General documentation on acoustic similarity is available at https://acousticbrainz.readthedocs.io/similarity.html.

This feature is based on work started by Philip Tovstogan at the Music Technology Group, the research group that provides the essentia feature extractor that powers AcousticBrainz. The work was continued by Aidan Lawford-Wickham during Summer of Code 2019. Thanks Philip and Aidan for your work!

From the recording view on AcousticBrainz, you can choose to see similar recordings and choose which similarity metric you want to use. Then, a list of recordings similar to the initial recording will be shown.

These metrics are based on different musical features that the AcousticBrainz feature extractor identifies in the audio file. Some of these features are related to timbral characteristics (generally, what something sounds like), rhythmic characteristics (related to tempo or perceived pulses), or AcousticBrainz’s high-level features (hybrid features that use our machine learning system to identify features such as genre, mood, or instrumentation).

One thing that we can immediately see in these results is that the same recording appears many times. This is because AcousticBrainz stores multiple different submissions for the same MBID, and will sometimes get submissions for the same recording with different MBIDs if the data in MusicBrainz is like this. This is actually really interesting! It shows us that we are successfully identifying two different submissions in AcousticBrainz as being the same, using only acoustic information and no metadata. Using the API you can ask to remove these duplicated MBIDs from the results, and we have some future plans to use MusicBrainz metadata to filter more of these results when needed.
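As a rough illustration of what removing duplicated MBIDs means, here is a minimal Python sketch that filters a list of similarity results so that each MBID appears only once. The result structure (a list of dicts with recording_mbid, offset and distance keys), the sample values and the function name remove_duplicate_mbids are all assumptions made up for this example; they are not the actual response format of the AcousticBrainz API.

```python
def remove_duplicate_mbids(results):
    """Keep only the first result seen for each MBID.

    `results` is assumed to be sorted by ascending distance, so the
    first occurrence of an MBID is also its closest submission.
    """
    seen = set()
    unique = []
    for item in results:
        mbid = item["recording_mbid"]
        if mbid in seen:
            # Another submission of an MBID we already kept: skip it.
            continue
        seen.add(mbid)
        unique.append(item)
    return unique


# Hypothetical, made-up results (placeholder MBIDs, not real ones).
example_results = [
    {"recording_mbid": "aaaa", "offset": 0, "distance": 0.00},
    {"recording_mbid": "aaaa", "offset": 1, "distance": 0.01},
    {"recording_mbid": "bbbb", "offset": 0, "distance": 0.02},
    {"recording_mbid": "aaaa", "offset": 2, "distance": 0.05},
    {"recording_mbid": "cccc", "offset": 0, "distance": 0.09},
]

deduplicated = remove_duplicate_mbids(example_results)
```

Note that a filter like this only collapses repeated MBIDs. The same recording submitted under a different MBID would still show up, which is the case where MusicBrainz metadata would be needed.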

What’s next?

We haven’t yet performed a thorough evaluation of the quality of these similarity results. We’d like people to use them and give us feedback on what they think. In the future we may look at performing some user studies in order to see if some specific features tend to give results that people consider “more” similar than others. AcousticBrainz has a number of additional features in our database, and we’d like to experiment with these to see if they can be used as similarity metrics as well.

The fact that we can identify the same recording as being similar even when the MusicBrainz ID is different is interesting. It could be useful to use this similarity to identify when two recordings could be merged in MusicBrainz.

The data files used for this similarity are stand-alone, and can be used without additional data from AcousticBrainz or MusicBrainz. We’re looking at ways that we can make these data files downloadable so that developers can use them without having to query the AcousticBrainz API. If you think that you might be interested in this, let us know!

Congratulations GSoC 2021 students!

Congratulations and thank you to everyone who submitted a project with MetaBrainz for this year’s Summer of Code!

This year, the selected projects are:

Ritiek Malhotra
MusicBrainz – Complete Rust binding for the MusicBrainz API

Akash Gupta
BookBrainz – Implement a “Series” entity

Akshat Tiwari
MusicBrainz Android App – Dawn of Showdown

Jason Dao
ListenBrainz – Pin Tracks & Review Tracks Through CritiqueBrainz

Yang Yang
MusicBrainz – Push the URL relationship editor to the next level

Welcome to the team, and congratulations!
In these troubled times it is all the more impressive that you all mustered the focus and determination to work on proposals, contribute to MetaBrainz projects and integrate with the community.

In our small and tightly knit team and community, communication is key!
If you run into any kind of issue (stuck in your code, starting a part-time job, health or family emergencies, etc.) don’t hesitate to contact your mentor as early as possible to find a solution; we’re here to support you.

We mentors all look forward to working with you before, during and after the summer, guiding you to success and helping you learn and improve your skills!

ruaok, yvanzo, mr_monkey, lucifer and oknozor

Thank you for your continued support, Google!

We’ve recently received our annual $30,000 support from Google. This brings the total amount donated to us by Google’s Open Source Programs Office to over $470,000; hopefully next year we’ll cross the half million dollar threshold!

I can’t quite express my gratitude for this level of support! Without Google’s help, especially early on, MetaBrainz may never have made it to sustainability. Google has helped us in a number of ways, including Google Code-In and Summer of Code — all of these forms of support have shaped our organization quite heavily over the past 15 or so years.

Thank you to Google and everyone at the Google Open Source Programs Office — we truly appreciate your support over the years!

Please nominate us for the Open Publishing Awards!

We’ve recently found out about the Open Publishing Awards:

The goal of the inaugural Open Publishing Awards is to promote and celebrate a wide variety of open projects in Publishing.

All content types emanating from the Publishing sector are eligible including Open Access articles, open monographs, Open Educational Resource Materials, open data, open textbooks etc.

Open data? That’s us! We’ve got a pile of it and if you like the work we do, why not nominate us for an award?

Thanks!