We’ve been accepted to Google’s Summer of Code program 2023!

I’m very pleased to announce that the MetaBrainz Foundation has been accepted into Google’s Summer of Code program for 2023. This amazing project has been incredibly influential for us and our teams, so we’re pleased to be part of it for another round.

Anyone wishing to participate in the program should carefully read the terms for contributors and if you are eligible, go ahead and take a look at our Summer of Code landing page where you can find our project ideas that we listed for this year. Our landing page will tell you what we require of our participants and how to pick up a project.

Good luck to all who are interested in participating!

Fresh Releases – My (G)SoC journey with MetaBrainz

For an open source software enthusiast like me who has contributed little pieces of code and documentation to various projects for almost a decade, the idea of applying for Google Summer of Code has always been exciting and intimidating because of its grand nature. After getting some experience in web development over the past year, I decided to not give in to my self-doubts and applied for the GSoC’22 with confidence and zeal. I am Chinmay Kunkikar from India, and I would like to talk about my project with the MetaBrainz Foundation – Fresh Releases, and take you on my journey through the GSoC 2022 program.

MusicBrainz is the largest structured online database of music metadata. Today, a myriad of developers leverage this data to build their client applications and projects. According to MusicBrainz Database statistics, 2022 alone saw a whopping 366,680* releases from 275,749 release groups, and 91.5% of these releases have cover art. Given that MusicBrainz has a plethora of useful data about music releases but no useful means of presenting it visually to general users, the idea of building the Fresh Releases page was born.

* as of 2022-11-30

Our objective with this project was, therefore,
  1. To make music discovery easier for the users by presenting the data available from the MusicBrainz database.
  2. To use a user’s listening habits to show them personalized music release suggestions.

Now let me take you one by one through the process of execution of the aforementioned idea. First comes the design process and choices for this page, then a discussion over the implementation of the card grid, the filters component, the timeline component, the user-specific releases page, and how they were made responsive. Later on, we can talk about some pains of testing React Hooks with the Enzyme library, and an accident I had with git. We will also see how we plan to improve Fresh Releases in the future. So let’s begin!

The design process

It was a natural design choice to represent a music release with a card as it is an accepted practice on modern music apps and websites. A release card shows the metadata of a release like the date, name, release group type, artist(s), and cover art in the middle.

A Release Card

Iteration 1 – The initial approach was to show a grid of release cards with an infinite scroll. The filters were of two kinds – One where a user can switch between Upcoming Releases, New Releases, and This Week’s Releases from a button group. And the user can filter releases based on the Release Group by selecting one from a dropdown menu. To limit loading hundreds of results at a time, a Show more button was placed at the bottom of the page. This design approach was similar to the Charts tab of the ListenBrainz user page.

It was pointed out during reviews that using dropdowns for the Release Group filters and having button groups to separate New releases from Upcoming releases were unintuitive choices. We also thought of having a filter that will remove all non-cover art releases from the page for a more visual exploration of music (good one, mayhem).

Iteration 2 – To keep the UI simple but interesting, we thought of showing today’s releases in the middle of the screen, allowing users to scroll up or down to see past or future releases respectively. The practicality and the user experience of this idea were a concern initially but the implementation of similar ideas in apps like Apple’s Time Machine app was convincing enough to make us go forward with it. This time, we used the classic Holy grail layout to design the page. The card grid will be shown in the main content column, the left sidebar will be used for filters, and the right sidebar will have a timeline slider component to scroll the grid up or down. This rectified the concerns we had earlier –

  1. The dropdown menu for filters is gone.
  2. Adopting the past/future scrolling idea resulted in an intuitive UI, getting rid of the filters that separated New releases from Upcoming releases.
  3. The toggle to hide releases without a cover art can be accommodated in the filters column with this layout.

Responsive layout – Thanks to the Holy grail layout, making the page responsive for mobile and tablet would be effortless. We transpose the layout such that the left sidebar is stacked horizontally on the top of the main content section and the right sidebar stacks horizontally below it, leaving the header and footer unchanged.

A set of two buttons was later added at the top of the page to switch between sitewide (or global) releases and user-specific releases (discussed in detail in a later section).

API Design

We built two API endpoints – one for the sitewide releases and the second one for user-specific releases.

  1. Sitewide fresh releases – This endpoint was built around the idea of the timeline feature of the UI. It optionally accepts a pivot date as an argument to show releases around that date. It also optionally accepts the number of days as an argument to show releases from days before and after the pivot date.
GET /1/explore/fresh-releases

Parameters
1. release_date – Fresh releases will be shown around this pivot date. The default is today’s date.
2. days – The number of days of fresh releases to show. Max 30 days.

Sample response

{
  "artist_credit_name": "Tar Blossom",
  "artist_mbids": [
    "313ab6d1-44e2-49eb-92e6-9e9ad2554bcd"
  ],
  "caa_id": 31955354514,
  "release_date": "2022-02-20",
  "release_group_mbid": "eadb43dd-9c2d-48b4-bf0c-4bb6baa61eb5",
  "release_group_primary_type": "Album",
  "release_mbid": "41c8921a-5fe9-4c15-ac62-0a7525271b5c",
  "release_name": "Of Mountains and Suns"
}
  2. User-specific Fresh Releases – This endpoint fetches releases for the current user within a month. It accepts no arguments. We will discuss user-specific releases in more detail in a separate section.
GET /1/user/{username}/fresh_releases
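
As a rough sketch of how a client might build requests for the two endpoints above, the helpers below assemble the URLs in Python. The host name, the helper names, and the YYYY-MM-DD format for release_date (borrowed from the sample response) are assumptions made for illustration; only the paths, the two parameters, and the 30-day cap come from the description above.

```python
from datetime import date
from urllib.parse import quote, urlencode

# Assumed API host; the post itself only gives the endpoint paths.
API_ROOT = "https://api.listenbrainz.org"


def sitewide_fresh_releases_url(release_date=None, days=None):
    """Build the URL for GET /1/explore/fresh-releases (hypothetical helper)."""
    if days is not None and not 1 <= days <= 30:
        # The post states a maximum of 30 days; the lower bound is an assumption.
        raise ValueError("days must be between 1 and 30")
    params = {}
    if release_date is not None:
        # Assumed format: ISO date, as seen in the sample response.
        params["release_date"] = release_date.isoformat()
    if days is not None:
        params["days"] = days
    url = f"{API_ROOT}/1/explore/fresh-releases"
    return f"{url}?{urlencode(params)}" if params else url


def user_fresh_releases_url(username):
    """Build the URL for GET /1/user/{username}/fresh_releases (hypothetical helper)."""
    return f"{API_ROOT}/1/user/{quote(username, safe='')}/fresh_releases"
```

Calling sitewide_fresh_releases_url() with no arguments leaves both parameters out, so the server-side defaults described above would apply.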

Some data sanitization

MusicBrainz, sometimes while collecting data from multiple sources, can create separate MBIDs of a release if there are minor changes in the metadata from two sources, resulting in duplicates of a release.

For example, {... "release_name": "Waterslide, Diving Board, Ladder to the Sky"} and {... "release_name": "Waterslide, Diving Board, Ladder To The Sky"} differ only in the capitalization of “to the”. Such results were deduplicated using lodash.uniqBy().
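
To illustrate the idea, here is a minimal Python analogue of a uniqBy-style deduplication. The actual page uses lodash in the frontend; the choice of key (artist credit plus release name, compared case-insensitively) is an assumption for this sketch, not necessarily the key used in ListenBrainz.

```python
def dedupe_releases(releases):
    """Keep the first release for each (artist, title) pair, ignoring case."""
    seen = set()
    unique = []
    for release in releases:
        # Assumed key: case-folded artist credit and release name.
        key = (
            release["artist_credit_name"].casefold(),
            release["release_name"].casefold(),
        )
        if key not in seen:
            seen.add(key)
            unique.append(release)
    return unique


sample = [
    {"artist_credit_name": "Artist A", "release_mbid": "mbid-1",
     "release_name": "Waterslide, Diving Board, Ladder to the Sky"},
    {"artist_credit_name": "Artist A", "release_mbid": "mbid-2",
     "release_name": "Waterslide, Diving Board, Ladder To The Sky"},
    {"artist_credit_name": "Artist B", "release_mbid": "mbid-3",
     "release_name": "Another Release"},
]
deduped = dedupe_releases(sample)
```

Like uniqBy, this keeps the first occurrence and preserves the original order of the remaining items.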

Filters section

Filters help users shortlist releases from specific categories, for example, Albums, Singles, or Remix, among others.

The filters component is divided into two sections – A Hide releases without coverart toggle button and a list of release group types. This list is a dynamically generated array of unique release group types from the API response object.
Multiple filters can be selected to show releases with a combination of filters.
The initial logic for the cover-art-only toggle required a lot of prop drilling through multiple components. To avoid this anti-pattern, we added the caa_id property to the API response. The caa_id is the Cover Art Archive ID, which is available only for releases that have valid cover art. The toggle can thus use this property to hide the releases that don’t have a caa_id.
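
The filtering logic described in this section can be sketched in a few lines of Python. The real component is written in React; the function names here are hypothetical, and the sketch assumes that a release without cover art has a missing or null caa_id and that the type filter works on the release_group_primary_type field from the sample response.

```python
def release_group_types(releases):
    """Unique release group types, in order of first appearance."""
    types = (r.get("release_group_primary_type") for r in releases)
    return list(dict.fromkeys(t for t in types if t))


def apply_filters(releases, selected_types=(), hide_without_cover_art=False):
    """Return the releases matching the selected types and cover art toggle."""
    selected = set(selected_types)
    visible = []
    for release in releases:
        if hide_without_cover_art and release.get("caa_id") is None:
            continue  # no Cover Art Archive ID, so no cover art to show
        if selected and release.get("release_group_primary_type") not in selected:
            continue  # some types are selected and this one is not among them
        visible.append(release)
    return visible


releases = [
    {"release_name": "A", "release_group_primary_type": "Album", "caa_id": 1},
    {"release_name": "B", "release_group_primary_type": "Single", "caa_id": None},
    {"release_name": "C", "release_group_primary_type": "Album", "caa_id": None},
]
```

With no types selected, every type is shown, which matches the behaviour of a filter list where nothing is ticked yet.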

The Timeline

We were not sure what basic component or library should be used to implement the timeline. After struggling to find an implementation that matches our needs, a suggestion came from monkey to use the basic HTML range slider element. After contemplating this suggestion, we decided to use the rc-slider library which is based on the HTML range slider but is more React-friendly with additional useful features and styles. We got help from monkey again for the scrolling logic for the slider.

The slider shows marks with dates on them. Clicking on a date will trigger the changeHandler() function that accepts a percentage value from the slider’s current position and returns the position to scroll to on the page. This scrolls the page to the respective date. This was made possible by the createMarks() function that calculates a percent value for the number of releases per date in the releases list. This function creates an object that the slider uses to create the marks on the slider. The handleScroll() is a debounced function triggered every time the user manually changes the scroll position.
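
To make the idea behind createMarks() more concrete, here is a small Python sketch of one way such marks could be computed. It is not the actual implementation: it assumes the releases are already sorted by date and places each date's mark at the percentage of the list at which that date's first release appears.

```python
def create_marks(release_dates):
    """Map a slider percentage to each distinct date in a sorted list of dates."""
    total = len(release_dates)
    marks = {}
    seen = set()
    for index, day in enumerate(release_dates):
        if day not in seen:
            seen.add(day)
            # Position of this date's first release, as a percentage of the list.
            marks[round(100 * index / total, 2)] = day
    return marks


marks = create_marks(["2022-02-19", "2022-02-19", "2022-02-20", "2022-02-21"])
```

Dates with many releases take up a larger share of the slider, so the marks stay roughly aligned with where those releases sit on the page.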

User-specific Fresh Releases page

User-specific releases or User Fresh Releases will show releases from artists that the user has listened to before. It uses a confidence score, which is the number of times the user has listened to an artist, to rank the releases in decreasing order. A user can switch between the sitewide releases and user-specific releases using the Pill buttons at the top of the page. If the user is not logged in, they will only see the sitewide releases.

Making the page responsive

Since ListenBrainz uses Bootstrap as its base styling framework, we used the Bootstrap 3 breakpoints with an additional breakpoint from Bootstrap 4 (576px) to make the page responsive for multiple screen sizes. The number of columns in the grid was pretty straightforward; we started with two columns and kept adding an extra column per breakpoint. The filters and timeline components, which are vertical on desktop screens, change their orientation to horizontal on screens up to the md breakpoint. As a future enhancement, we have plans for the timeline to look and behave like Android’s fastscroll widget (thanks to aerozol for the suggestion).

Tests

ListenBrainz uses snapshot testing for the frontend to make sure there are no unexpected changes to the UI. While writing unit tests for mocking the API calls was a cakewalk, we struggled with writing tests for rendering and mocking the UI components for snapshot testing. A combination of Jest and Enzyme is used to test React components, which works quite well for React class components, but quickly becomes a nightmare when testing functional components, which Fresh Releases uses. Enzyme provides no APIs to test and mock hooks like the useState hook. Despite having no support for hooks, surprisingly, it can run the code inside of the hooks themselves. But there is no easy way to mock the useState setter function. As a result, the tests for Fresh Releases are still only half-written. This limited support for newer features of React will only worsen over time because Enzyme is no longer actively maintained. There are discussions in the team to move away from Enzyme in favor of other testing libraries, and we believe there are good reasons to do so.

The incident with git

I messed with git reset --hard and git push --force on my working branch, wiping away the entire commit history from the local branch and the remote branch. We used git reflog to recover the lost commits. Git reflog is similar to git log, but instead of a list of commits of the current HEAD, it shows the list (or log) of times when HEAD (or reference) itself was changed. We checked out the previous HEAD to a new branch and all of the commit history was seen again (thanks again, lucifer). An important lesson learned that night was to never play around with git reset unless you’re 100% sure what you are doing. Use separate branches to test the commands that can mutate your commit history. But also do not panic if you do run into such situations. There is a high chance git has cached the history somewhere locally.

Git will never fail to surprise you, no matter how experienced you are in using it.

Future improvements and enhancements

To add to the list in the LB-1172 ticket,

  1. Release card grid – Remove all text content from the cards and keep just the coverarts when the “Hide releases without coverarts” filter is set. The text can be shown on the cover art when the Release Card is hovered over. And the name of the filter will be changed to “Show coverarts only.” (suggestion by aerozol)
  2. Integration of BrainzPlayer on the page will add the ability to play new releases directly on the page. To quote monkey, “the solution would be to redirect users to a playable album page on LB. For example, listenbrainz.org/player/release/f5d6d909-06dc-4811-8e13-811e6af31b82”.
  3. Performance optimizations – One issue Fresh Releases faces is the rendering of hundreds (if not thousands) of DOM nodes on the page. This can significantly slow down the page and the browser. The temporary solution we have used is to limit the number of release days shown. But we want the page to be able to show releases of up to a month. This issue can be solved either by adding “load more” buttons or by virtualizing or “windowing” the cards grid component using libraries like react-window.
  4. Implement aerozol’s idea to modify the timeline on mobile screens to look and behave like Android’s fastscroll widget (discussed above).
  5. Add more tests to test combinations of filters.

My journey with MetaBrainz Foundation

While browsing the organizations, I recognized that the MetaBrainz Foundation is the brains (pun intended) behind MusicBrainz Picard, the handy tool that has kept my music library clean for years. After reading about their projects, my interest in the ListenBrainz project was piqued because I learned that it is essentially an open source alternative to Last.FM, a service where I enjoy perusing my music listening habits.

During the community bonding period of GSoC, I set up and played around with the ListenBrainz codebase, as well as closed a few tickets. During this time, I worked on adding a tooltip to the BrainzPlayer progress bar and updating node dependencies to the most recent versions (more complicated than it sounds).

Here is a list of all of my pull requests – metabrainz/listenbrainz-server/pulls.

Even though my original project proposal was not selected as a GSoC project, the MetaBrainz team was so impressed with my work that they chose to create an internship position exclusively for me. Fresh Releases was not an official GSoC project, but the team nonetheless considered it as such. You can read the story in a previous blog post.

My experience & learnings

I’ve always admired open source and the opportunities it provides. Working on projects like ListenBrainz has pushed me to contribute more to open source. In these past few months, apart from working on Fresh Releases, I also had a chance to work on other parts of the codebase. That boosted my confidence further in working on large codebases. I am fortunate to work with mayhem, lucifer and monkey, the lead developers who work in open source environments and require the extra skill of connecting with the community and being patient with new enthusiasts. monkey’s thorough code reviews motivated me to work on the project more. He would explain concepts and then provide short code snippets to show me how to implement them in the code. The entire MetaBrainz community is motivated and a joy to work with. Every design and technical decision is backed up by well-thought-out recommendations and open discussions on IRC. Developers will take note of your suggestions. Your efforts will be recognised. Working here has been a lot of fun!

P.S. Did I mention that they printed the proposals of all the selected students and put them up on the MetaBrainz office wall?!

My proposal along with others’ on the office wall!

Cleaning up the Music Listening Histories Dataset

Hi, this is Prathamesh Ghatole (IRC Nick: “Pratha-Fish”), and I am an aspiring Data Engineer from India, currently pursuing my bachelor’s in AI at GHRCEM Pune, and another bachelor’s in Data Science and Applications at IIT Madras. 

I had the pleasure of being mentored by alastairp and the rest of the incredible team at the MetaBrainz Foundation throughout this complicated but super fun project as a GSoC ’22 contributor! This blog post is all about my journey over the past 18 weeks.

In an era where music streaming is the norm, it is no secret that to create modern, more efficient, and personalized music information retrieval systems, the modelling of users is necessary because many features of multimedia content delivery are perceptual and user-dependent. As music information researchers, our community has to be able to observe, investigate, and gather insights from the listening behavior of people in order to develop better, personalized music retrieval systems. Yet, since most media streaming companies know that the data they collect from their customers is very valuable, they usually do not share their datasets. The Music Listening Histories Dataset (MLHD) is the largest-of-its-kind collection of 27 billion music listening events assembled from the listening histories of over 583k last.fm users, involving over 555k unique artists, 900k albums, and 7M tracks. The logs in the dataset are organized in the form of sanitized listening histories per user, where each user has one file, with one log per line. Each log is a quadruple of: 

<timestamp, artist MBID, release-MBID, recording MBID>

The full dataset contains 576 files of about 1GB each. These files are subsequently bundled in sets of 32 TAR files (summing up to ~611.39 GB in size) in order to facilitate convenient downloading.
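
As a small illustration of the log format described above, the snippet below parses one log line into a named quadruple. The tab separator, the integer Unix timestamp, and the use of empty strings for missing MBIDs are assumptions made for this sketch; the post only specifies the four fields and their order.

```python
from typing import NamedTuple


class Listen(NamedTuple):
    timestamp: int
    artist_mbid: str
    release_mbid: str
    recording_mbid: str


def parse_log_line(line, sep="\t"):
    """Parse '<timestamp, artist MBID, release MBID, recording MBID>' from one line."""
    timestamp, artist, release, recording = line.rstrip("\n").split(sep)
    return Listen(int(timestamp), artist, release, recording)


example = parse_log_line("1328000000\tartist-mbid\t\trecording-mbid\n")
```

Here the release MBID is left empty on purpose to show how a missing value would come through under these assumptions.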

Some salient features of the MLHD:

  • Each entity in every log is linked to a MusicBrainz Identifier (MBID) for easy linkage to other existing sources.
  • All the logs are time-stamped, resulting in a timeline of listening events.
  • The dataset is freely available and is orders of magnitudes larger than any other dataset of its kind.
  • All the data is scraped from last.fm, where users publicly self-declare their music listening histories.

What is our goal with this project?

The dataset would be useful for observing many interesting insights like:

  • How long people listen to music in a single session
  • The kinds of related music that people listen to in a single session
  • The relationship between artists and albums and songs
  • What artists do people tend to listen to together?

In its current form, the MLHD is a great dataset in itself, but for our particular use case, we’d like to make some additions and fix a few issues caused by last.fm’s out-of-date algorithms for matching against the MusicBrainz database. (All issues are discussed in detail in my original GSoC proposal.)

For example:

  1. The artist conflation issue: We found that the artist MBIDs for commonly used names were wrong for many logs, where the artist MBID pointed to incorrect artists with the same name in the MusicBrainz database, e.g. for the song “Devil’s Radio” by “George Harrison” (from the Beatles), the MLHD incorrectly points to an obscure Russian hardcore group named “George Harrison”.
  2. Multiple artist credits: The original MLHD provides only 1 single artist-MBID, even in case of recordings with multiple artists involved. We aim to fix that by providing a complete artist credit list for every recording.
  3. Complete data for every valid recording MBID: We aim to use the MusicBrainz database to fetch accurate artist credit lists and release MBIDs for every valid recording MBID, hence improving the quality and reliability of the dataset.
  4. MBID redirects: 22.7% of the recording MBIDs (from a test set of 371k unique recording MBIDs) that we tested were not suitable for our direct use. Of the 22.7% of recording MBIDs, 98.66% MBIDs were just redirected to other MBIDs (that were correct too).
  5. Non-Canonical MBIDs: A significant fraction of MBIDs were not canonical MBIDs. In the case of recording MBIDs, a release-group might use multiple valid MBIDs to represent the release, but there’s always a single MBID that is the “most representative” of the release group, known as a “canonical” MBID.

While the existing redirecting as well as non-canonical MBIDs are technically correct and identical when considered in aggregate, we think replacing these MBIDs with their canonical counterparts would be a nice addition to the existing dataset and aid in better processing. Overall, the goal of this project is to write high-performance Python code to resolve the dataset as quickly as possible to an updated version, in the same format as the original, but with incorrect data rectified and invalid data removed.

Check out the complete codebase for this project at: https://github.com/Prathamesh-Ghatole/MLHD

The Execution

Personally, I’d classify this project as a Data Science or Data Engineering task involving lots of analytics, exploration, cleanup, and moving goals and paths as a result of back-and-forth feedback from stakeholders. For a novice like me, this project was made possible through many iterations involving trial and error, learning new things, and constantly evolving existing solutions to make them more viable, and in line with the goals. Communication was a critical factor throughout this project, and thanks to Alastair, we were able to communicate effectively on the #Metabrainz IRC channel and keep a log of changes in my task journal, along with weekly Monday meetings to keep up with the community.

Skills/Technologies used

  • Python3, Pandas, iPython Notebooks – For pretty much everything
  • NumPy, Apache Arrow – For optimizations
  • Matplotlib, Plotly – For visualizations
  • PostgreSQL, psycopg2 – For fetching MusicBrainz database tables, quick-and-dirty analytics, etc.
  • Linux – For working with a remote Linux server for processing data.
  • Git – For version control & code sharing.

Preliminary Analysis

1. Checking the demographics for MBIDs

We analyzed 100 random files from the MLHD with 3.6M rows and found the following results. Of the 381k unique recording MBIDs, ~22.7% were not readily usable, i.e. they had to be redirected or made canonical. However, of these ~22.7% MBIDs, ~98.66% were correctly redirected to a valid recording MBID using the MusicBrainz database’s “recording” table, implying that only ~0.301% of all UNIQUE recording MBIDs from MLHD were completely unknown (i.e. they neither belonged to the “recording” table nor had a valid redirect). Similarly, ~5.508% of all UNIQUE artist MBIDs were completely unknown (i.e. they belonged to neither the “artist” table nor the “artist_gid_redirect” table).
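
The share of completely unknown recording MBIDs follows from the two percentages above. The quick check below uses the rounded figures quoted in this post, which is presumably why it lands slightly above the ~0.301% computed from the raw counts.

```python
not_readily_usable = 0.227   # share of unique recording MBIDs needing a fix
resolved_by_redirect = 0.9866  # share of those that had a valid redirect

# Whatever could not be redirected is completely unknown.
completely_unknown = not_readily_usable * (1 - resolved_by_redirect)

print(f"{completely_unknown:.3%}")
```

Either way, well under half a percent of the unique recording MBIDs in the sample could not be resolved at all.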

2. Checking for the artist conflation Issue:

There are many artists with exactly the same name, but we were unsure whether last.fm’s algorithms matched the correct artist MBID to a recording MBID every time in such cases. To verify this, we fetched artist MBIDs for each recording MBID in a test set and compared them to the artist MBIDs actually present in the dataset. Lo and behold, we discovered that ~9.13% of the cases faced this issue in our test set of 376,037 unique cases.

SOLUTION 1

This is how we first tried dealing with the artist conflation issue:

  1. Take a random MLHD file
  2. “Clean up” the existing artist MBIDs and recording MBIDs, and find their canonical MBIDs. (Discussed in detail in the section “Checking for non-canonical & redirectable MBIDs”)
  3. Fetch the respective artist name and recording name for every artist MBID and recording MBID from the MusicBrainz database.
  4. For each row, pass <artist name, recording name> to either of the following MusicBrainz APIs:
    1. https://datasets.listenbrainz.org/mbc-lookup 
    2. https://labs.api.listenbrainz.org/mbid-mapping
  5. Compare artist MBIDs returned by the above API to the existing artist MBIDs in MLHD.
  6. If the existing MBID is different from the one returned by the API, replace it.

However, this method meant making an API call for EACH of the 27bn rows of the dataset. That is 27 billion API calls, each of which would have taken at least 0.5s, i.e. 156,250 days just to solve the artist conflation issue. This was in no way feasible, and would have taken ages to complete even if we had parallelized the whole process with Apache Spark. Even after all this, the output generated by this API would have been a fuzzy solution at best, and prone to errors.
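
For anyone who wants to check the numbers, the estimate works out as follows. The row count and the 0.5s per call come from the paragraph above; the worker count at the end is purely a hypothetical figure to show how slowly the total shrinks even with heavy parallelism.

```python
ROWS = 27_000_000_000        # one API call per row
SECONDS_PER_CALL = 0.5       # optimistic lower bound per call
SECONDS_PER_DAY = 86_400

total_days = ROWS * SECONDS_PER_CALL / SECONDS_PER_DAY
total_years = total_days / 365.25

# Hypothetical: 1000 perfectly parallel workers with no rate limiting.
days_with_1000_workers = total_days / 1000

print(total_days, round(total_years), days_with_1000_workers)
```

Even under that generous assumption, the job would still keep a thousand workers busy for months.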

SOLUTION 2

Finally, we tackled the artist conflation issue by using the MusicBrainz database to fetch artist credit lists for each valid recording MBID. This enabled us to perform in-memory computations and completely eliminated the need to make API calls, saving us a lot of processing time. This not only ensured that every artist MBID corresponded to the correct recording MBID 100% of the time, but also:

  • Improved the quality of the provided artist MBIDs by providing a list of artist MBIDs for recordings credited to multiple artists.
  • Increased the count of release MBIDs in the dataset by 10.19%!
    (Test performed on the first 15 files from the MLHD, summing up to 952229 rows of data)
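
A minimal sketch of this lookup-based approach is shown below. The real pipeline works on pandas DataFrames loaded from the MusicBrainz database; here a plain dictionary with made-up identifiers stands in for that lookup table, and dropping rows with unknown recording MBIDs is an assumption about how invalid data is handled.

```python
# Hypothetical stand-in for a table keyed by recording MBID:
# recording MBID -> (list of artist credit MBIDs, release MBID)
CANONICAL_DATA = {
    "rec-1": (["artist-a", "artist-b"], "release-x"),
    "rec-2": (["artist-c"], "release-y"),
}


def fix_row(row, canonical_data):
    """Replace the artist and release MBIDs of one log using its recording MBID."""
    timestamp, _old_artist, _old_release, recording = row
    match = canonical_data.get(recording)
    if match is None:
        return None  # assumption: rows with unknown recordings are dropped
    artists, release = match
    return (timestamp, artists, release, recording)


rows = [
    (1328000000, "wrong-artist", "", "rec-1"),
    (1328000300, "artist-c", "release-y", "rec-unknown"),
]
fixed = [r for r in (fix_row(row, CANONICAL_DATA) for row in rows) if r is not None]
```

Because the artist credit is derived from the recording MBID alone, a conflated artist MBID in the original log can no longer leak into the output.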

3. A new contender appears! (Fixing the MBID mapper)

While working out “SOLUTION 1” as discussed in the previous section, we processed thousands of rows of data, compared the outputs of the mbc-lookup API and the mbid-mapping API, and discovered that these APIs sometimes returned different outputs when they should have returned the same ones. This uncovered a fundamental issue in the mbid-mapping API that was actively being used by ListenBrainz to link music logs streamed by users to their respective entities in the MusicBrainz database. We spent a while trying to analyze the depth of this issue by generating test logs and reports for both the mapping endpoints and discovered patterns that helped point to some bugs in the matching algorithms written for the API. This new discovery helped lucifer debug the mapper, resulting in the following pull request: Fix invalid entries in canonical_recording_redirect table by amCap1712 · Pull Request #2133 · metabrainz/listenbrainz-server (github.com)

4. Checking for non-canonical & redirectable MBIDs

To use the MusicBrainz database to fetch artist names and recording names w.r.t. their MBIDs, we first had to make sure MBIDs we used to lookup the names were valid, consistent, and “clean”. This was done by:

  1. Checking if an MBID was redirected to some other MBID, and replacing the existing MBID with the MBID it redirected to.
  2. Finding a Canonical MBID for all the recording MBIDs.

We used the MusicBrainz database’s “mapping.canonical_recording_redirect” table to fetch canonical recording MBIDs, and the “recording_gid_redirect” table to check and fetch redirects for all the recording MBIDs. We first tried mapping an SQL query over every row to fetch results, but soon realized this would have slowed the whole process down to unbearable levels. Since we were running the processes on “Wolf” (a server at MetaBrainz Inc.), we had access to 128GB of RAM, enabling us to load all the required SQL tables into memory using pandas, eliminating the need to query SQL tables stored on disk.
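
The two-step clean-up can be sketched as follows. In the real scripts the two tables are loaded into memory with pandas; in this sketch, small dictionaries with made-up identifiers stand in for them, and the function name is hypothetical.

```python
# Hypothetical in-memory copies of the two lookup tables.
GID_REDIRECTS = {"old-rec": "merged-rec"}             # recording_gid_redirect
CANONICAL_REDIRECTS = {"merged-rec": "canonical-rec"}  # canonical_recording_redirect


def clean_recording_mbid(mbid, gid_redirects, canonical_redirects):
    """Follow a redirect if one exists, then map to the canonical MBID."""
    redirected = gid_redirects.get(mbid, mbid)
    return canonical_redirects.get(redirected, redirected)


cleaned = [
    clean_recording_mbid(m, GID_REDIRECTS, CANONICAL_REDIRECTS)
    for m in ["old-rec", "merged-rec", "canonical-rec", "untouched-rec"]
]
```

MBIDs that appear in neither table pass through unchanged, so the function is safe to apply to every row.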

5. Checking for track MBIDs being mistaken for recording MBIDs

We suspected that some of the unknown recording MBIDs in the dataset could actually be track MBIDs disguised as recording MBIDs due to some errors in mapping. While exploring the demographics on a test sample set of 381k unique recording MBIDs, we discovered that none of the unknown recording MBIDs confirmed this case. To further verify this problem, we ran the tests on ALL recording MBIDs in the MLHD. To hit 2 birds in one iteration, we also re-compressed every file in the MLHD from GZIP compression to a more modern, ZStandard compression, since GZIP read/write times were a huge bottleneck while costing us 671GB in storage space. This process resulted in:

  • The conversion of all 594410 MLHD files from GZIP compression to ZSTD compression in 83.1 hours.
  • The dataset being reduced from 571 GB -> 268 GB in size. (53.75% Improvement!)
  • File Write Speed: 17.46% improvement.
  • File Read Speed: 39.25% deterioration.
  • Confirmed the fact that no track MBID existed in the recording MBID column of the MLHD.

Optimizations

1. Dumping essential lookup tables from the MusicBrainz database to parquet.

We used the following tables from the MusicBrainz database in the main cleanup script to query MBIDs:

  1. recording: Lookup recording names using recording MBIDs, Get a list of canonical recording MBIDs for lookups.
  2. recording_gid_redirect: Lookup valid redirects for recording MBIDs using redirectable recording MBIDs as index.
  3. mapping.canonical_recording_redirect: Lookup canonical recording MBIDs using non-canonical recording MBIDs as index.
  4. mapping.canonical_musicbrainz_data: Lookup artist MBIDs, and release MBIDs using recording MBIDs as index.

In our earlier test scripts, we mapped SQL queries over the recording MBID column to fetch outputs. This resulted in ridiculously slow lookups where a lot of time was being wasted in I/O. We decided to pre-load the tables into memory using pandas.read_sql(), which added a constant delay at the beginning of the script, but reduced the lookup times from dozens of seconds to milliseconds. The pandas documentation recommends using an SQLAlchemy connectable to fetch SQL tables into pandas. However, we noticed that pandas.read_sql() with a psycopg2 connection was 80% faster than pandas.read_sql() with an SQLAlchemy connectable, even though pandas doesn’t officially recommend using psycopg2 at all. Fetching the same tables from the database again and again was still slow, so we decided to dump all the required SQL tables to parquet, giving a further 33% improvement in loading time.

2. Migrating CSV reading and writing from pandas.read_csv() and DataFrame.to_csv() to pyarrow.csv.read_csv() and pyarrow.csv.write_csv():

We started off by using custom functions based on pandas.read_csv() to read CSV files and preprocess them (rename columns, drop rows as required, concatenate multiple files if specified, etc.). Similarly, we used pandas.DataFrame.to_csv() to write the files. However, we soon discovered that these functions were unnecessarily slow, and a HUGE bottleneck for processing the dataset. We were able to optimize the custom functions by leveraging pandas’ built-in vectorized functions instead of relying on for loops to pre-process dataframes once loaded. This brought down the time required to load test dataframes significantly.

pandas.read_csv() and pandas.DataFrame.to_csv() on their own are super convenient, but aren’t super performant, especially when you need them to compress/decompress files before reading/writing. Pandas’ reading/writing functions come with a ton of extra bells and whistles. Intuitively, we started writing our own barebones CSV reader/writer with NumPy. Turns out this method was far slower than the built-in pandas methods! We tried vectorizing our custom barebones CSV reader using Numba, an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code. However, this method failed too, for various reasons (mostly due to my own inexperience with Numba). Finally, we tried pyarrow, a library that provides Python APIs for the functionality provided by the Arrow C++ libraries, including but not limited to reading, writing, and compressing CSV files. This was a MASSIVE success, giving an 86.11% improvement in writing speeds and a 30.61% improvement in reading speeds, even while writing DataFrames back as CSV with ZSTD level 10 compression!

3. Pandas optimizations

In pandas, there are often multiple ways to do the same thing, and some of them are much faster than others due to their implementation. We realized a bit too late that pandas isn’t that good for processing big data in the first place! But I think we were pretty successful with our optimizations and made the best of pandas too. Here are some neat pandas optimizations that we made along the way.

pd.DataFrame.loc[] returns the whole row (a vector of values), but pd.DataFrame.at[] only returns a single value (a scalar).

Intuitively, pd.DataFrame.loc[] should be faster at looking up and returning a tuple of values than pd.DataFrame.at[], since the latter needs a separate lookup for every value fetched for a single query, whereas the former doesn’t. However, for our use case, running pd.DataFrame.at[] twice per iteration to fetch multiple values was still ~55x faster than running pd.DataFrame.loc[] once to fetch the complete row!
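To make the two access patterns concrete, here is a small sketch on a toy lookup table. The column names and MBID values are made up for illustration, and the ~55x figure comes from our own dataset, so this snippet only shows that both routes return the same values.

```python
import pandas as pd

# Toy lookup table indexed by recording MBID (values are placeholders).
lookup = pd.DataFrame(
    {"artist_mbid": ["a-1", "a-2"], "release_mbid": ["r-1", "r-2"]},
    index=["rec-1", "rec-2"],
)

# .loc builds the whole row (a Series) before we pick values out of it.
row = lookup.loc["rec-2"]
via_loc = (row["artist_mbid"], row["release_mbid"])

# .at fetches one scalar per call, so two values need two calls.
via_at = (
    lookup.at["rec-2", "artist_mbid"],
    lookup.at["rec-2", "release_mbid"],
)
```

Timing both variants (for example with timeit) on a realistically sized table is the easiest way to check which one is faster for a given workload.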

Some of the most crucial features that pandas offers are vectorized functions.
Vectorization in our context refers to the ability to work on a set of values in parallel. Vectorized functions do a LOT more work in a single loop, enabling them to produce results way faster than typical for-loops that operate on a single value per iteration. In pandas, these vectorized functions can speed up operations by as much as 1000x! For MLHD, we fetched artist MBIDs and release MBIDs (based on a recording MBID) as a tuple representing a pair of MBIDs. This meant a tuple for each recording MBID, leaving us with a series of tuples that we needed to split into two different series. The simplest solution to this issue would be to use tuple unpacking with Python’s built-in zip function as follows:

artist_MBIDs, release_MBIDs = zip(*series_of_tuples)

For our particular case, we also had to add an extra step of mapping a “clean up” function over the whole series before unzipping it. The mapping process in the above case was a serious bottleneck, so we had to find something better. In the end, we were able to significantly speed up the above process by avoiding apply/map functions completely and cleverly utilizing existing vectorized functions instead. The details of the solution can be found at: quickHow: how to split a column of tuples into a pandas dataframe (github.com)
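The linked write-up has the full details. As a hedged sketch of the general idea (not necessarily the exact code we ended up with), one common way to split a series of tuples without apply/map is to hand the tuples to the DataFrame constructor in one go. The MBID values and column names below are placeholders.

```python
import pandas as pd

# Placeholder (artist MBID, release MBID) pairs, one per recording MBID.
series_of_tuples = pd.Series(
    [("a-1", "r-1"), ("a-2", "r-2"), ("a-3", "r-3")],
    index=["rec-1", "rec-2", "rec-3"],
)

# Build both columns in a single constructor call instead of mapping a
# Python function over every element.
pairs = pd.DataFrame(
    series_of_tuples.tolist(),
    index=series_of_tuples.index,
    columns=["artist_mbid", "release_mbid"],
)
artist_mbids = pairs["artist_mbid"]
release_mbids = pairs["release_mbid"]
```

Passing the original index keeps each pair aligned with its recording MBID.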

In our first few iterations, we used pandas.Series.isin() to check if a series of recording MBIDs existed in the “recording” table of the MusicBrainz database or not. Pandas functions in general are very optimized and occasionally written in C/C++/Fortran instead of Python, making them super fast. I assumed the same would be the case with pandas.Series.isin(). My mentor suggested that we use built-in Python sets and plain for-loops for this particular case. My first reaction was “there’s no way our 20 lines of Python code are gonna hold up against pandas. Everyone knows running plain for-loops against vectorized functions is a huge no-no”. But, as we kept on trying, the results completely blew me away! Here’s what we tried:

  1. Convert the index (a list) of the “recording” table from the MusicBrainz database to a Python Set.
  2. Iterate over all the recording MBIDs in the dataset using a for-loop, and check if MBIDs exist in the set using Python’s “in” keyword.

For 4,738,139 rows of data, pandas.Series.isin() took 13.1s to process. The sets + for-loop method took 1.03s (plus a one-time cost of 6s at the start of the script to convert the index list into a set)! The magic here was done by converting the index of the “recording” table into a Python set, which essentially puts all the values in a hashmap.

A hashmap reduces the time complexity of each membership check to O(1) on average. On the other hand, pandas.Series.isin() was struggling with what looked like at least O(n) time complexity per lookup, as if it were essentially a list search over unordered items. This arrangement meant paying a one-time cost to convert the index to a Python set at the start of the script, followed by a constant-time O(1) lookup for each item in the loop.
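The two steps can be sketched in a few lines of plain Python. The MBID values here are placeholders and the variable names are made up for illustration.

```python
# Stand-in for the index of the "recording" table.
recording_index = ["rec-1", "rec-2", "rec-3"]

# Step 1: one-time conversion of the index into a set (a hash-based lookup).
known_recordings = set(recording_index)

# Step 2: plain for-loop with an "in" check for every MBID in the dataset.
dataset_mbids = ["rec-2", "rec-9", "rec-1"]
exists = []
for mbid in dataset_mbids:
    exists.append(mbid in known_recordings)
```

The conversion cost is paid once, and each "in" check against the set is then an average constant-time operation.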

Final Run

As of October 20, 2022, we’ve finally started testing for all 594,410 MLHD files to process and re-write ~27 billion rows of data. The output for a test performed on the first 15 files from the MLHD, summing up to 952,229 rows of data, is as follows:

Here, the cleanup process involves: Fetching redirects for recording MBIDs; Fetching canonical MBIDs for recording MBIDs; Fetching artist credit lists and release MBIDs based on recording MBIDs; and Mapping the newly fetched artist credit lists and release MBIDs to their respective recording MBIDs.

The above process is completely recording MBID oriented in order to maintain quality and consistency. This means completely wiping off the data in the artist_MBID and release_MBID columns in order to replace them with reliable data fetched from the MusicBrainz database. This also means that the above process will bring a significant change in the demographics of various entities (artist MBIDs, release MBIDs, and recording MBIDs) in the final version of the dataset.

Even though the impact of changing demographics varies from file to file (depending on the user’s tendency to listen to certain recordings repeatedly), here are some statistics based on the first 15 files in the MLHD, before and after processing:

For the complete test set of 952,229 input rows, the shrinkage is as follows:
In the original MLHD, the row count shrinks from 952,229 to 789,788 after dropping rows with empty recording MBID values (17.06% shrinkage). In the processed MLHD, the same input shrinks to 787,690 rows after dropping rows with empty recording MBID values (17.28% shrinkage). For a fair comparison, we compare the two datasets after dropping all rows with empty recording MBID values from both: 789,788 rows in the original dataset and 787,690 rows in the processed dataset. The absolute shrinkage between the original and processed dataset is as follows:

Abs_shrinkage = ((789788 - 787690) / 789788) * 100 = 0.27%

Therefore, the cleaning process only resulted in a shrinkage of 0.27% of the existing recording MBIDs in the MLHD! Note that this stat is also in line with our previous estimate that ~0.301% of all recording MBIDs were completely unknown. As per the original MLHD research paper, ~65% of the MLHD contains at least the recording MBID. We have the option to either drop the remaining 35% of the dataset or keep the rows without recording MBIDs as they are. Out of the 65% of the MLHD with recording MBIDs, ~0.301% of the recording MBIDs would have to be dropped (since they’re unknown). This leaves us with: 27bn - (35% of 27bn) - (0.3% of 65% of 27bn) ≈ 17.497bn rows of super high quality data!
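The shrinkage percentages quoted above can be reproduced with a few lines of arithmetic. The row counts are the ones reported for the 15-file test set; the helper name is made up for this sketch.

```python
def shrinkage_pct(before: int, after: int) -> float:
    """Percentage of rows lost when going from `before` to `after` rows."""
    return (before - after) / before * 100


input_rows = 952_229           # rows in the first 15 MLHD files
original_non_empty = 789_788   # original rows with a recording MBID
processed_non_empty = 787_690  # processed rows with a recording MBID

original_shrinkage = shrinkage_pct(input_rows, original_non_empty)
processed_shrinkage = shrinkage_pct(input_rows, processed_non_empty)
absolute_shrinkage = shrinkage_pct(original_non_empty, processed_non_empty)

print(f"original:  {original_shrinkage:.2f}%")
print(f"processed: {processed_shrinkage:.2f}%")
print(f"absolute:  {absolute_shrinkage:.2f}%")
```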

Now similarly, let’s compare the row count shrinkages for different columns.

  1. The count of non-empty recording MBIDs SHRANK by 0.27%.
  2. The count of non-empty release MBIDs EXPANDED by 14.08%.
  3. The count of non-empty artist MBIDs SHRANK by 13.36%.

Given an average processing time of 0.2168s per 10,000 rows, we estimate that processing the entire dataset will take 27,000,000,000 / 10,000 * 0.2168 / 3600 / 24 = 6.775 days, or about 162.6 hours.
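The same estimate, written out step by step as a quick sanity check (the constants are the figures quoted above):

```python
TOTAL_ROWS = 27_000_000_000    # ~27 billion rows in the MLHD
SECONDS_PER_10K_ROWS = 0.2168  # measured average from the test run

total_seconds = TOTAL_ROWS / 10_000 * SECONDS_PER_10K_ROWS
total_hours = total_seconds / 3600
total_days = total_hours / 24

print(f"{total_hours:.1f} hours, {total_days:.3f} days")
```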

Primary Outcomes

  1. The MLHD is currently set to be processed with an ETA of ~7 days of processing time.
  2. I was able to generate various reports to explore the impact of the “artist conflation issue” in the MLHD. These extra insights and reports uncovered a few issues within the MusicBrainz ID Mapping lookup algorithm, which lucifer fixed in metabrainz/listenbrainz-server pull request #2133, “Fix invalid entries in canonical_recording_redirect table”.

Miscellaneous Outcome

How I got picked as a GSoC candidate without a single OSS PR to my name beforehand is beyond me, but with the help of alastairp and lucifer, I was able to solve and merge PRs for 2 issues in the listenbrainz-server as an exercise to get to know the ListenBrainz codebase a little better.

My Experience

This journey has been nothing but amazing! The sheer amount of things that I learned during these past 18 weeks is ridiculous. I really appreciate the fun people and work culture here, which was even more apparent during the MetaBrainz Summit 2022 where I had the pleasure to see the whole team in action on a live stream.

Coming from a Music Tech background and having extensively used MetaBrainz products in the past, it was a surreal experience being able to work with these supersmart engineers who have worked on technologies I could only dream of making. I often found myself admiring my seniors as well as peers for their ability to come up with pragmatic solutions with veritable speed and accuracy, especially lucifer, whose work ethic inspired me the most! I hope some of these qualities eventually rub off on me too 🙂

I’d really like to take the time to thank my mentor, alastairp, for always being super supportive and precise in his advice, and for helping me push the boundaries whenever possible. I’d also like to thank him for being very considerate, even through times when I’ve been super unpredictable and unreliable, and, not to mention, for giving me an opportunity to work with him in the first place!

Suggestions for aspiring GSoC candidates

  • Be early.
  • Ask a lot of questions, but do your due diligence by exploring as much as possible on your own as well.
  • OSS codebases might seem ginormous to most students with limited work experience. Ingest the problem statement bit by bit, and slowly work your way toward potential solutions with your mentor.
  • Believe in yourself! It’s not a mission impossible. You always miss the shots that you don’t take.

You can contact me at:
IRC: “Pratha-Fish” on #Metabrainz IRC channel
Linkedin: https://www.linkedin.com/in/prathamesh-ghatole
Twitter: https://twitter.com/PrathameshG69 
GitHub: https://github.com/Prathamesh-Ghatole

MusicBrainz Android App: Adding BrainzPlayer in Android App

Greetings, Everyone!

I am Ashutosh Aswal (IRC nick yellowhatpro), pursuing my bachelor’s from Punjab Engineering College Chandigarh, India. As a Google Summer of Code’22 contributor, I worked for MetaBrainz, on the MusicBrainz Android app and added a music playback feature to the app, which we call BrainzPlayer.

During the GSoC period, I was mentored by Akshat Tiwari (akshaaatt). Through this post, I will be summarizing my journey throughout the summer with MetaBrainz.

Let’s begin!! ( •̀ ω •́ )✧

Project Description

The project’s target was to introduce BrainzPlayer, a local music playback feature, into the MusicBrainz Android app. After this feature integration, users can play locally saved music directly from the app.

My pull requests.

My commits.

Coding Journey

We started with setting up the Music Service, ExoPlayer, and the related Media APIs, which make playback possible on the device even when the app is in the background.

After this, we defined the Media Source, which accesses our local storage to search the media items and make them accessible within the app.

After accomplishing this, we worked on the notifications feature, which shows the metadata of the currently playing media item, and lets us control the playback, like seek, play, pause, etc., directly from the notification panel without opening the app.

Notification Panel

Next, we worked on a service connector class that contains the functions to deal with the playback commands within the app.

After this, our app was ready to play songs. Now was the time to add some cool UI.

The UI is written in Jetpack Compose, Android’s latest toolkit for building awesome UI. Using Compose we worked on the Player Screen, which contains the playback features.

Now that we had the music playback feature, we worked on different entities: song, album, artist, and playlist.

To achieve this, we introduced a local database within the app. We introduced the various entities, including the required data and logic layer.

We wrote multiple database queries and added repositories for the entities in the data layer. Then we worked on the logic part and created functions that took the data layer into account and would show the result in the UI.

After working on the data and logic layer, we focused on creating the UI for the different entities. Each entity has its own screen, from where the user can play songs. For this, we coordinated with aerozol, and I would like to thank him for coming up with beautiful designs and our BrainzPlayer logo. Then finally, with the designs in hand, we could implement them in Compose.

By the end of the program, we were able to add some animations, and to find and fix bugs.

Finally, the BrainzPlayer feature is merged with the master branch, so we can expect it to go into production soon. \^o^/

Preview of the upcoming feature:

Acknowledgement:

I want to thank my mentor, akshaaatt, for his immense support and guidance. Under his mentorship, I could learn, experiment, and improve my code quality over time.

I am also indebted to the MetaBrainz team for their kind and supportive behavior, which made the journey incredible and unforgettable, and makes me motivated to work with them even beyond.

That’s it from my side.
Thank you for having me !! ヾ(≧▽≦*)o

GSoC’22: Personal Recommendation of a track

Hi Everyone!

I am Shatabarto Bhattacharya (also known as riksucks on IRC and hrik2001 on GitHub). I am an undergraduate student pursuing Information Technology at UIET, Panjab University, Chandigarh, India. This year I participated in Google Summer of Code under MetaBrainz and worked on implementing a feature to send a personal recommendation of a track to multiple followers, along with a write-up. My mentors for this project were Kartik Ohri (lucifer on IRC) and Nicolas Pelletier (monkey on IRC).

Proposal

For readers who are not acquainted with ListenBrainz, it is a website where you can integrate multiple listening services and view and manage all your listens on a unified platform. One can not only peruse their own listening habits, but also find other interesting listeners and interact with them. Moreover, one can even interact with their followers (by sending recommendations or pinning recordings). My GSoC proposal pertained to creating one more such interaction: sending personalized recommendations to your followers.

Community Bonding Period

During the community bonding period, I was finishing up a feature to hide events in the user’s feed and correcting the response of the missing MusicBrainz data API. I had also worked on ListenBrainz earlier (before GSoC): I had worked on small tickets, implemented deletion of events from the feed, and implemented displaying missing MusicBrainz data in ListenBrainz.

Coding Period (Before midterm)

While coding, I realized that the schema and the paradigm of storing data in the user_timeline_event table that I had suggested in the proposal wouldn’t be feasible. Hence, after discussion with lucifer on IRC, we decided to store recommendations as JSONB metadata with an array of user IDs representing the recommendees. I had to rack my brain a bit and polish my SQL skills to craft the queries, with help and supervision from lucifer. There was also a time when the backend code that accepted requests from the client had a very poorly written validation system, and pydantic wouldn’t work properly when it came to assignment after a pydantic data structure had been declared. But after planning the whole thing out, the backend and the SQL code came out elegant and efficient. The PR for that is here.

{
    "metadata": {
        "track_name": "Natkhat",
        "artist_name": "Seedhe Maut",
        "release_name": "न",
        "recording_mbid": "681a8006-d912-4867-9077-ca29921f5f7f",
        "recording_msid": "124a8006-d904-4823-9355-ca235235537e",
        "users": ["lilriksucks", "lilhrik2001", "hrik2001"],
        "blurb_content": "Try out these new people in Indian Hip-Hop! Yay!"
    }
}

Example POST request to the server, for personal recommendation.
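To give an idea of what validating such a request body can look like, here is a minimal sketch using only the Python standard library. It is not the ListenBrainz implementation (which uses pydantic): the class name, the parse_recommendation() helper, the choice of optional fields, and the rule that users must be a non-empty list are all assumptions made for illustration. Only the field names come from the example above.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PersonalRecommendation:
    # Field names mirror the example payload; types and checks are assumed.
    track_name: str
    artist_name: str
    users: List[str]
    blurb_content: str = ""
    release_name: Optional[str] = None
    recording_mbid: Optional[str] = None
    recording_msid: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed rule: a recommendation needs at least one recipient.
        if not self.users or not all(isinstance(u, str) for u in self.users):
            raise ValueError("users must be a non-empty list of user names")


def parse_recommendation(raw_body: str) -> PersonalRecommendation:
    """Parse a JSON request body shaped like the example above."""
    metadata = json.loads(raw_body)["metadata"]
    return PersonalRecommendation(**metadata)
```

An unexpected key in the metadata raises a TypeError from the dataclass constructor, which is a crude but handy way to reject malformed requests in a sketch like this.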

Coding Period (After midterm)

After the midterm, I started working on creating the modal. At first my aim was to create a dropdown menu for the search bar using Bootstrap (as most of the code used Bootstrap rather than being coded from scratch). But after a while I consulted with lucifer and monkey and went for coding it from scratch. I had also planned to implement traversing search results using arrow keys, but the feature couldn’t be implemented in time. Here are screenshots of what was created in this PR.

Accessing menu button for personal recommendation

A modal will appear with a dropdown of usernames to choose for personal recommendation

Modal with usernames chosen and a small note written for recommendation

If one has grokked my proposal, they might already notice that the UI/UX of the coded modal is different from the one proposed. This is because while coding it, I realized that the modal needs to not only look pretty but also go well with the design system. Hence the pills were made blue (the proposed color was green). While I was finishing up the view for seeing recommendations in the feed, I realized that the recommender might want to see the people they had recommended the track to. So I asked lucifer and monkey whether they would like such a feature, and both agreed; hence this UI/UX was born:

Peek into the feed page of recommender

What a recommendee sees

Special thanks to CatQuest and aerozol for their feedback over IRC regarding the UI/UX!

Experience

I am really glad that I got mentorship from lucifer and monkey. Both are people whom I look up to, as they both are people who are not only good at their crafts but are also very good mentors. I really enjoy contributing to ListenBrainz because almost every discussion is a round table discussion. Anyone can chime in and suggest and have very interesting discussions on the IRC. I am very glad that my open source journey started with MetaBrainz and its wholesome community. It is almost every programmer’s dream to work on projects that actually matter. I am so glad that I had the good luck to work on a project that is actually being used by a lot of people and also had the opportunity to work on a large codebase where a lot of people have already contributed. It really made me push my boundaries and made me more confident about being good at open source!

My Google Summer of Code 2022 summary

What and for whom

Organization: MetaBrainz Foundation
Project: MusicBrainz Picard
Mentors: Laurent Monin (zas) & Philipp Wolfer (phw)
Main focus: Introducing single-instance mode in Picard 3.0
GSoC website: Link

What has been done: TL;DR edition

  • Picard works in single-instance mode by default, with an option to force-spawn a new instance
  • Picard accepts not just file paths but also URLs, MBIDs and commands as command-line arguments
  • The command-line arguments are sent to the existing instance (and processed by it) if possible
  • Picard can execute commands passed by the command-line interface; e.g. save all files, show the Picard window or close the app
  • Picard can also load the commands from a text file
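For readers unfamiliar with the idea, here is a generic sketch of single-instance behaviour. It is not Picard’s actual implementation: it swaps in a localhost TCP socket purely for illustration, and all function names are hypothetical. The first process to bind the port becomes the primary instance, and any later process that fails to bind forwards its command-line arguments to it.

```python
import socket

HOST = "127.0.0.1"


def try_become_primary(port: int):
    """Return a listening socket if the port is free, otherwise None."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind((HOST, port))
    except OSError:
        # Another instance already owns the port.
        server.close()
        return None
    server.listen()
    return server


def send_to_primary(port: int, args: list) -> None:
    """Forward command-line arguments to the running instance."""
    with socket.create_connection((HOST, port)) as client:
        client.sendall("\n".join(args).encode("utf-8"))


def receive_args(server: socket.socket) -> list:
    """Accept one connection and return the arguments it sent."""
    conn, _ = server.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").split("\n")
```

A real application would run receive_args() in a loop on a background thread and would have to deal with stale instances, permissions, and platform differences, which is where most of the actual work lies.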

List of pull requests

Single-instance mode

  • Picard#2116: A big commit where the whole single-instance mode for Picard was designed and introduced (only for file paths though)
  • Picard#2135: Fixed problems with exiting the app, caused by Picard#2116
  • Picard#2130: Supported URLs (with MBIDs) and mbid:// links (documented there) can be passed with file paths via CLI to an existing instance (or to a new one)
  • Picard#2137: Supported commands (like QUIT or SHOW) can be passed via CLI to an existing instance

Picard remote commands enhancements

  • Picard#2141: REMOVE_EMPTY & REMOVE_UNCLUSTERED commands added
  • Picard#2142: LOAD command, extending the positional arguments’ functionality, added
  • Picard#2143: FROM_FILE command, executing a command pipeline from a given file, added
  • Picard#2144: CLEAR_LOGS command added
  • Picard#2145: Fixed errors with the FROM_FILE command
  • Picard#2146: WRITE_LOGS command, which saves Picard logs into a file, added

Code refactoring

  • Picard#2080: Removed code explicitly marked as deprecated; my initial commit to get to know Picard’s codebase and workflow
  • Picard#2127: Minor patch, unparsed args are now ignored as they were not used anywhere
  • Picard#2139: Refactored the whole process of passing arguments to Picard, replaced ‘%’-formatted strings with f-strings, and made it possible to pass more than one argument correctly to a command

Other

What have I learnt during GSoC 2022

  • How to work with other people on GitHub
  • How to improve my git experience (e.g. hooks)
  • How one can handle inter-process communication, basically I have researched:
    • pipes
    • named pipes
    • sockets
    • dbus
  • How to use Windows API with Python
  • Differences between Windows and Unix pipes
  • \0 is the only character that is prohibited on both Windows & Unix in path names
  • /tmp is not the recommended way to store non-persistent app data on *nix
  • os._exit might be useful when pythonic threads get broken
  • Importing a tuple in Python is underrated. git diff gets cleaner, as one sees only the additions

Some personal thoughts

  • Python is a really decent language that helps with starting one’s programming journey, but the deeper I went, the more annoyances I encountered (that is why I ended up starting to work as a C++ dev)
  • Ultra-safety is a double-edged sword: good luck terminating Pythonic futures/threads with file operations
  • CI/CD and testing in general are as important as a decent codebase
  • If one can plan their time well, flexible work hours make their work both more effective and more enjoyable
  • Python sometimes changes for the worse or breaks code without any reason (e.g. they have switched from using a mode into w on pipes, ref: LINK)
  • I will not start any new personal project in Python (especially one using multi-threading, multiple processes etc.), unless forced to do so. Nu for scripting, filling the niche & exploring functional programming; some statically-typed languages for bigger projects, games, research, etc.
  • Impostor syndrome is just another excuse to procrastinate. Do not be scared to learn & do new things, but also ask smart questions. Everyone makes mistakes, but if you made it to this org, you are a good fit and have enough qualifications

Special thanks

The whole MetaBrainz community is awesome and I am glad I have become a part of it, but I would like to express my special gratitude to the people I have directly worked with in any way 🙂 (alphabetical order by github username)

GSoC’22: CritiqueBrainz reviews for BookBrainz entities

Greetings, Everyone!

I am Ansh Goyal (ansh on IRC), an undergraduate student from the Birla Institute of Technology and Science (BITS), Pilani, India. This summer, I participated in Google Summer of Code and introduced a new feature, CritiqueBrainz reviews for BookBrainz entities.

I was mentored by Alastair Porter (alastairp on IRC) and Nicolas Pelletier (monkey on IRC) during this period. This post summarizes my contributions made for this project and my experiences throughout the journey.

Proposal

Book reviews are a glimpse into a world you may or may not choose to enter. Reviews give books greater visibility and a greater chance of getting found by more readers. BookBrainz and CritiqueBrainz should enable users to rate and write reviews directly through the website using the CritiqueBrainz API.

For GSoC ’22, I made a proposal that aimed at adding CritiqueBrainz reviews for BookBrainz entities.

Community Bonding

During the community bonding period, my mentors, Alastair Porter and Nicolas Pelletier, helped me create a streamlined pathway to move along with the project. We also worked on various tickets like CB-433, CB-434, and CB-410 and added multiple features.

We decided to first complete reviewing BookBrainz edition groups thoroughly from CritiqueBrainz as well as BookBrainz, and then extend the project to all other entities like literary work, author, and series.

We discussed the various database and structural changes involved in the project, like adding BookBrainz Database in CritiqueBrainz, adding tables in BookBrainz DB, etc., the page designs and overall improvements.

Coding Period

The coding period started with importing the BB database into CB to fetch the required information and perform tests.

Now that the database was set up and ready for us to work on, it was time to write SQL queries for fetching the edition groups and all the other associated information, like identifiers and relationships. I made the code reusable to prevent duplication while fetching data for different entity types. So I opened a PR for it.

So now it was the time to allow users to review an edition group in CritiqueBrainz. For this, I made a few changes in the database, allowed BB entity types, and then added pages for their reviews in this PR. Then I worked on showing the information fetched from the BookBrainz database to the users on their entity pages.

Then to allow users to search edition groups, I worked on adding an option to search BB entities with the help of the search API already present in BB. This feature was implemented in this PR.

After adding support for Edition Groups, it was time to add support for other entity types. This expansion was very smooth because of the reusable components created by then. So I added support for Literary Works, Authors and Series. Later we discovered that the series items were not being ordered correctly, so this was fixed in #460.

During this process, my mentors and I identified some improvements and refactoring opportunities, which were addressed in #445, #451 and #456.

After adding support for all the entities, I added support for showing the relationships between the entities on the respective entity pages. These included showing Author-Work, Work-Edition Group, and Work-Work relationships.

CritiqueBrainz Author’s Page

While enabling CritiqueBrainz to review entities, I was also working on BookBrainz to allow users to view reviews and ratings and post them from the entity’s page.

So I started adding support to fetch and display reviews and ratings for edition groups, which involved creating a route that would handle getting and pushing reviews to CritiqueBrainz.

After this step, it was time to connect BookBrainz with CritiqueBrainz. This involved authentication using OAuth login. To add this feature, I first added a table ‘external_service_oauth’ in the database and then in the ORM.

Then I added routes to allow user login to CritiqueBrainz, handled the callback, and saved the tokens in the database. After that, the next thing was to allow users to post reviews from the entity page. For that, I created a modal similar to the one in ListenBrainz (to maintain consistency).

BookBrainz

After completing my project, I began working on my stretch goals, starting with unifying reviews for entities common to both the BookBrainz and MusicBrainz databases. We decided that if an entity exists in both databases, we show the reviews for all the entities on the entity page (PR).

Overall Experience

I am incredibly grateful to my mentors for their constant guidance and support throughout my project. I learned a lot of technical concepts and improved the quality of my code during this journey. I had a wonderful time interacting with the amazing folks at MetaBrainz and exchanging valuable thoughts during our weekly meetings.

I would love to thank Google and the MetaBrainz Foundation for providing me with this great opportunity!

GSoC 2022: Unified Form Editor for BookBrainz

Hi, I am Shubham Gupta (IRC Shubh), pursuing my bachelor’s at the National Institute of Technology, Kurukshetra. This year I participated in Google Summer of Code and implemented a new editor in BookBrainz.

In this project, I was mentored by Nicolas Pelletier (IRC monkey). The purpose of this blog is to summarize my contribution made for this project and share my experiences along the way.

Before GSoC

I joined MetaBrainz at the end of November’21. Due to my affection for novels, I instantly fell in love with the BookBrainz project. I initially started working on small bug fixes and typo corrections but later shifted to working on more challenging features.

My first challenging task was to pre-fill the current entity editor using POST requests, which was required for user scripts to work. I also created some user scripts to help simplify the creation process.

Later I worked on upgrading react-select and completing the notification feature on BB.

Proposal

For GSoC’22, I made a proposal to work on implementing a unified form on BookBrainz.

My main motivation behind this project was to make the entity creation process more intuitive and simpler for new users. The purpose of this project is to unify all the workflows of entity creation into a simpler book interface. This abstracts away the BookBrainz specifics for users and provides them with an easy interface to work with.

Though a lot has changed since the proposal, from design to implementation details, the main idea behind the project remained unchanged.

Community Bonding

During this period, I worked with my mentor to finalize the design for the editor. This included a lot of back & forth discussion but finally, we ended up picking a base design that was similar to how we choose a book: we first go through the book’s cover and back cover, then its details and inner contents.

I also discussed a new timeline with my mentor which incorporated my university classes and exams. Following this I started implementing the editor from this period itself.

Coding Period

First Phase

By this phase, I completed all the mockups and made relevant changes in design as suggested by community members.

We ended up with the following design. Later, we also added a summary section as a new tab to make reviewing new entities easier.

unified-form mock

I started working on the new editor routes which can support multiple entity submissions for the creation and later added support for editing as well.

Pull requests: #847, #858

Since a lot of the implementation was similar to existing entity routes, the main thing that was missing was to unify them into one API and make it support temporary BBIDs for new entities.

The main idea behind keeping temporary BBIDs was to allow late submission of entities, meaning new entities would only be created when the user submits the whole form. This allowed a user to undo their actions and gave them more granular control over entity creation/modification. But following this approach resulted in a lot of duplicate code which was hard to generalize due to temporary ids; this was later fixed in the second phase.

I completed working on the routes part with suitable tests and started working on the React front-end. I started by setting up a Redux store to handle multiple entity states, after some discussion with my mentor we ended up going with the state design that segregates each entity into their own states.

At the end of this phase, the editor looked something like this

First Phase Screenshot

Second Phase

During this period I continued working on the editor interface. Since that is the meat of the project, it took up most of the project’s time.

Frontend PR: #850

The challenging part of managing a large store like ours was to minimize state updates as much as possible. Since this was so crucial for performance, I spent about a week reading Redux articles and profiling the editor state. All this paid off and resulted in blazingly fast editors (entity/unified-form) with minimal state update calls, which also benefited the existing editing pages.

The solution was to reduce the scope of a redux state by memoizing the components as much as possible and caching the results of expensive calculations which reduced component load time drastically.
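The caching part can be illustrated with a small hand-written helper. This is only a sketch of the general technique (remembering the last input and result), not the code from the PR; `memoizeLast` and `countWords` are hypothetical names.

```typescript
// Remember the most recent argument and result; recompute only when
// the argument is a different reference than last time.
function memoizeLast<A, R>(compute: (arg: A) => R): (arg: A) => R {
  let cache: { arg: A; result: R } | undefined;
  return (arg: A): R => {
    if (cache === undefined || cache.arg !== arg) {
      cache = { arg, result: compute(arg) };
    }
    return cache.result;
  };
}

// Stand-in for an expensive derivation from a piece of state.
let computeCalls = 0;
const countWords = memoizeLast((names: string[]): number => {
  computeCalls += 1;
  return names.join(" ").split(" ").length;
});
```

This only helps when unchanged state keeps the same reference between updates, which is why it pairs well with reducers that copy only what they change.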

After implementing all the entity creation workflows, I moved on to linking the entities, either through relationships or through other attributes.

This linking needs to be automatic, so users don't have to know the relationships. They should also be able to opt out of linking specific entities using the respective checkboxes.

An example of linking entities is Series-Work, where selecting a Work automatically adds it to the Series as an item.

Unified Form Series Editor
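The Series-Work behaviour, including the opt-out, could be sketched roughly as follows. The `SeriesDraft` and `WorkRef` shapes and the `selectWork` function are hypothetical and only meant to show the idea, not how the editor actually implements it.

```typescript
// Hypothetical shapes for illustration only.
interface WorkRef {
  bbid: string;
  name: string;
}

interface SeriesDraft {
  name: string;
  items: WorkRef[];
}

// Selecting a Work links it to the Series unless the user has
// unticked the (assumed) "link to series" checkbox.
function selectWork(
  series: SeriesDraft,
  work: WorkRef,
  linkToSeries: boolean = true
): SeriesDraft {
  const alreadyLinked = series.items.some((item) => item.bbid === work.bbid);
  if (!linkToSeries || alreadyLinked) {
    return series; // nothing to do, keep the same object
  }
  return { ...series, items: [...series.items, work] };
}
```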

We also introduced a major change in the way we submit entities: new entities are now submitted directly to the server. This reduced the duplicate code by half compared to before, since we no longer have to manage temporary IDs. It also reduced the amount of work potentially lost when an error occurs while filling in the form.

I also wrote Mocha/Enzyme tests for the required React components. That is all for the frontend PR.

I then made follow-up PRs to improve the UI and fix bugs: #872, #871, #874

Overall Experience

I enjoyed working together with my mentor on such a large project. I learned a lot during my journey and understood the importance of the different phases of software development. I realized the importance of carefully designing the application and discussing ideas with other team members. I also got to know a lot about testing and why it is so important for large projects like this. Overall, it was the best learning experience I could ask for.

Also, the members of the MetaBrainz Foundation are very supportive and help each other resolve issues. Lastly, special thanks to my mentor Nicolas Pelletier, who helped me a lot during my GSoC journey. He always supported and encouraged me, even when things weren’t looking good. He is truly one of the most amazing people I’ve ever met!

GSoC 2021: Pin Recordings and CritiqueBrainz Integration in ListenBrainz

Hi! I am Jason Dao, aka jasondk on IRC. I’m a third-year undergrad at the University of California, Davis. This past summer, I’ve been working with the MetaBrainz team to add some neat features to the ListenBrainz project.


GSoC’21: MusicBrainz Android App: Dawn of Showdown

Greetings, Everyone!

I am Akshat Tiwari (akshaaatt on IRC), an undergraduate student from Delhi Technological University, India.

It has been an exhilarating experience for me, right from submitting a proposal for GSoC to becoming a part of a fantastic community.

The Google Summer of Code 2021 edition has finally come to an end after a three-month-long journey. Today, I will be detailing my journey of working on my Summer of Code project. This blog post is a summary of all the work done.
