Give me music that I like.
When you start discovering yourself, just know that you are at the right place and with the right people.
MetaBrainz is the one for me!
I am Vansika Pareek (pristine__ on IRC), an undergraduate student at the National Institute of Technology, Hamirpur, India. I have been working on the ListenBrainz-Labs project for MetaBrainz as a participant in Google Summer of Code ’19. The end of GSoC ’19 is a beginning for me. Cheers!
How did it all start?
I learned about MetaBrainz from a close friend of mine, and the idea behind the organisation caught my attention. I started familiarizing myself with the codebase of ListenBrainz towards the end of December ’18. ListenBrainz-Server is a project of MetaBrainz which keeps track of songs listened to by the users and makes this incredibly useful music usage data available to the world. Presently, we have around 270M listens in our database with 7k users. Whoa!
My first PR was against LB-398 which integrated “remember-me” functionality in flask sessions so that the user is not logged out even after the session expires or the browser window is closed. The PR also introduced a schema change to create alternate tokens for every user which can be used to force log out some/all users (modify a user’s alternate token and the remember-me cookie will expire).
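To make the alternate-token idea concrete, here is a minimal sketch in plain Python. It is not the code from the PR: the in-memory `users` dictionary, the cookie layout and every function name are assumptions made up for illustration. It only shows why changing a user's alternate token makes old remember-me cookies stop working.

```python
import secrets

# Hypothetical in-memory stand-in for the user table (the real schema is not shown here).
users = {"alice": {"alternate_token": secrets.token_hex(16)}}


def issue_remember_cookie(user_name):
    """Build the value stored in the remember-me cookie at login time."""
    return f"{user_name}:{users[user_name]['alternate_token']}"


def is_cookie_valid(cookie):
    """Honour a cookie only while its token matches the one stored for the user."""
    user_name, _, token = cookie.partition(":")
    user = users.get(user_name)
    return user is not None and secrets.compare_digest(token, user["alternate_token"])


def force_logout(user_name):
    """Rotate the alternate token so every cookie issued earlier stops matching."""
    users[user_name]["alternate_token"] = secrets.token_hex(16)


cookie = issue_remember_cookie("alice")
print(is_cookie_valid(cookie))  # the cookie matches the stored token
force_logout("alice")
print(is_cookie_valid(cookie))  # the old cookie no longer matches
```

The design choice worth noticing is that the cookie never needs to be touched on the user's machine: changing one column on the server side is enough to invalidate it.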
I had around 21 commits in ListenBrainz-Server by February ’19. They covered open issues, bug fixes and a little cleanup. I started working on ListenBrainz-Labs in parallel.
ListenBrainz had recently shifted from Google BigQuery to Apache Spark for fetching user statistics. My first significant PR in ListenBrainz-Labs was to write queries to get user statistics. The stats include top artists listened to by the user, top tracks listened to by the user, etc. These statistics are computed on a server that uses pyspark to handle big data, so to make them available to users they had to be transported to a different server. The next PR built a producer using RabbitMQ to push the data into a queue and carry it to its destination. The destination here was the ListenBrainz server, for which I opened a PR to consume this data and make entries in the database. Hurrah! We just connected two servers.
By this time I was sure about my summer project. Bang!
Apache Spark is a great tool for handling and processing big data. Listening history of users can be used in remarkable ways (which is kind of obvious). So we decided on a very cool and interesting project: An open source music recommendation engine. A number of changes were introduced in the base proposal, which will be reflected later in the post.
Community bonding period
Well begun is half done! During this period, my mentor Robert Kaye made sure that we defined a clear path for the project. The first step was to get hands-on with Spark. Apache Spark was a new technology for me, so I spent most of my time during this period learning Spark and optimizing its performance.
Rob broke the project into simple steps to help me move with it. LB-440 and how-to-guide helped me significantly to come up with a working plan. There were three main steps around which the whole project revolves.
Step 1: Listens in ListenBrainz are stored in the following format:
'artist_mbids', 'artist_msid', 'artist_name', 'listened_at', 'recording_mbid', 'recording_msid', 'release_mbid', 'release_msid', 'release_name', 'tags', 'track_name', 'user_name'
Listens should be used to know how many times a user listened to a particular track. We called this the pre-processing step.
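Here is a tiny sketch of what this pre-processing step boils down to, in plain Python rather than Spark. The sample listens are made up, and using `recording_msid` to identify a track is my assumption for the example; the real scripts work on the parquet files in HDFS.

```python
from collections import Counter

# Made-up listens carrying only a few of the fields listed above.
listens = [
    {"user_name": "rob", "recording_msid": "rec-1", "track_name": "Song A"},
    {"user_name": "rob", "recording_msid": "rec-1", "track_name": "Song A"},
    {"user_name": "rob", "recording_msid": "rec-2", "track_name": "Song B"},
    {"user_name": "vansika", "recording_msid": "rec-1", "track_name": "Song A"},
]


def count_plays(listens):
    """Map each (user_name, recording_msid) pair to the number of times it was played."""
    return Counter((listen["user_name"], listen["recording_msid"]) for listen in listens)


play_counts = count_plays(listens)
for (user_name, recording_msid), plays in sorted(play_counts.items()):
    print(user_name, recording_msid, plays)
```

These play counts are what fill the cells of the user-item matrix used in the next step.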
Step 2: Train a model on this data using Spark’s machine learning library (MLlib). To be specific, we planned to use collaborative filtering to find tracks that a user may like. Spark implements collaborative filtering with ALS (Alternating Least Squares).
Step 3: Use the model from step 2 to generate recommendations for users.
But what the heck is ALS? Let us find out…
To understand ALS we first need an introduction to matrix factorization. Consider a matrix P of size U×I, where U is the number of users and I is the number of items. Each cell in this matrix represents a user’s preference for an item. Matrix factorization (or matrix completion) attempts to directly model this user-item matrix by representing it as a product of two smaller matrices of lower dimension. Thus, it is a dimensionality-reduction technique.
If we want to find a lower dimension (low-rank) approximation to our user-item matrix with the dimension k, we would end up with two matrices: one for users of size U×k and one for items of size I×k. These are known as factor matrices. If we multiply the user factor matrix by the transpose of the item factor matrix, we reconstruct an approximate version of the original P matrix, which simply means that the empty cells now hold a value telling us how much the user is likely to like the item.
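Written in symbols, the factorization looks like this. The names X (user factors) and Y (item factors) are my own labels for the two factor matrices; they do not come from the ListenBrainz code.

```latex
P \approx X Y^{\top}, \qquad
X \in \mathbb{R}^{U \times k}, \quad
Y \in \mathbb{R}^{I \times k}, \qquad
\hat{p}_{ui} = x_u^{\top} y_i = \sum_{f=1}^{k} x_{uf}\, y_{if}
```

Here x_u is the row of X for user u, y_i is the row of Y for item i, and the predicted preference for an empty cell (u, i) is just the dot product of those two rows.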
As the name suggests, ALS alternates between the U×k and I×k factor matrices. In each iteration, one of the user‐ or item‐factor matrices is treated as fixed, while the other one is updated using the fixed factor and the rating data. Then, the factor matrix that was solved for is, in turn, treated as fixed, while the other one is updated. This process continues until the model has converged (or for a fixed number of iterations).
It is a really nice explanation of ALS, and the credit goes to the book Machine Learning with Spark. For a full explanation, refer to Chapter 4.
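The alternation is easier to see in a toy example. The sketch below is plain Python with k = 1, so each factor is a single number and every update has a closed form. It is not what Spark MLlib does internally; the data, the function name `als_rank1` and the regularisation value are all made up for illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20):
    """Toy ALS with k = 1. `ratings` maps (user, item) to a value for observed cells only."""
    x = [1.0] * n_users  # user factors
    y = [1.0] * n_items  # item factors
    for _ in range(iterations):
        # Treat the item factors as fixed and solve for every user factor.
        for u in range(n_users):
            cells = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            x[u] = sum(r * y[i] for i, r in cells) / (reg + sum(y[i] ** 2 for i, _ in cells))
        # Now treat the user factors as fixed and solve for every item factor.
        for i in range(n_items):
            cells = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            y[i] = sum(r * x[u] for u, r in cells) / (reg + sum(x[u] ** 2 for u, _ in cells))
    return x, y


# A 3x3 rank-1 matrix (rows: users, columns: items) with cell (0, 2) hidden.
# The full matrix would be [[2, 1, 3], [4, 2, 6], [6, 3, 9]].
ratings = {
    (0, 0): 2.0, (0, 1): 1.0,
    (1, 0): 4.0, (1, 1): 2.0, (1, 2): 6.0,
    (2, 0): 6.0, (2, 1): 3.0, (2, 2): 9.0,
}

x, y = als_rank1(ratings, n_users=3, n_items=3)
predicted = x[0] * y[2]
print(f"prediction for the hidden cell: {predicted:.2f}")
```

The prediction for the hidden cell should land near 3, the value we left out. That filled-in cell is exactly the kind of "value in the empty cells" that the factorization gives us.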
I made a few PRs during this period to prepare the base for the project.
First coding period
ListenBrainz-Labs uses the Hadoop Distributed File System as storage with Spark. Listens were already stored in HDFS (a parquet file for each month stored in directories sorted according to the year) before the beginning of this project. Using these listens, I wrote three basic scripts to generate recommendations for users. The bottleneck here was time. The first draft of the scripts took 14 hours to execute for 200M listens. Phew!
Well, we now do it in minutes. Yay! Want to know how? Keep reading…
I did not understand the intricacies of parallel computing until I spent a whole day staring at the screen, watching how slowly each task came to an end. For those three scripts, I used the RDD API of Spark. Boom! Here is the clue. The DataFrame and Dataset APIs are usually much faster than the RDD API because their queries go through a magic potion called the Catalyst optimizer. Hence, my next job was to use the DataFrame API. It really helped: the time reduced to 4 hours.
To explain to Rob and other community members what exactly was happening under the hood, we decided to generate HTML files for each script. These files really helped in getting community feedback about the quality of recommendations. During the early phase of coding, Rob explained to me the importance of exception handling. We spent a lot of time on catching exceptions and raising understandable error messages for the user/programmer.
The recommendations gave users music which they DID NOT LIKE. Huh! Let us find a solution in the next phase.
Second coding period
After we have trained the model, we need to give ALS tracks to choose from. But wait! What should these tracks be like? If we feed all the 200M tracks to ALS, no matter how beautiful our model is (the one with the lowest RMSE), the results would be unexpected. Also, it would take a lot (a lot!) of time. What to do? Brainstorming!
(While all this was happening, I was trying to figure out configurations for our Spark cluster. How many executors, driver memory, etc. Too much math!)
Rob came up with the super amazing idea of candidate sets. So what are these candidate sets? For instance, every track that user A has listened to belongs to artists in list S.
List S = [artist 1, artist 2, …, artist m]
The tracks that we were feeding to ALS belonged to all the artists in the ListenBrainz DB, including artists that user A has not even heard of. It is very likely that user A won’t prefer songs by such artists. So for every user, we should pick out the artists that the user has listened to, collect the tracks associated with these artists and generate pretty recommendations. Sounds cool?
There is a flaw in this approach. Let us dive deep!
If we only recommend tracks by artists that a user has already listened to, we will never be able to recommend tracks by artists who are new to the user and whom the user may like. Also, our main goal of promoting new artists would be left incomplete. Similar artists came to our rescue. After much discussion, we decided to generate candidate sets through the following steps:
- Get top X artists listened to by a user in a time window.
- Get top Y artists similar to these top X artists.
- For similar artists, we used the MusicBrainz project of MetaBrainz. Rob made sure that the artists similar to each artist were shipped to HDFS ❤
- Generate two candidate sets for each user. The first candidate set contains all the tracks belonging to the top artists and the second candidate set contains all the tracks belonging to the similar artists in a time window.
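The steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Spark code from the PR: the data is made up, all names are hypothetical, the listens are assumed to be already restricted to the time window, and "top Y similar artists" is read here as up to Y new artists per top artist.

```python
from collections import Counter


def top_artists(user_listens, x):
    """Return the X artists the user listened to most (listens already limited to the window)."""
    counts = Counter(listen["artist_name"] for listen in user_listens)
    return [artist for artist, _ in counts.most_common(x)]


def similar_artists(top, similarity, y):
    """Pick up to Y artists similar to each top artist, skipping artists already chosen."""
    seen = set(top)
    result = []
    for artist in top:
        picked = 0
        for candidate in similarity.get(artist, []):  # assumed to be ordered, most similar first
            if picked == y:
                break
            if candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
                picked += 1
    return result


def candidate_sets(user_listens, all_tracks, similarity, x, y):
    """Return (tracks by top artists, tracks by similar artists) for one user."""
    top = top_artists(user_listens, x)
    similar = similar_artists(top, similarity, y)
    top_set = {track for track, artist in all_tracks if artist in top}
    similar_set = {track for track, artist in all_tracks if artist in similar}
    return top_set, similar_set


# Made-up data: user A listened to artist A three times, B twice and C once.
user_listens = [{"artist_name": name} for name in ["A", "A", "A", "B", "B", "C"]]
all_tracks = [("a1", "A"), ("a2", "A"), ("b1", "B"), ("c1", "C"),
              ("d1", "D"), ("e1", "E"), ("f1", "F")]
similarity = {"A": ["D", "B"], "B": ["E", "F"]}

top_set, similar_set = candidate_sets(user_listens, all_tracks, similarity, x=2, y=1)
print(sorted(top_set), sorted(similar_set))
```

Keeping the two sets separate means recommendations drawn from familiar artists and from new-to-the-user artists can be looked at on their own.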
Hurrah! Recommendations generated using the candidate sets were liked by the community. Here is a PR for the same. The recommendations are ready to be shipped. Hang on! Where to store them?
Hint! Remember the pipeline built using RabbitMQ to connect ListenBrainz-Server and ListenBrainz-Labs?
We decided on using the same pipeline after changing the queue and exchange so that data packets meant for two different consumers don’t collide. The road is all set. Now we just need to prepare the destination. I spent most of this period chatting with Rob about what changes should be made to the schema, whether creating a new schema would be a better option, what sort of tables should be part of the schema, etc. After much discussion, this PR created a new schema to store recommendations.
All this while, many PRs were opened and merged in ListenBrainz-Labs which addressed minor bugs and mostly Spark issues. In total, I have 31 commits in ListenBrainz-Server and 102 commits in ListenBrainz-Labs.
A new beginning ❤
Google Summer of Code and MetaBrainz are among the best things that have ever happened to me. I have seen myself change every day. I started to think like a developer and not just a programmer. I have learnt how to write code which is readable and usable in the long run. I saw the hesitation in me to interact with people reduce day by day. I have learnt to always be open to change. MetaBrainz is like a second home!
I have learnt to not forget about valleys while looking at the mountains.
As I have mentioned earlier, the end of GSoC is just a beginning. There are things left to be completed to let this project breathe. I look forward to continuing to work on this project throughout the year.