GSoC’23: Dataset Hoster Improvements

Hi Everyone!

I am Vishal Singh (also known as Pixelpenguin on IRC). This year I participated in Google Summer of Code under MetaBrainz and worked on improving the MetaBrainz Dataset Hoster repository. My mentor for this project was Kartik Ohri (lucifer on IRC). This post summarizes the contributions I made to this project.

MetaBrainz Dataset Hoster

The MetaBrainz Dataset Hoster project involves processing data to populate a Python object and subsequently hosting the results as an API. This initiative stems from MetaBrainz’s music recommendation work, aiming to efficiently transform and evaluate datasets for integration into their recommendation tools.

APIs can be created using a simple template class.
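As a rough illustration of the template-class idea, here is a minimal, self-contained sketch. The `Query` base class below is a hypothetical stand-in written for this post, and the method names (`names`, `inputs`, `outputs`, `fetch`) are assumptions that may differ from the actual Dataset Hoster interface.

```python
from abc import ABC, abstractmethod


class Query(ABC):
    """Hypothetical stand-in for the hoster's template class (not the real API)."""

    @abstractmethod
    def names(self):
        """Return a (slug, human-readable name) pair for the query."""

    @abstractmethod
    def inputs(self):
        """Return the list of input field names."""

    @abstractmethod
    def outputs(self):
        """Return the list of output column names."""

    @abstractmethod
    def fetch(self, params):
        """Return result rows (dicts) for a list of input dicts."""


class MultiplicationTableQuery(Query):
    """Toy query: returns the first few multiples of a number."""

    def names(self):
        return ("multiplication-table", "Multiplication table example")

    def inputs(self):
        return ["number", "num_lines"]

    def outputs(self):
        return ["multiplier", "product"]

    def fetch(self, params):
        rows = []
        for param in params:
            number = int(param["number"])
            for i in range(1, int(param["num_lines"]) + 1):
                rows.append({"multiplier": i, "product": number * i})
        return rows
```

A hosting layer could then look up a query by its slug, pass the submitted inputs to `fetch`, and serve the returned rows as HTML or JSON.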

Coding Tasks

Allow users to invoke new dataset hoster queries from results

The idea is to let users walk through the datasets that are in the Dataset Hoster. For example, a user can rerun the similar artists query starting from one of the results shown on the page.

A dropdown is shown next to each result row that contains an identifier, such as a recording MBID. The dropdown lets users use the displayed MBID as the input for a new query within the Dataset Hoster. The new query opens in a new tab, so users can explore the dataset recursively.
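A minimal sketch of how such a link could be built is shown here. The URL layout (`/<query slug>?<field>=<value>`), the function names, and the example values are all assumptions made for illustration, not the routes the Dataset Hoster actually uses.

```python
from html import escape
from urllib.parse import urlencode


def build_query_link(base_url, query_slug, field_name, value):
    """Build a URL that pre-fills one input of another query (hypothetical route)."""
    return f"{base_url.rstrip('/')}/{query_slug}?{urlencode({field_name: value})}"


def render_new_tab_link(url, label):
    """Render an anchor that opens the follow-up query in a new tab."""
    return (
        f'<a href="{escape(url)}" target="_blank" rel="noopener">'
        f"{escape(label)}</a>"
    )


link = build_query_link(
    "https://example.org/",
    "similar-recordings",
    "recording_mbid",
    "8f3471b5-7e6a-48da-86a9-c1c07a0f47ae",
)
print(render_new_tab_link(link, "Similar recordings"))
```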

Types support in Dataset Hoster

This feature allows queries to define the allowed type of each input so that the Dataset Hoster can validate the inputs. For instance, the Similar Recordings query defines an input model that declares the type of each of its inputs.

If the user enters an invalid MBID, the Dataset Hoster returns an error.
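The merged work relies on Pydantic for this (see the pull request list). To keep the example free of third-party dependencies, the sketch below swaps in a standard-library dataclass that performs the same kind of check. The class name and the `recording_mbid` field are hypothetical and do not reproduce the real input model.

```python
from dataclasses import dataclass
from uuid import UUID


@dataclass
class SimilarRecordingsInput:
    """Hypothetical typed input model; an MBID is a UUID."""

    recording_mbid: UUID

    def __post_init__(self):
        # Coerce strings to UUID; uuid.UUID raises ValueError on malformed input.
        if not isinstance(self.recording_mbid, UUID):
            self.recording_mbid = UUID(str(self.recording_mbid))


def validate(raw_mbid):
    """Return (model, None) on success or (None, error message) on failure."""
    try:
        return SimilarRecordingsInput(recording_mbid=raw_mbid), None
    except ValueError as err:
        return None, f"invalid recording_mbid: {err}"
```

With Pydantic, the equivalent model would declare the field type and let the library perform the coercion and error reporting.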

Added select input to dataset hoster

Previously, the Dataset Hoster only handled text inputs. However, sometimes a query wants to specify a list of acceptable values. To do so, the query can declare an enum field, which is rendered to the user as an HTML select field with every value of the Enum as an option.
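The mapping from an Enum to a select field can be sketched roughly as follows. The `Algorithm` enum, its members, and the `render_select` helper are hypothetical names chosen for illustration; the real rendering is done by the Dataset Hoster's own templates.

```python
from enum import Enum
from html import escape


class Algorithm(Enum):
    """Hypothetical set of acceptable values for an input field."""

    SESSION_BASED = "session_based"
    TAG_BASED = "tag_based"


def render_select(field_name, enum_cls):
    """Render every member of an Enum as an <option> inside a <select>."""
    options = "".join(
        f'<option value="{escape(str(member.value))}">{escape(member.name)}</option>'
        for member in enum_cls
    )
    return f'<select name="{escape(field_name)}">{options}</select>'


print(render_select("algorithm", Algorithm))
```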

The implementation also allows for select fields with dynamic options. For example, the ‘algorithm’ field in a similar-recordings query needs to be populated depending on the datasets currently present in the database tables. To support this, queries can fetch the option values from a database or Redis. To keep this efficient, options can either be cached on app startup or fetched per query execution.
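The two strategies mentioned above (cache once, or fetch on every execution) could look roughly like this. `DynamicOptions` and the `loader` callable are hypothetical; in practice the loader would query a database or Redis rather than return a fixed list.

```python
from typing import Callable, List, Optional


class DynamicOptions:
    """Hypothetical helper that supplies select options from a loader callable."""

    def __init__(self, loader: Callable[[], List[str]], cache: bool = True):
        self._loader = loader
        self._cache = cache
        self._cached: Optional[List[str]] = None

    def get(self) -> List[str]:
        if not self._cache:
            # Fetch fresh options on every call (per query execution).
            return list(self._loader())
        if self._cached is None:
            # Load once (for example at app startup) and reuse afterwards.
            self._cached = list(self._loader())
        return list(self._cached)
```

Caching avoids a lookup on every request, at the cost of showing stale options until the application restarts.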

Datetime field support in dataset hoster

For queries that accept a datetime field in their input, we display an HTML datetime input to let the user choose the desired time easily. When using the JSON API version, the user can submit the datetime as either a UTC timestamp or an ISO format date string.
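Accepting both formats can be sketched as below. The `parse_datetime` helper is a hypothetical name, and treating timezone-less ISO strings as UTC is an assumption of this sketch rather than documented behaviour.

```python
from datetime import datetime, timezone


def parse_datetime(value):
    """Parse a UTC timestamp (int/float) or an ISO format string into an aware UTC datetime."""
    if isinstance(value, bool):
        raise TypeError("booleans are not valid datetimes")
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if isinstance(value, str):
        text = value.strip()
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        parsed = datetime.fromisoformat(text)
        if parsed.tzinfo is None:
            # Assumption: naive strings are interpreted as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise TypeError(f"unsupported datetime value: {value!r}")
```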

List of issues

LB-1246: Datetime field support in dataset hoster
LB-1244: Add select input to datasethoster
LB-1249: Allow users to invoke new dataset hoster queries from results
LB-1243: Types support in Dataset Hoster
LB-1247: Improving response format for dataset hoster

List of Pull Requests

#15: Array type support in Web query handler
#14: Queries to be used in dropdown for similar result
#12: Pydantic support in query inputs outputs
#9: Strip space from args
#2533: LB Labs recording similarity: Dynamic selection of options
#2522: Migrate all labs api to have Pydantic support

What’s Left?

The pull requests in Dataset Hoster are merged, so the features discussed above are implemented. However, the migration of the queries affected by these changes is not yet deployed to production. Because updating these queries also requires changes in their respective clients, the changes can only be deployed once both sides have been updated together.

More details can be found in pull request #2522: Migrate all labs api to have Pydantic support

Experience and Learning

I feel this was a once-in-a-lifetime experience for me. MetaBrainz gave me this opportunity and opened the world of open source to me. I am really glad that I got mentorship from lucifer (Kartik Ohri), who helped me through the entire period. There is still a lot I can learn from him; he is really an expert when it comes to the tech stack of MetaBrainz.

The project helped me learn about Python, APIs, Docker, music (and its presence in open source) and many details of the production environment. It was really interesting to learn about the scalability involved in API queries and how the MetaBrainz servers handle them.

Lastly, I want to give a special thanks to Google for organizing such amazing..
