Data – Page 3 – MetaBrainz Blog

Announcing the musicbrainz-data Java library

Stefan Sperber has just announced the open source release of musicbrainz-data, a Java library that uses Hibernate to interface to the MusicBrainz PostgreSQL database:

We at Last.fm are happy to announce that we are open-sourcing musicbrainz-data, our Java data bindings for the MusicBrainz Database.

The source code for musicbrainz-data and information on how to use it in your projects can be found on our GitHub:

https://github.com/lastfm/musicbrainz-data

Please report any issues at

https://github.com/lastfm/musicbrainz-data/issues

If you have any questions, suggestions or feedback please post them in the musicbrainz-devel mailing list. I will also attempt to be available on the #musicbrainz-devel IRC channel on Freenode (nick: stefans).

Thanks Stefan!

Please Help Us Sanity Check More Removals

ocharles has just finished work on MBS-2547 which will result in empty labels and empty release groups also being deleted as part of our daily clean up jobs, just like empty artists are currently deleted. As we’ve never cleaned up empty labels or release groups in the past, there is quite a bit of data that will be deleted in the first run. Before we run this, we would like to share it with the community in case any of this data is important.

The cleanup algorithm is mostly the same as it is for artists. Labels and release groups will only be deleted if:

They have existed for more than 24 hours
They have no open edits, nor any open edits that show up in their edit history
They have no relationships
They have no releases
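
The four conditions above can be read as a single predicate. The following is a minimal sketch only: the `Entity` record and its field names are hypothetical and are not taken from the MusicBrainz schema or from the actual cleanup script.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Entity:
    """Hypothetical stand-in for a label or release group (illustrative fields)."""
    created: datetime
    open_edits: int = 0      # open edits, including those in the edit history
    relationships: int = 0
    releases: int = 0


# The post states that entities must have existed for more than 24 hours.
MIN_AGE = timedelta(hours=24)


def is_removable(entity: Entity, now: datetime) -> bool:
    """Return True only if every condition from the list holds."""
    return (
        now - entity.created > MIN_AGE
        and entity.open_edits == 0
        and entity.relationships == 0
        and entity.releases == 0
    )
```

Under this reading, adding a single relationship is enough to keep an otherwise empty label or release group out of the removal list.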

Here are the lists of labels and release groups that would be removed, if we were to run the script right now:

If you do not want a label or release group to be deleted, please add relationships, as you would with artists or works.

Upgrading our data licenses

A potential customer just nudged me about our use of now deprecated CC licenses and a deprecated Public Domain dedication. We should really address these issues and upgrade our data licenses. We are currently using the following licenses:

Creative Commons attribution, non-commercial, share-alike ver. 2
An old public domain dedication, ~~similar to this deprecated mark~~, from the CC Public Domain Dedication.

I propose that we move to:

Creative Commons attribution, non-commercial, share-alike ver. 3
The CC0 public domain grant

The move to the CC license version 3 allows us to use the latest and most robust version of the Creative Commons license. If you’d like lots more detail on what this change means, please read the CC version 3 license “brief” explanation.

The CC0 grant improves a number of aspects of our old Public Domain dedication, especially in jurisdictions outside the US. Our current dedication isn’t sufficient to renounce any copyright over the data in some countries. The CC0 grant is an improved version that maximizes the global coverage of our rejection of copyright for our data.

Summary in plain English: We’d like to move to updated, more robust licenses. We are not changing what data is available under which license, nor are we taking away any rights that end users already have.

If you have questions or comments, please post them here. If there are no objections to this change, I will make it effective with our May 15th release.

UPDATE: Fixed an incorrect link that Mike from the Creative Commons pointed out. Thanks Mike!

Please tell us what you think about our proposed "no waiting" access to our Web Service

We’re working to add a paid option to our Web Service for commercial users and for end users who would like to have faster access. We’re finally getting close to being able to offer this service and we would like to get some feedback from our users about this.

We are proposing to add “no waiting” access to our web service — this proposed service would:

Allow continuous sequential access to our version 2 web service with no required delay between calls (our current service requires a 1 second delay between calls).
Still have a global rate limit that may temporarily deny callers access to our web service (with 503 responses) if our service gets overloaded. We would work hard to ensure that our service would not reach this limit, since it’s a paid service, but we cannot guarantee that.
Not allow concurrent (more than one call at a time) calls to our web service per user. We reserve the right to terminate your service if we find that you are making concurrent calls to our “no waiting” service.
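
From the client side, the rules above could look roughly like the sketch below: requests go out strictly one at a time, a delay is inserted only when the free service is used, and a 503 response is retried after a pause. The `send` callable, the retry count and the backoff values are assumptions made for illustration and are not part of any MusicBrainz client library.

```python
import time
from typing import Callable, Iterable, List, Tuple


def fetch_sequentially(
    urls: Iterable[str],
    send: Callable[[str], Tuple[int, str]],   # hypothetical: returns (status, body)
    delay: float = 1.0,                        # 1.0 for the free service, 0.0 for "no waiting"
    max_retries: int = 3,                      # assumed value
    backoff: float = 2.0,                      # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch URLs one at a time; never issue concurrent requests."""
    results = []
    for i, url in enumerate(urls):
        # Space out calls when the service requires a delay between them.
        if i > 0 and delay > 0:
            sleep(delay)
        for attempt in range(max_retries + 1):
            status, body = send(url)
            if status != 503:
                break
            # 503 means the global rate limit was hit; wait longer each time.
            if attempt < max_retries:
                sleep(backoff * (attempt + 1))
        if status != 200:
            raise RuntimeError(f"request for {url} failed with status {status}")
        results.append(body)
    return results
```

Passing `delay=0.0` models the proposed paid access, while the default models the existing one request per second rule. Handling of 503 is needed in both cases, since the global limit applies to both.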

Our existing web service will not be affected by this new service — the existing service will remain free and limited to one request per second as it is now. Initially the new service is intended for end-users who wish to have faster access to our web service. Once we’ve ironed out the kinks in this new service we will offer this service for commercial customers as well.

Finally, we’ve set up a very short survey (3 questions only!) to gather some feedback from you about this service. We’re mainly trying to establish a reasonable price for this service. Please take our survey and let us know what you think and how much you’d pay for our service.

All of your responses will be private and the survey does not ask for any information about you. Thanks, we appreciate your thoughts!

User agent based throttling is now live

Yesterday we talked about rolling out our throttling based on User-Agent strings. A few minutes ago we pushed this feature live on our servers so now the updated rules are in effect. python-musicbrainz/0.7.3 users are now allowed 500 requests every 10 seconds and every single one of these requests is constantly being used. No surprise here. 🙂

For the exact details on what is throttled and how to get around your application being throttled, see our rate limiting documentation.
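
To give a feel for what "500 requests every 10 seconds" for one User-Agent means, here is a minimal sketch of a fixed-window counter keyed by User-Agent string. The fixed-window approach and the `UserAgentThrottle` class are assumptions for illustration; the post does not say how the throttling is implemented on our servers.

```python
from typing import Dict, Tuple


class UserAgentThrottle:
    """Hypothetical fixed-window limiter; not the production implementation."""

    def __init__(self, limits: Dict[str, Tuple[int, float]]):
        # Maps a User-Agent string to (max requests, window length in seconds).
        self.limits = limits
        # Maps a User-Agent string to (window start time, requests seen so far).
        self._windows: Dict[str, Tuple[float, int]] = {}

    def allow(self, user_agent: str, now: float) -> bool:
        """Return True if a request from this User-Agent is allowed at time `now`."""
        if user_agent not in self.limits:
            return True  # no specific rule for this User-Agent in this sketch
        max_requests, period = self.limits[user_agent]
        start, count = self._windows.get(user_agent, (now, 0))
        if now - start >= period:
            start, count = now, 0  # the window has expired, start a new one
        if count >= max_requests:
            self._windows[user_agent] = (start, count)
            return False
        self._windows[user_agent] = (start, count + 1)
        return True


throttle = UserAgentThrottle({"python-musicbrainz/0.7.3": (500, 10.0)})
```

With this rule, the 501st request inside a 10 second window is refused, and requests are accepted again once a new window starts.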

Current web service rate limiting documentation

We’ve just added a page that documents what we’re currently blocking on our Web Service. We hope to lift the block on python-musicbrainz/0.7.3 tomorrow and instead throttle the number of requests it can make in a given period of time.

I’ll post another entry once we’re done with making those changes.

Web service user-agent string blocking reminder

I would like to remind Web Service users that on 16 November we’re going to block generic User-Agent strings from accessing our web service. Earlier we said:

The User-Agent string needs to identify the application and the version of the application that is making the request; having a generic User-Agent string like “Java/1.6.0_24” or “PHP/5.3.4” does not allow us to properly identify the application making the requests.

IMPORTANT: 6 months after we release NGS (Nov 16th) we’re going to start blocking common generic User-Agent strings, so please make sure that you send us a proper User-Agent header as part of your request.

You have been warned. 🙂
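
As a rough illustration of the requirement quoted above, the sketch below builds a User-Agent string from an application name and version and checks whether a string looks like one of the generic examples. The helper names, the optional contact suffix and the prefix list are assumptions; the prefix list contains only the two examples from the quote and is not the actual block list.

```python
# Only the two generic examples named in the quote; the real list may differ.
GENERIC_PREFIXES = ("Java/", "PHP/")


def build_user_agent(app_name: str, app_version: str, contact: str = "") -> str:
    """Build a User-Agent that names the application and its version."""
    if not app_name or not app_version:
        raise ValueError("both application name and version are required")
    user_agent = f"{app_name}/{app_version}"
    if contact:
        # Assumed convention: append contact details in parentheses.
        user_agent += f" ( {contact} )"
    return user_agent


def looks_generic(user_agent: str) -> bool:
    """Return True if the string starts with a known generic runtime prefix."""
    return user_agent.startswith(GENERIC_PREFIXES)
```

For example, `build_user_agent("ExampleTagger", "1.2.0")` yields a string that identifies a hypothetical application called ExampleTagger, whereas the default string sent by a language runtime would be flagged by `looks_generic`.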

Google uses MusicBrainz data in some of its searches!

Earlier this week I met with Shawn Simister, who works on Google’s Freebase project (formerly of Metaweb), to touch base about how MusicBrainz is being utilized inside of Google. MusicBrainz represents a large chunk of the music data in Freebase and in turn the Freebase data is used as one of the sources of data for Google’s search.

Shawn explains this in more detail:

You can actually see a couple areas where we’re using the Freebase music data publicly. First, in the structured refinements in search: if you search for lady gaga albums and scroll to the bottom, you will see “Album searches for Lady Gaga”. Also you can see videos clustered by topic in YouTube Topics and many of the topics are music-related.

It’s important to keep in mind that MusicBrainz is just part of the solution. It’s a pretty big part of Freebase music data and therefore it’s likely to be a pretty big component in these results, but as you know the search results team at Google is pretty secretive about what all goes into the results page, so even I can’t tell for certain when they’re using Freebase/MusicBrainz data for any given result.

I think it’s important that people don’t mistake this as a one-to-one relationship between MusicBrainz data and Google results because there are quite a few steps in between, but there’s definitely a strong connection there and we really appreciate everything that the MusicBrainz community is doing and hope that the MusicBrainz community continues to grow.

I find this tremendously exciting to hear, since I proposed a very similar thing to Google many years ago. While this idea was rejected back in the day, I’m excited to see that Google is now using our data for its searches. Every person who has ever contributed to MusicBrainz should be proud!

Thank you to everyone and thank you Shawn for shedding some light on this!

New NGS Data

We’ve pushed out the latest and greatest NGS data set (2011-02-22), updated test.musicbrainz.org with this data and restarted the NGS replication. If you’re testing the replication for NGS, you will need to import this new data set.

A new virtual machine with the latest NGS code and the data set is being pushed out to the FTP server now. More on that tomorrow.

VLC, we love you, but we're seeing too much of you!

Dear VLC:

In the last month you grew to more than 25% of our traffic and this is hard for us to handle. We’ve limited the number of requests from VLC to 100 requests per 10 seconds. We need to find a way to help cover our costs for all of this traffic.

In other words, we need to sit down and chat, VLC. Please contact us!

Respectfully,

MusicBrainz