A common problem for users of MusicBrainz is that of synchronizing a local collection against the main MusicBrainz servers. Our current rate limit stipulates that you make at most 1 request per second, which we understand is extremely limiting – especially if you’re trying to fetch thousands of releases! During our first hack weekend, we created the beginnings of a service to allow you to get a list of MBIDs that have been updated. We have finished the preliminaries of this service, and now we need to hear from you how you’d want to utilize this.
Change Logs
The most basic data we currently gather is an hourly JSON document listing the MBIDs that have changed. For each of our data replication packets, we generate a JSON packet that summarizes all of the MBIDs that have changed, either directly or indirectly (such as through the addition of new relationships).
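To make the raw feed concrete, here is a minimal Python sketch of how a client might pull a range of hourly packets and intersect them with a local library. The feed URL and the packet layout used below are assumptions made for illustration, not the documented format of the service.

```python
# Sketch of consuming the hourly change packets; the URL and the
# {"mbids": [...]} layout are assumptions, not the real feed format.
import json
import urllib.request

FEED_URL = "https://example.org/changed-mbids/{hour}.json"  # hypothetical

def changed_mbids(hours):
    """Merge the MBIDs listed in a range of hourly change packets."""
    changed = set()
    for hour in hours:
        with urllib.request.urlopen(FEED_URL.format(hour=hour)) as response:
            packet = json.load(response)
        changed.update(packet.get("mbids", []))
    return changed

# Intersect the feed with a local library to find what needs re-fetching.
library = {"b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"}  # an example MBID
stale = library & changed_mbids(["2013-08-03-14"])  # hypothetical hour key
```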
A “What’s Changed?” Service
The first piece of feedback we received was that people were not really interested in consuming this data stream, but would rather have a service that allows them to query what data has changed in a given window of time. Having to manually fetch packets and perform set intersections is not particularly difficult, but the more hoops people have to jump through, the less likely they are to even use the service. We’ve been pondering how best to implement this service, and we would like feedback on the following options:
1. Filter a list of MBIDs
The service would allow you to POST a set of MBIDs, and it would return the subset of those MBIDs that have changed. You would be able to specify any date and receive all changes since that date; for example, you could find all changes to the releases in your library since you last checked two weeks ago. Because every MBID takes 36 bytes to submit, there will be a limit on the number of MBIDs that can be submitted per request in order to conserve bandwidth (see the first sketch after this list).
2. Provide client libraries
Rather than having people craft their own web service requests, MusicBrainz would provide a library to do this. This would allow us to use more advanced techniques (for example, Bloom filters) to both conserve bandwidth and allow for larger queries (see the second sketch after this list). In this scheme the web service would be documented, but users would not be expected to consume it directly.
3. Support Both!
MusicBrainz could offer a simplified API, which is based on option 1, while also supporting larger queries through option 2. For example, we might limit option 1 to a maximum of 4000 MBIDs per request/response, while the service that depends on our client libraries could handle many more.
4. Allow filtering based on collections
MusicBrainz already has the concept of collections, each of which has a unique identifier, so these could be used to filter the list of changes. This limits the service to dealing only with releases, and it would require people to set up collections before they can run queries. Again, because collections can be large, responses will likely be paginated – though the per-page limit will probably be fairly high.
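To give a feel for how option 1 might look from the client side, here is a rough Python sketch. The endpoint, parameter names, response shape, and the per-request chunking are all hypothetical, lifted from the proposal above rather than from any finished API.

```python
# Client-side sketch of option 1: POST batches of MBIDs plus a cut-off date
# and receive back the subset that has changed.  The endpoint, field names,
# and response shape are hypothetical.
import json
import urllib.request

ENDPOINT = "https://example.org/ws/changed"  # hypothetical
MAX_MBIDS_PER_REQUEST = 4000                 # proposed limit from option 3

def changed_since(mbids, since):
    """Return the subset of `mbids` changed since the ISO date `since`."""
    mbids = list(mbids)
    changed = set()
    for start in range(0, len(mbids), MAX_MBIDS_PER_REQUEST):
        chunk = mbids[start:start + MAX_MBIDS_PER_REQUEST]
        body = json.dumps({"since": since, "mbids": chunk}).encode("utf-8")
        request = urllib.request.Request(
            ENDPOINT, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request) as response:
            changed.update(json.load(response)["changed"])
    return changed
```

Chunking on the client keeps each request under the proposed limit while still allowing an arbitrarily large library to be checked.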
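To illustrate why option 2 mentions Bloom filters, the second sketch below shows how a client library could summarize a large set of MBIDs into a compact bit array instead of shipping every 36-byte ID. Everything here, from the sizing formulas to the idea that the server would test its change list against the uploaded filter and return the (possible) matches, is an assumption about how such a scheme could work.

```python
# Toy Bloom filter: describes a set of MBIDs in far fewer bytes than the
# raw 36-byte IDs, at the cost of a small false-positive rate.
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas for a Bloom filter.
        self.size = max(1, int(-expected_items * math.log(false_positive_rate)
                               / (math.log(2) ** 2)))
        self.hash_count = max(1, int(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two digests.
        h1 = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Example: summarize a large library, ship `bloom.bits` to the server, and
# let the server return the changed MBIDs that test positive in the filter.
bloom = BloomFilter(expected_items=100_000)
bloom.add("b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d")
print("b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d" in bloom)  # True
```

At a 1% false-positive rate, a filter over 100,000 MBIDs comes to roughly 120 KB, compared with about 3.6 MB of raw MBIDs; the client then simply discards the handful of false positives locally.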
These are the ideas that we’ve been debating, and we’d love to know which of these would work for you. If you have other ideas, we’re also very interested in hearing what those are!
I think of this as an advanced service, so a simplistic API is not necessary. I would like Bloom filters and collections. Personally I would probably use collections, but there are use cases where this leads to gigantic collections, which might have other problems.
The normal user is probably only interested in his own collection anyway, and querying by collection is simple enough.
I’d prefer #1, not so fussed about #2, since if the documentation’s good enough I can just use #1 for everything.
I’ll typically be querying about 1000 MBIDs at a time, so I’m not too worried about a proposed limit of 4000.
I think #4 could work, but only in addition to #1. I wouldn’t want to rely on users having collections for using changed MBIDs with http://wiki.musicbrainz.org/User:LordSputnik/Warp.
I think options 3 and 4 make the most sense: presumably the API and the library would each be developed with the other in mind. So there’d be an efficient library in, say, Python, but an open, documented API would still allow me to write my own library in Lisp or Intercal or whatever if I wanted to.
As for #4, ideally it should be possible to send either a (relatively small) number of MBIDs, or a collection ID (which potentially represents a much larger number of MBIDs).
It’s useful to be able to ask “what has changed in my collection since last month?”, but I also find myself re-tagging one song or one album at a time. Call it a hundred MBIDs for the release group, release, tracks, works, and artists, on a box set.
If I can’t submit those 100 MBIDs in one operation, I’ll work around it by defining a temporary collection, querying that collection, and deleting the collection. This seems inefficient for small sets of MBIDs, but I don’t know how small “small” is. I bet it also depends on whether they’re popular and changing MBIDs, or old and stable ones.
This is a really awesome idea — kudos to the MB team for pursuing this, whatever form it takes.
I actually think options #1 and #4, as presented, make the most sense. A very simple API like #1 is likely to cover many use cases, especially if the list of MBIDs can be compressed (gzipping of POST data). If the limit is, say, 4k MBIDs, this will still be a great improvement over checking one at a time and will be sufficient to make many large queries fast enough. Similarly, reusing the collections is a very nice solution and has the advantage of transferring *zero* MBIDs in the common case.
Although I’d be interested to hear more about how it would work, a more complex API seems to have only marginal benefit over the other options while taxing limited developer resources. The project would have to create and support several different client libraries, I imagine. This time could likely be better spent working on something simpler but making it work well.
A change log containing the MBIDs of modified artists would be extremely useful for muspy, and I believe for any website/service that needs only a small subset of the MB database. Whatever option you select for the advanced service, please also expose the raw change log.
Alexander: The change feed that we listed is exactly what you ask for. Have you taken a look?
This is great! At Last.fm, we regularly import data from MusicBrainz, and we aim to have every artist, release, and recording MBID linked to an entity in our catalogue. For this, we added custom triggers to our local copy of the MusicBrainz DB to find what changed with the last update. I had a quick look at the JSON change feed and I’m happy to work on a Java library that accesses it and hides its details.
Also, from time to time we need to compare all MBIDs in our catalogue against the MBIDs in the MusicBrainz DB (updates may fail and we end up with outdated/missing MBIDs). I’m interested to see if the proposed solution with Bloom filters can handle and simplify this.
For my customers, Option #1 would probably suffice with a limit of 4000 MBIDs, as a 40,000-song library would still only take 10 seconds to check for all songs that have changed. A web service call that supports Option #2 (Bloom filters might be useful), together with a client lib to simplify things, would also work, but unless a Java lib was provided it would be of no use to me, and I don’t expect that to be the preferred language of the mb-dev team.
Option #4 is not terribly useful, as it depends on users using MusicBrainz collections. Most of my tagger customers do not take advantage of MusicBrainz collections.
mayhem: The change feed looks perfect, thank you! I’ll update muspy to use it asap.