General update: What's up with TRM??

This general update is way overdue. A lot of things have been happening behind the scenes and it’s time to let everyone know where things in the MusicBrainz world are headed. I’ll start off with TRM, since that is a hot discussion topic on the musicbrainz-users mailing list right now.

The TRM server is constantly overloaded and can only handle a database size of about 2.2GB before it crashes. (TRMs are the acoustic fingerprints that MusicBrainz uses to identify music tracks.) To prevent crashes, we prune the database by throwing out the least used TRMs, which implicitly discards work that our users have done. Not good. For the TRM server to perform at a reasonable level, the entire database needs to be kept in RAM. Thus our server has 5GB of RAM and it still can’t keep up. The fact that this problem hasn’t reared its ugly head in public is a testament to Dave Evans’ skill in keeping the TRM server ticking.
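To make the cost of pruning concrete, here is a minimal sketch of "throw out the least used entries". It is not the TRM server's actual code: the function name, the in-memory dictionary of lookup counts, and the sample ids are all assumptions made for illustration.

```python
def prune_least_used(lookup_counts, max_entries):
    """Keep only the max_entries most-used ids; return (kept, discarded).

    lookup_counts is a hypothetical dict mapping a fingerprint id to the
    number of times it has been looked up.
    """
    # Rank ids by lookup count, most used first; ties are broken by id
    # so the result is deterministic.
    ranked = sorted(lookup_counts, key=lambda trm: (-lookup_counts[trm], trm))
    kept = {trm: lookup_counts[trm] for trm in ranked[:max_entries]}
    discarded = ranked[max_entries:]
    return kept, discarded


# Made-up sample data: four fingerprints, room for only two.
counts = {"trm-a": 120, "trm-b": 3, "trm-c": 45, "trm-d": 1}
kept, discarded = prune_least_used(counts, max_entries=2)
print(sorted(kept))  # ids that survive the prune
print(discarded)     # ids whose user submissions are lost
```

Everything in `discarded` is a fingerprint that some user submitted, which is why pruning is only a stopgap.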

Furthermore, TRMs have shown themselves not to be as unique as we would’ve liked. For example, take a look at the “TRMs with at least 5 tracks” report: 4400 pages (!) of TRMs that I would consider to be sub-optimal. One example TRM (non-silence, on page 2) has 104 tracks associated with it. Given this, TRM is not some sort of magical solution that tells the tagger with great authority what metadata to apply to a track. Instead, it’s best to think of TRM as a system that lets you guess which few dozen tracks a file could be matched to; there is a lot of logic in the tagger that makes up for the shortcomings of TRM.
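The following sketch shows the kind of follow-up logic a tagger needs when one fingerprint maps to several tracks. It is not the MusicBrainz Tagger's real heuristic: the choice of signals (title similarity and track length), the 30-second cap, and the sample tracks are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def pick_track(candidates, file_title, file_duration_s):
    """Pick the most plausible track among candidates sharing one fingerprint."""

    def score(track):
        # Similarity of the titles, between 0.0 and 1.0.
        title_sim = SequenceMatcher(
            None, track["title"].lower(), file_title.lower()
        ).ratio()
        # Penalty grows with the duration gap and is capped at 1.0
        # once the gap reaches 30 seconds (an arbitrary choice).
        gap = abs(track["duration_s"] - file_duration_s)
        duration_penalty = min(gap / 30.0, 1.0)
        return title_sim - duration_penalty

    return max(candidates, key=score)


# Hypothetical tracks that all share one fingerprint.
candidates = [
    {"title": "Intro", "duration_s": 62},
    {"title": "Interlude", "duration_s": 95},
    {"title": "Outro", "duration_s": 61},
]
best = pick_track(candidates, file_title="outro", file_duration_s=60)
print(best["title"])
```

The fingerprint only narrows the field; the existing metadata makes the final call.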

Thus, TRM has two major problems: it’s not accurate enough and it doesn’t scale well to the size that MusicBrainz has grown to. The system still functions, but I expect it to start breaking down and to become less useful over time. We have the following options:

  1. Find a replacement for TRM: Relatable doesn’t seem to be in business anymore, or at least they are in deep hibernation. No other companies that I have approached were interested in sharing their technology with MusicBrainz. (For the record, I’ve tried with 3 companies, including a couple of on-site visits in Europe).
  2. Create our own TRM solution: This is a very large endeavour, at least a year, if not two, of hard work. I’d rather work to improve MusicBrainz itself than hack on acoustic fingerprint software.
  3. Throw more resources at TRM: We’re still lacking the funds for more resources, and the same argument as in #2 still applies.
  4. Do something else: Find some technology that can replace TRM.

Given my babbling about Lucene, I think it’s a foregone conclusion that #4 is the way to go. Sometime this fall, I will release a Picard tagger with a Lucene text indexing engine to replace the current MusicBrainz Tagger. The benefits of this new tagger will be:

  1. It will distribute the load on the server, since currently a large chunk of the server load goes to supporting tagger users. And a large chunk of tagger users never really contribute data to MusicBrainz or make cash donations to support the project. So, moving that traffic off the main server will allow people who want to edit/vote on the data to focus on their work.

    Given that most files in the wild nowadays have some metadata, a text index will work well. Lucene is great at taking crappy data input and coming up with something useful. If TRM gets us into the ballpark and then additional heuristics do the final legwork, Lucene will give us a much better guess to start with than TRM ever did. Thus, overall tagging quality will improve greatly.
  2. A Lucene tagger will work much faster than the TRM-based tagger ever did. 2-5 seconds per track was not unusual with TRM; with Lucene we’ll see 2-5 tracks per second, if not more.
  3. Since we will no longer have to decode files to identify them, it will be easier for us to support new formats. It’s less work overall.
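To illustrate why a text index copes with messy metadata, here is a toy inverted index in Python. It is a sketch of the general idea only, not Lucene and not Picard's code: the tokenizer, the token-overlap scoring, the function names, and the sample data are all assumptions, and Lucene's real analysis and ranking are far more sophisticated.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(tracks):
    """Map each token to the set of track ids whose metadata contains it."""
    index = defaultdict(set)
    for track_id, text in tracks.items():
        for token in tokenize(text):
            index[token].add(track_id)
    return index


def search(index, query):
    """Return track ids ordered by how many distinct query tokens they share."""
    hits = defaultdict(int)
    for token in set(tokenize(query)):
        for track_id in index.get(token, ()):
            hits[track_id] += 1
    return sorted(hits, key=lambda tid: (-hits[tid], tid))


# Sample index entries: artist, album and title joined into one string.
tracks = {
    1: "Portishead Dummy Glory Box",
    2: "Portishead Dummy Sour Times",
    3: "Massive Attack Mezzanine Teardrop",
}
index = build_index(tracks)

# A badly formatted filename still shares enough tokens to find its track.
results = search(index, "05_portishead-glory_box(dummy).mp3")
print(results)
```

Note that nothing here decodes audio: the lookup works on strings alone, which is also why supporting a new file format only requires reading its tags.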

This approach also has the following downsides:

  1. It will no longer support identifying completely anonymous files. Files that have no ID3 tags and are named test1.mp3, test2.mp3 will simply not stand a chance at identification. I realize that there is great romance associated with this concept, but in reality most people have files that have some metadata in them, and thus will stand a good chance of being identified.
  2. You will need to download a 250MB Lucene index to tag your collection. This is a pretty big hurdle, but if BitTorrent can routinely help people download 650MB movies off the net, it should help us distribute our search indexes. After the first release of a Lucene-enabled Picard, we will investigate P2P searching methods that will allow people who have no index to use other people’s indexes (if they allow that).

So, the roadmap looks like this:

  1. Release Picard 0.5.0 in the next few weeks and start putting it on the main page as an alternative to the MB tagger.
  2. Release Picard 0.6.0 with full Lucene support and offer that as the main tagging solution for MB.
  3. When TRM usage drops because of adoption of Picard 0.6.0, we will start phasing out TRM.

There you have it: that’s the current state of TRM and how we hope to solve the problems that it presents us with.

Bad news: Picard on OS X

In the last few days I’ve been playing around with Picard on OS X. After fixing a few bugs in libtunepimp that prevented it from compiling on OS X, I managed to get Picard to come up. However, there are so many UI bugs that it is essentially unusable:

  1. Drag and drop does not work
  2. Some options dialog items won’t un/check
  3. Adding files from Add Files dialog doesn’t work
  4. The UI is butt-ugly

This is the same code that has undergone a fair amount of debugging on Windows and Linux. Given that the code works fairly well on those two systems, I have to suspect the wxWidgets toolkit on Mac OS X. I looked into a number of the UI bugs listed above, only to be stumped on multiple occasions. The code looks ok and works great on two platforms. No amount of tweaking the code allowed me to make any headway on any of the bugs.

My conclusion: wxWidgets on OS X, even the 2.6.x version, is still not ready for prime time. Thus, I’m sad to say, Picard won’t be coming to OS X soon. If someone has more experience with wxWidgets on OS X and would like to take a stab at looking at these bugs, please do. At this point I should spend my time on bugs that will make Picard better on the two platforms where there is hope.

I’m bummed. 😦

German mirror online!

After months of tinkering, I’m pleased to announce that MusicBrainz now has a mirror in Germany. The mirror is graciously being sponsored by HousePool Media International Group. Many thanks to Carsten Marmulla for working hard over a number of months to find hardware and bandwidth to support this mirror.

Our two mirrors (.de and .nl) are currently underutilized, but the upcoming release of Picard will have support for tagging off mirror servers. We’ll have to encourage users to use the mirrors for tagging, so that the main server can stay available for people wanting to make changes to the database or vote on pending changes.
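The split described above, read-only tagging lookups on mirrors and edits on the main server, could be sketched roughly as follows. This is not how Picard chooses a server: the function, the request kinds, and the `example.org` hostnames are all hypothetical placeholders.

```python
import random


def choose_server(request_kind, mirrors, main_server, rng=random):
    """Send read-only lookups to a mirror; everything else to the main server."""
    if request_kind == "lookup" and mirrors:
        # Spread lookups across the available mirrors at random.
        return rng.choice(mirrors)
    # Edits and votes change data, so they must go to the main server.
    return main_server


# Placeholder hostnames, not the real mirror addresses.
MAIN = "main.example.org"
MIRRORS = ["de.mirror.example.org", "nl.mirror.example.org"]

print(choose_server("lookup", MIRRORS, MAIN))
print(choose_server("edit", MIRRORS, MAIN))
```

If no mirror is configured, lookups simply fall back to the main server.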

Summer is over!

Well, almost. I’m back from OSCON, Foo Camp, Burning Man and the Future of Music Conference. Traveling was fun, but I’m ready to wait for the not-so-nice weather and cuddle up with a computer and get some serious MusicBrainz work done.

The good news is that the data licensing revenue should start rolling in within a few weeks, which means that I get to keep working on MusicBrainz full time! Full time and paid — at first it won’t be much of a paycheck, but it should pay the bills. Maybe next year we can work towards a full paycheck — we’ll see.

Here is my todo list for the near future:

  1. Whip mirror servers into shape
  2. Sign more license deals
  3. Get the menu server release out the door
  4. Fix AR bugs, improve related artists, hammer out a few new server features.
  5. Release Picard 0.5.0, libtunepimp and libmusicbrainz — all of these desperately need new releases.

Of course there are lots more things on my todo list, but these are the top 5 items. Stay tuned for more info!