I previously mentioned that Lucene rocks. Well, that is not giving it enough credit. I’m working on the guts of a Lucene-enabled Picard tagger, and in doing so I have created a simple script that chews through a given set of mp3 files and attempts to match them up with MusicBrainz.

My friend Vee once gave me a CD full of hip-hop music to give to my GF. I took one look at it and stared in shock! What a mess: few ID3 tags, and mostly no album names at all. Lots of “friends” vs “friendz” problems, with much slang used in inconsistent ways. Ick!

I ran this through the old tagger a while back and it matched roughly 30% of the tracks. I’ve been using this set of files to tune the new tagging engine and once things got cached into memory, it chewed through over 100 files in under 7 seconds:

60% matched: 64 files matched, 41 files with suggestions, 1 files not matched.

60% !! Check the results for yourself!
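For anyone wondering where the 60% comes from, here is a minimal sketch of the arithmetic behind that summary line. The counts are copied from the output above; the variable names and the `percent` helper are made up for illustration and are not taken from the actual script.

```python
# Counts copied from the summary line above.
matched = 64      # files matched outright
suggested = 41    # files with suggestions
unmatched = 1     # files not matched

total = matched + suggested + unmatched


def percent(part, whole):
    """Return part/whole as a percentage, rounded to the nearest whole number."""
    return round(100 * part / whole)


# Hypothetical summary string, only to show the numbers line up.
summary = f"{percent(matched, total)}% matched out of {total} files"
print(summary)
```

The three counts add up to 106 files, which fits the "over 100 files" figure, and 64 of 106 rounds to 60%.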

And of the 41 files that have suggestions, at least 80% have the correct match among the top 3 closest matches. I’m floored: it works so well, and there are a number of improvements still left to make. The downside? You need the 700MB Lucene index on your hard drive. That’s going to be more than 250MB to download. 😦 I’ll have to work out the right combination of BitTorrent, caching, and P2P solutions to tackle that minor issue.

But this is really stunning!

5 thoughts on “Lucene based tagging update”

  1. Awesome, can’t wait until the new tagger is in a fairly usable state. I’ve been putting off tagging my music (and therefore putting off installing NetJuke) for quite a while. Appreciate the work.

  2. Well, I have an 80 gig or so hard drive and BitTorrent installed, and I am willing to host a 1 gig file for download over Kazaa and BitTorrent.

    If and when you have a BitTorrent file available, tell me and I’ll get it. I have broadband and I’m pretty much online 24/7.

    If that helps at all, I mean 🙂


