Lucene based tagging update

I previously mentioned that Lucene rocks, but that is not giving it enough credit. I’m working on the guts of a Lucene-enabled Picard tagger, and in doing so I have created a simple script that chews through a given set of mp3 files and attempts to match them up with MusicBrainz.

My friend Vee once gave me a CD full of hip-hop music to give to my GF. I took one look at it and stared in shock! What a mess: not many ID3 tags, mostly no album names at all. Lots of “friends” vs “friendz” problems, with much slang used in inconsistent ways. Ick!

I ran this through the old tagger a while back and it matched roughly 30% of the tracks. I’ve been using this set of files to tune the new tagging engine and once things got cached into memory, it chewed through over 100 files in under 7 seconds:

60% matched: 64 files matched, 41 files with suggestions, 1 files not matched.

60% !! Check the results for yourself!
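For anyone checking the arithmetic, here is a minimal sketch of how a summary line like the one above can be derived from the three counts. The function name `summarize` is made up for illustration, and whether the original script rounds or truncates the percentage is an assumption (both give 60 for these numbers).

```python
def summarize(matched, suggested, unmatched):
    """Build a one-line report from the three result counts.

    The percentage is the share of files matched outright, rounded
    to the nearest whole number (rounding is an assumption).
    """
    total = matched + suggested + unmatched
    percent = round(100 * matched / total) if total else 0
    return (
        f"{percent}% matched: {matched} files matched, "
        f"{suggested} files with suggestions, {unmatched} files not matched."
    )


# 64 of 106 files is about 60.4%, which rounds to 60.
print(summarize(64, 41, 1))
```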

Of the 41 files with suggestions, at least 80% have the correct match among the top 3 closest matches. I’m floored: it works so well, and there are a number of improvements still left to make. The downside? You need the 700MB Lucene index on your hard drive. That’s going to be more than 250MB to download. 😦 I’ll have to work out the right combination of BitTorrent, caching, and P2P solutions to tackle that minor issue.
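To illustrate what a "top 3 closest matches" list looks like for a messy title, here is a small sketch. It is not how Lucene scores results: Python's standard `difflib` stands in for the real index, and the function name `top_suggestions` and the candidate titles are made up for the example.

```python
from difflib import get_close_matches


def top_suggestions(title, candidates, n=3, cutoff=0.6):
    """Return up to n candidate titles that most resemble `title`.

    Comparison is case-insensitive; results keep their original casing
    and are ordered from most to least similar.
    """
    by_lower = {c.lower(): c for c in candidates}
    hits = get_close_matches(title.lower(), list(by_lower), n=n, cutoff=cutoff)
    return [by_lower[h] for h in hits]


# A slangy spelling still lands on the intended title first.
suggestions = top_suggestions("friendz", ["Friends", "Fiends", "Enemies", "Friendship"])
print(suggestions)
```

If the correct title shows up anywhere in that short list, the user only has to pick it rather than search for it by hand.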

But this is really stunning!

Lucene web service

In the last two weeks I managed to combine working on MusicBrainz, creating a new open source project and earning money to pay the bills! This is quite rare these days, so I am pleased all around.

As some of you may know, I have been doing contract work for CD Baby. When Derek, the owner and lead geek at CD Baby, asked me what MusicBrainz does for searching, I launched into a long cheerleading rant about Lucene. I managed to convince Derek that Lucene is the way to go, and to sponsor the open source development of the new Lucene Web Service. Luckily, Derek agreed to open source the work as long as the project would be available under the BSD license.

Triple cheers for Derek and CD Baby please!

So, the web service is now done and I’ve applied for a new project on SourceForge — once that is approved, I will release the source code for everyone to check out. I’ll post another message here when that is complete.

If you’d like to check out the working web service, try this link.

Blog spam attack

In the last 24 hours this blog has been attacked by nasty porn spammers. I’ve erased dozens of spam comments and trackbacks. 🙂

I’ve turned off trackbacks for now ( 😦 ) and I will add comment captchas later today to try and stop this crap. Anyone have a good suggestion on how to manage trackback ping spam?

Non-profit application filed

At long last, after many hours of work and months of time passing, I’ve FedEx’ed off the 1023 Application to the IRS. The 1023 form is the tax-exempt application to the IRS — once we get an advance ruling on our status (the final ruling will come many months down the road) we’ll be able to conduct business as a real non-profit and start handing out tax-deductible receipts for the donations we receive for MusicBrainz.

I feel confident that we’re well set up, mainly due to the excellent guidance I’ve received from Randy Heinig at Barack Ferrazzano Kirschbaum Perlman & Nagelberg LLP in Chicago. Thank you very much Randy for all your patience, hard work and thorough understanding of what MusicBrainz does!

Hopefully this will also mark the point where I can spend a little less time on the non-profit and start hammering out more code for MusicBrainz — it’s sorely needed. But before I dream of that, we still need to announce this new venture. Stay tuned for more on that!

MusicBrainz Pub Night

I’d like to announce a MusicBrainz Pub Night in London Nov 30th at 6:30pm.

If you’d like to meet Dave Evans and me in person, chat about MusicBrainz and have a pint, please come join us. I haven’t picked out a pub yet, so if you have a suggestion for a good pub (should be a classic pub, large enough to handle 6 to 10 of us, not too noisy), please leave a comment.

So, who can make it?

Lucene enabled Picard

In the last couple of days I’ve stuffed Lucene into Picard and it has given me some quite amazing results. I’ve opened a collection of untagged files and watched it open the right albums and populate them with tags automatically. Mind you, none of the files were previously tagged with MB IDs. Plain amazing!

I have this hip-hop compilation that my friend put together and it’s utter crap: duplicates, many files without tags, crappy spelling and mostly from greatest hits albums. Ick. The original tagger identified less than 15% of the tracks. The new tagger identifies 50% to 60% of the tracks, which is a really good rate for this crappy collection.

I did mention that Lucene rocks, right?

I decided that I wanted to put together a comprehensive test of Lucene so I could show how powerful, fast and accurate Lucene is. This is just a simple Python script that is not integrated with the rest of MusicBrainz; it doesn’t even touch the Postgres DB!

My little test is hosted on the staging server, and the DNS should’ve propagated by now. Come check it out:

http://search.musicbrainz.org

Lucene rocks!

I’ve been playing with the Lucene text indexing system (in particular, I’m playing with PyLucene, which is a GCJ-compiled version of Lucene with Python bindings). Lucene does text searching really well and it’s fast!

Eventually I’d like to use Lucene to power the MusicBrainz searches as well as building a copy of it into Picard. Picard? Yes! Lucene is so good that you can give it a track title and chances are it’s going to find the right track. My idea is this:

  1. Cluster new files and determine which artists these files cover.
  2. Download and cache the metadata for the artists locally, and build a Lucene index of it.
  3. Throw each of the tracks at Lucene to see what it can match.
  4. If nothing matches, maybe do a full DB search via the web service or do a TRM calculation.
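
The four steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: a plain dict and Python's `difflib` stand in for the local Lucene index, and every name here (`cluster_by_artist`, `build_local_index`, `match_track`, `tag_files`, `fetch_metadata`, `remote_search`) is hypothetical, not part of Picard or PyLucene.

```python
from collections import defaultdict
from difflib import get_close_matches


def cluster_by_artist(files):
    """Step 1: group files by the lower-cased artist name in their tags."""
    clusters = defaultdict(list)
    for f in files:
        clusters[f.get("artist", "").strip().lower()].append(f)
    return dict(clusters)


def build_local_index(artists, fetch_metadata):
    """Step 2: fetch and cache track titles per artist.

    A dict of title lists stands in for the local Lucene index.
    """
    return {artist: fetch_metadata(artist) for artist in artists if artist}


def match_track(title, artist, index, cutoff=0.8):
    """Step 3: look the title up among the cached titles for this artist."""
    by_lower = {t.lower(): t for t in index.get(artist, [])}
    hits = get_close_matches(title.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


def tag_files(files, fetch_metadata, remote_search):
    """Run steps 1-4 and return {path: matched title or None}."""
    clusters = cluster_by_artist(files)
    index = build_local_index(clusters, fetch_metadata)
    results = {}
    for artist, members in clusters.items():
        for f in members:
            match = match_track(f.get("title", ""), artist, index)
            if match is None:
                # Step 4: fall back to a slower, wider lookup (a full DB
                # search via the web service, or a TRM calculation).
                match = remote_search(f)
            results[f["path"]] = match
    return results


# Toy data: one misspelled but tagged file, one file with no tags at all.
catalog = {"some artist": ["My Friends", "Another Song"]}
files = [
    {"path": "01.mp3", "artist": "Some Artist", "title": "my friendz"},
    {"path": "02.mp3", "artist": "", "title": ""},
]
result = tag_files(files, lambda a: catalog.get(a, []), lambda f: None)
print(result)
```

Keeping the fallback as a separate callable means the cheap local lookup always runs first and the expensive remote search is only paid for files the local index cannot place.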

I’m excited by this; the proof of concept looks fabulous. Executing it at full scale, where things are getting cached and locally indexed, is going to be a fair amount of work. Unfortunately.

But this gives me hope that Picard will have some serious brainz under the hood. 🙂

First tax-exempt application filed

I’ll jump in right now and update you on my progress.

I just dropped the FTB3500 tax-exempt application to the State of California into the mail. This application is one of the two big ones that took many weeks of preparing and creating budget forecasts for the next two years. Budgets are not my strength, but our Treasurer helped me with this process and we got it done. Next up is the biggest and most dreaded form — the 1023 application to the IRS.

I’ve also got the first cut of the MetaBrainz web site created. This site will detail everything about the non-profit including all donations and finances, board of directors and other non-profit stuff. Of course the new web site is not going to be public until we’re ready to announce every last detail of the new non-profit. Stay tuned!

Oh, yeah — I also created this blog this week. Maybe tomorrow I can start hacking on advanced Picard features.

Welcome to the MusicBrainz community weblog!

We recently started discussing setting up a blog for MusicBrainz contributors to post information about the work they are doing in order to keep the community up to date. Having gotten no negative feedback on the idea, I proceeded to make it happen.

I really like Movable Type. It’s a great piece of software, and SixApart agreed to donate a license for Movable Type, which is a $249.95 value! We can now publish this blog and add up to 35 users to it. I think that should suffice for the immediate future. 🙂

Thank you very much to Mena, Ben, Mie and Barak at SixApart! You guys rock and so does your software!

If you are a MusicBrainz contributor and would like to get an account to post to this weblog, please send me some mail and I’ll set it up.