Summer of Code log analysis project: May we share our data with our GSoC student?

UPDATE: This is clearly going to be a major hassle, so we’ll spend the extra time coding a program that will sanitize the data before it goes into Splunk.

Last week Google’s Summer of Code program started, and my student Dániel Bali is ready to get busy combing through our massive logs to see what sorts of information he can mine from them.

We only have one minor problem: our logs contain the IP addresses of our users, and some requests contain the user name of the person making the request. Removing this private information from the logs before Dániel sees them is quite a pain to do well.

I would like to propose that we:

  1. Consider Dániel part of our core team for the summer and allow him to see IP addresses and all the requests in full.
  2. Have Dániel sign a short statement stating that he will not divulge any private information.
  3. Fail him in his GSoC project if he does divulge any private information.

If this is not acceptable to you, please speak up soon. I would like to make this happen early next week so Dániel can continue his GSoC work.

UPDATE: The final output of Dániel’s work will not contain any private information. If we end up using any private data as input, we will sanitize it and remove private information before we publish the output.
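
As a rough illustration of what such a sanitizing step might look like, here is a minimal Python sketch. The log line layout, the `user=` query parameter and the salt are all assumptions made for the example; our real logs and the eventual program may look quite different.

```python
import hashlib
import re

# Hypothetical secret salt; a real deployment would keep this out of the code.
SALT = b"example-salt"

# Matches dotted-quad IPv4 addresses (IPv6 is not handled in this sketch).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Assumed format for user names in request URLs: a "user=<name>" parameter.
USER_RE = re.compile(r"(\buser=)([^&\s]+)")


def pseudonym(value: str) -> str:
    """Return a short, stable token: the same input always gives the same output."""
    digest = hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()
    return digest[:12]


def sanitize_line(line: str) -> str:
    """Replace IP addresses and user names in one log line with pseudonyms."""
    line = IPV4_RE.sub(lambda m: "ip-" + pseudonym(m.group(0)), line)
    line = USER_RE.sub(lambda m: m.group(1) + "user-" + pseudonym(m.group(2)), line)
    return line
```

Because the tokens are stable, requests from the same address can still be grouped during analysis without exposing the address itself. A salted hash of an IPv4 address is only a pseudonym, though, so the salt would have to stay secret.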

Our forums are back!

Our forums were compromised a while ago and we had to undergo massive yak shaving in order to set up a new home for them. Hosting many different types of software on one server makes that server hard to administer, so we felt that the proper solution was to create a virtual server host and give each type of software a new Linux instance to live in. We started with the forums, but we’re going to be moving a lot more stuff over to this virtual host in the coming weeks/months. Hopefully we can update our blog and our wiki in this process.

In any case, the forums are back online with the old posts, now running the latest version of PunBB. In the move we lost our customized MusicBrainz theme for PunBB. If someone feels strongly about having the theme, please take a look at the PunBB docs and create a new theme. I’ll be glad to install that new theme on our server.

Thanks and sorry the forums were offline for so long!

Server update, 2012-05-28

We’ve just finished pushing an update to our web servers. This is our first release since the schema change, and we’ve tried to address the problem of artist landing pages. As a temporary solution, we’ve split the page up by type a bit more now, which we hope is a step in the right direction. We’re currently discussing this at User:Reosarevok/Overview Options on the wiki, and your feedback is important. If you feel strongly about this, please have a read of the ideas on that page and feel free to comment or add your own.

This release features work from Ian McEwen, nikki, Joachim LeBlanc, Nicolás Tamargo and the rest of the MusicBrainz team. Thanks everyone for your hard work!

Bug

  • [MBS-1121] – Disabled submit buttons have no distinctive style
  • [MBS-3861] – IMDb links not fully normalized
  • [MBS-4278] – Copy changes to recordings edits the wrong recordings if recording associations were changed
  • [MBS-4336] – Release editor > removing a track resets manually changed track positions
  • [MBS-4520] – Deleting tracks does not update track numbers
  • [MBS-4569] – Full page titles aren’t shown in /doc
  • [MBS-4621] – Don’t use empty <img> tags on pages with no images
  • [MBS-4649] – YouTube channel autoselect is broken
  • [MBS-4656] – “Video” option for “can be streamed for free at” relationship is listed under the “License” subheading.
  • [MBS-4662] – ISE: Can’t edit a relationship attribute
  • [MBS-4664] – Use of uninitialized value in sprintf at lib/DBDefs.pm line 314.
  • [MBS-4681] – Edit maked as Applied, but it isn’t
  • [MBS-4687] – Link to create new relationship types doesn’t work
  • [MBS-4697] – Approve votes missing from edit search
  • [MBS-4720] – Audiobook is a primary release-group type
  • [MBS-4733] – Merging release groups fails if release groups have secondary types
  • [MBS-4752] – Musicbrainz webservice <iswc-list> changes break compatibility with existing applications.
  • [MBS-4757] – Edit artist alias (sort name) is auto-edit
  • [MBS-4760] – Release groups with secondary types cannot be deleted
  • [MBS-4765] – DB_SCHEMA_SEQUENCE hasn’t been updated to 15 in DBDefs.pm.default
  • [MBS-4767] – Cannot accept edit release group edits that change the primary type to a type that no longer exists
  • [MBS-4769] – Bad description row in statistic_event
  • [MBS-4770] – ISE: Error when requesting non-existent relationship types on /relationships
  • [MBS-4772] – ModBot cannot apply old edit work edits that add ISWCs

Improvement

  • [MBS-684] – TOC lookup displays too little release info
  • [MBS-1874] – Search for the documentation
  • [MBS-3748] – Adding new instruments is a pain
  • [MBS-4298] – Confusing text when merging releases+recordings
  • [MBS-4561] – Make disambiguation on tracklist credits smaller
  • [MBS-4568] – Add <bdi> tags to help with rendering of RTL text
  • [MBS-4657] – Add cover art to timeline
  • [MBS-4700] – Fix inline buttons
  • [MBS-4742] – Add a mention to the “Split Into Separate Artists” page that aliases prevent a split (or automatic removal)
  • [MBS-4762] – Cover art statistics should be displayed on the tabular pages.

Task

  • [MBS-4686] – Add wikisource.org to the lyrics whitelist

In other news, Oliver will be on holiday for one week (yipee!). He is reachable via email if need be.

The commit SHA for this release is 8fbbc36, and the Git tag is v-2012-05-28.

Post schema change fix for importing clean data

Yesterday we found a bug that prevents the import of a post schema change update data set. We’ve pushed out a fix for this and tagged it with:

v-2012-05-15-import-fix

If you’re planning on importing a new data set, make sure to check out this tag, rather than the tag mentioned in this entry.

Search server release: 2012-05-15

In case you haven’t gotten enough of release announcements, we have another one for you. Yesterday, during the main release, we also released a new search server to match the main server release. Many thanks to Paul Taylor for timing this release perfectly!

UPDATE: The search server and the MMD schema repositories have been tagged with this tag:

release-2012-05-15

Bug

  • [SEARCH-198] – The artist is getting a lowered score on MBS
  • [SEARCH-199] – Search includes empty annotations
  • [SEARCH-200] – Search on release giving to much boost to matches on CatalogNo
  • [SEARCH-201] – explain option doesnt work if search results contain non ISO-8859-1 characters
  • [SEARCH-216] – Null pointer exception when building freedb

Improvement

  • [SEARCH-157] – Be able to search for a track by its metadata OR its puid
  • [SEARCH-186] – Search Server has hard coded redirect URL
  • [SEARCH-187] – Update Junit Test from 3 to 4
  • [SEARCH-202] – Allow searching for RGs based on their releases’ status
  • [SEARCH-204] – Upgrade codebase to Lucene 3.6
  • [SEARCH-214] – Add release group ID to the web service indexed search results for recordings

New Feature

  • [SEARCH-205] – Search server should return multiple ISWCs for works
  • [SEARCH-207] – Changes due to introduction of ISO-3 language code
  • [SEARCH-208] – Chnages due to Split release group attributes into two types Schema Change
  • [SEARCH-209] – Support for Multiple IPI Artists
  • [SEARCH-211] – Support for new Track ‘Number’ field in a track
  • [SEARCH-212] – Add abiility to index, display and search works by lyrics language as part of schema change
  • [SEARCH-213] – Changes due to MBS-1385:Support unknown end dates

Announcing libmusicbrainz releases 4.0.3 and 5.0.1

Regrettably, a couple of errors were found close to the release of v4.0.2 and v5.0.0. I have just released v4.0.3 and v5.0.1 with the following changes:

– Fix LMB-32 – Correctly ignore unrecognised nodes
– Don’t compile using -Werror when building from tarball

The releases are available:

libmusicbrainz-4.0.3.tar.gz
(MD5 checksum: 19b43a543d338751e9dc524f6236892b)

libmusicbrainz-5.0.1.tar.gz
(MD5 checksum: a0406b94c341c2b52ec0fe98f57cadf3)
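
To check a downloaded tarball against the checksums above, something along these lines should work. This is a minimal sketch using Python's standard `hashlib`; the function names are made up for the example, and it assumes the tarball is in the current directory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with a published checksum."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, `matches_checksum("libmusicbrainz-5.0.1.tar.gz", "a0406b94c341c2b52ec0fe98f57cadf3")` should return `True` for an intact download.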

Documentation for the new version is available under

http://metabrainz.github.com/libmusicbrainz/

Apologies to all for the need to make this release so soon after the last one.

Andy

Schema change server update, 2012-05-15

Nearly one year after we released NGS, we have another schema change update with lots of new features!

This release contains 9 new features and improvements that take advantage of the new schema. These are:

  • More social user profiles, which can now have Gravatars, languages (and the user’s proficiency), age and country.
  • More expressive aliases for artists, labels and works. Aliases can now have types and sort names, and multiple aliases may be used per locale, along with the ability to mark one alias as ‘primary’ for that locale.
  • Release group types have been separated into primary and secondary types. A release group now has 1 primary type and may have multiple secondary types. This allows us to have ‘remix compilation albums’, for example.
  • Works may have multiple ISWCs
  • Artists, labels and relationships may be marked as ‘ended’ to indicate that they have ended, but the exact date is not known
  • Vinyl style/free text track numbers are now supported.
  • Works may have a lyrics language associated with them
  • Artists and labels may have multiple IPIs
  • We have moved to use ISO 639-3 for our language table. While not all languages are exposed at the moment, this gives us a lot more flexibility going forward.
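
To make the primary/secondary split for release groups a little more concrete, here is a small Python sketch of the idea. The class and field names are invented for illustration and do not reflect the actual database schema or server code.

```python
from dataclasses import dataclass, field
from typing import FrozenSet


@dataclass(frozen=True)
class ReleaseGroup:
    """Illustrative model: exactly one primary type, any number of secondary types."""

    name: str
    primary_type: str
    secondary_types: FrozenSet[str] = field(default_factory=frozenset)

    def describe(self) -> str:
        # Secondary types are listed alphabetically, followed by the primary type.
        parts = sorted(self.secondary_types) + [self.primary_type]
        return " + ".join(parts)


# The 'remix compilation album' case mentioned in the list above.
example = ReleaseGroup(
    name="Example Release Group",
    primary_type="Album",
    secondary_types=frozenset({"Remix", "Compilation"}),
)
```

Here `example.describe()` gives `"Compilation + Remix + Album"`, which a single type field could not express.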

Many thanks to nikki for going way beyond our expectations for testing (and patience!); to Ian McEwen for his continued work on statistics; and to the MusicBrainz team for making this all happen.

If you have a replicated instance of MusicBrainz, please follow these instructions to get your server running on the new schema:

  1. Take down the web server running MusicBrainz, if you’re running a web server.
  2. Turn off cron jobs if you are automatically updating the database via cron jobs.
  3. Make sure your REPLICATION_TYPE setting is RT_SLAVE
  4. Switch to the new code with git fetch origin followed by git checkout v-2012-05-15-schema-change
  5. Run carton install --deployment. If you have not switched your installation to using carton, please read INSTALL.md on how to do this.
  6. Run carton exec -- ./upgrade.sh from the top of the source directory.
  7. Set DB_SCHEMA_SEQUENCE to 15 in lib/DBDefs.pm
  8. Turn cron jobs back on, if needed.
  9. Restart the MusicBrainz web server, if needed.
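
If you would rather script steps 4 to 6 than type them by hand, a sketch like the following may help. It only wraps the commands quoted above; the function names and the dry-run default are choices made for this example, and the remaining steps (web server, cron jobs, REPLICATION_TYPE and DB_SCHEMA_SEQUENCE) still have to be done manually.

```python
import subprocess
from typing import List

SCHEMA_CHANGE_TAG = "v-2012-05-15-schema-change"


def upgrade_commands(tag: str = SCHEMA_CHANGE_TAG) -> List[List[str]]:
    """Return the commands for steps 4 to 6, in order."""
    return [
        ["git", "fetch", "origin"],
        ["git", "checkout", tag],
        ["carton", "install", "--deployment"],
        ["carton", "exec", "--", "./upgrade.sh"],
    ]


def run_upgrade(source_dir: str, dry_run: bool = True) -> List[str]:
    """Print each command and, unless dry_run is set, run it in source_dir.

    With dry_run=False, check=True makes the script stop at the first
    command that fails instead of carrying on with a half-upgraded tree.
    """
    executed = []
    for command in upgrade_commands():
        printable = " ".join(command)
        print(printable)
        if not dry_run:
            subprocess.run(command, cwd=source_dir, check=True)
        executed.append(printable)
    return executed
```

Calling `run_upgrade("/path/to/musicbrainz-server")` only prints the commands; pass `dry_run=False` once the output looks right. The path is a placeholder for the top of your source directory.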

If you are running an mbslave mirror, check out the latest code and read the upgrade instructions in the README file.

Bug

  • [MBS-3189] – Remove unused ref_count column and related functions
  • [MBS-4616] – Add work language statistics
  • [MBS-4629] – /cover-art page shows no collections
  • [MBS-4637] – Timeline graph won’t graph anything without an entry in statistics/view.js
  • [MBS-4640] – Clicking cover art opens box with “����” (4 U+FFFD)
  • [MBS-4642] – Thickbox CSS interferes with MB CSS
  • [MBS-4647] – Cover art page allows submitting edit with no cover art when JS is off
  • [MBS-4648] – Changing cover art type from “other” to unset causes Internal Server Error
  • [MBS-4678] – upgrade.sh is not ready for testing
  • [MBS-4679] – Internal server error adding secondary types to a release

Improvement

  • [MBS-1485] – Alias types
  • [MBS-1798] – Lyrics language for works
  • [MBS-1799] – Add ISO 639-3 language codes to the database
  • [MBS-1981] – Add blog feed to the home page
  • [MBS-2240] – Aliases: certain locale can be used only once in the list of aliases
  • [MBS-2532] – Allow more than one IPI per artist
  • [MBS-2851] – Timeline graph events should be in the database
  • [MBS-2885] – Allow more than one ISWC per work
  • [MBS-3646] – Split release group attributes into two types
  • [MBS-3788] – Alias improvements
  • [MBS-4625] – Improve wording of cover art tab when cover art comes from relationships
  • [MBS-4676] – Do not allow people entering deprecated relationships

New Feature

  • [MBS-842] – Allow vinyl style track numbers and sides
  • [MBS-1385] – Support unknown end dates
  • [MBS-3704] – Allow adding sort names to artist aliases
  • [MBS-4337] – Make user profile more social: add (optional) fields avatar, gender, birth year, country

Announcing libmusicbrainz4 releases 4.0.2 and 5.0.0

Andy Hawkins says:

Hi,

I am pleased to announce two new versions of libmusicbrainz:

libmusicbrainz-4.0.2 has been updated to take account of changes made to the server on 15th May 2012. Some interfaces are now marked as deprecated, as they have been extended during the work.

Full documentation is available here:

http://metabrainz.github.com/libmusicbrainz/4.0.2/

The release can be downloaded here:

https://github.com/downloads/metabrainz/libmusicbrainz/libmusicbrainz-4.0.2.tar.gz

(MD5 checksum 5ff62abeca00fdad1bb3a8f99065ae61)

libmusicbrainz-5.0.0 has been introduced to enable the library to be more easily included in Debian due to a conflicting package name. It is identical to libmusicbrainz-4.0.2, with the following exceptions:

1. All include files are now in the musicbrainz5 directory

2. You should now link against libmusicbrainz5 (-lmusicbrainz5)

3. All previously deprecated functions have been removed.

Please note that all future work is likely to only occur on the 5.x library, so this should be used wherever possible.

Full documentation is available here:

http://metabrainz.github.com/libmusicbrainz/5.0.0/

The release can be downloaded here:

https://github.com/downloads/metabrainz/libmusicbrainz/libmusicbrainz-5.0.0.tar.gz

(MD5 checksum 3396e0c66cfacfa1f32abc7cfdbcbe13)

As ever, please report any issues in JIRA at

http://tickets.musicbrainz.org

under the project ‘libmusicbrainz’.

If you have any questions, please post them in the musicbrainz-devel mailing list. I will also attempt to be available on the #musicbrainz-devel IRC channel on Freenode.

Search server index updating paused

Tomorrow’s release requires us to update the main server and the search servers at the same time. This presents a bit of a chicken-and-the-egg problem: We need to build new indexes even before we’ve migrated the database to our new schema.

To accomplish this, I’ve stopped index updating and created a separate database that will allow me to build some indexes that will be a few hours out of date when we release tomorrow. They will be slightly old, but at least we will have indexes that work.

I expect the indexes to be up to date about 3 hours after the release is complete. If you find the indexed search out of date, please use the direct search in the meantime.

Sorry for any inconvenience this may cause.