Summer of Code: We're in for another round!

I’ve not had a chance to blog about our participation in Google’s Summer of Code program this year, so it is time to fix this now. As you might guess, we’ve been accepted into the program again and were given 3 slots. We awarded the slots to:

  • Rearchitect/Improve the Release Editor by Michael Wiencek (bitmap): This proposal aims to re-work the guts of our Release Editor and to change the architecture to use one page and not a series of pages. This project is potentially massive, so the goal is to work on the guts of the editor while not making many (if any!) changes to the UI. But bitmap is a veteran GSoC student and long-time Picard contributor, so we’re excited to have him back!
  • MBS-6200: Add a “place” entity by Nicolás Tamargo (reotab): Our very own Reosarevok joins the GSoC ranks to implement Places support. In our previous schema change release we added support for Areas, and Reotab aims to finish this project by implementing Places. For more discussion and background on Areas and Places, please see this ticket in Jira.
  • Repository for music reviews by Maciej Czerwiński (mjjc): The goal of this project is to create a site that allows anyone to write a non-neutral point of view review of an artist, a release or a recording. All of the reviews in this site will be licensed under a Creative Commons license to be compatible with MusicBrainz and its data.

I’m really excited by all of these projects and the people who are contributing. Summer of Code started yesterday, so we’ll see very soon what our three students will accomplish.

Search server update: June 13

On 13th June we updated the search servers once more. Thanks for fixing bugs and adding Area support, Paul!

Release Notes – MusicBrainz Search Server – Version 2013-06-13

Bug

  • [SEARCH-297] – Webservice Json output for aliases when searching is inconsistent with output when doing a lookup
  • [SEARCH-302] – search server json output use singular for a list of release-groups.

Improvement

  • [SEARCH-292] – Include area info in the indexed search artist and label results
  • [SEARCH-299] – Output TrackIds

New Feature

  • [SEARCH-301] – Search for Area by ISO 3166 code

Task

  • [SEARCH-273] – Support for multiple country/release events on release as part of schema changes
  • [SEARCH-286] – Add areas to the indexed search

Urgent schema update required

On Friday 24 May, 2013 at 15:00 UTC we’re going to make an urgent schema update to fix a problem that occurred during our schema change last week. Please read this whole blog post carefully!

This update will not make any changes to the schema, but it will fix some data issues that have appeared on slave servers.

We apologize profusely for these problems. We’re working hard to rectify them, and we’re going to improve our processes so that future releases do not run into the same issues.

What went wrong

Due to a misunderstanding of our database system, the ‘track’ table will be corrupted on the majority of replicated slave databases after the schema 17 migration. Specifically, depending on the internal choices of a given PostgreSQL installation’s query planner and other system details, any particular server can end up with one of a variety of incompatible permutations of the track table, in which ‘id’/‘gid’ pairs will generally point to the incorrect track data. Unfortunately, this problem is compounded by replication, which is based on the ‘id’ column. Therefore, replication packets since the schema change are likely to have deleted and modified the incorrect rows of the ‘track’ table on slaves.
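To see why id-based replication makes a permuted table worse, here is a toy sketch in Python. The rows, the names, and the packet format are all invented for illustration and are not our actual replication code. The only point is that the same “delete id N” instruction removes a different track when the id-to-data mapping differs between servers.

```python
# Toy model: each server maps a track row id to the track data stored under it.
# All rows and the packet format below are hypothetical.
production = {1: "Intro", 2: "Verse", 3: "Outro"}

# A slave whose migration assigned the same ids to the rows in a different order.
slave = {1: "Verse", 2: "Outro", 3: "Intro"}


def apply_packet(table, packet):
    """Apply id-keyed changes to a copy of the table and return the copy."""
    result = dict(table)
    for op, row_id, value in packet:
        if op == "delete":
            result.pop(row_id, None)
        elif op == "update":
            result[row_id] = value
    return result


# Production deletes "Verse" (id 2 there) and ships the change by id only.
packet = [("delete", 2, None)]

production_after = apply_packet(production, packet)
slave_after = apply_packet(slave, packet)

print(sorted(production_after.values()))  # the intended result
print(sorted(slave_after.values()))       # the slave lost a different track
```

In this toy case production keeps “Intro” and “Outro”, while the slave keeps “Intro” and “Verse”. Every further packet applied by id widens the gap.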

How are we fixing it

In order to ensure that no slaves continue to replicate incompatible changes, we are incrementing the schema number again to 18, which will force operators of slave servers to intervene appropriately. To ensure that slaves have a correct version of the ‘track’ table, we are providing an upgrade script that will download an exported snapshot of the production server’s ‘track’ table at a known point and import it, as well as correct some smaller issues. By importing this snapshot, slave servers will be reset to a correct version of this table and replication can continue.
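The effect of bumping the schema number can be sketched as a guard in a replication loop. This is a minimal illustration in Python, not the actual replication code: the function name, the exception, and the idea that each packet carries the schema number it was generated under are assumptions made for the example.

```python
class SchemaMismatch(Exception):
    """Raised when a packet was produced under a different schema number."""


def check_packet(local_schema, packet_schema):
    """Refuse to apply a packet unless both sides agree on the schema number.

    Both arguments are plain integers here; how a real server stores or
    transmits them is outside the scope of this sketch.
    """
    if packet_schema != local_schema:
        raise SchemaMismatch(
            f"packet is schema {packet_schema}, database is schema {local_schema}; "
            "run the upgrade script before replicating further"
        )
    return True


# A slave still on schema 17 stops at the first schema 18 packet.
try:
    check_packet(local_schema=17, packet_schema=18)
    stopped = False
except SchemaMismatch:
    stopped = True

# After the operator runs the upgrade, replication proceeds again.
resumed = check_packet(local_schema=18, packet_schema=18)
```

The hard stop is deliberate in this sketch: a slave that kept applying packets silently would keep modifying the wrong rows.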

Specific step by step instructions on running this upgrade will be in a separate blog post. Watch this space!

What problems may have arisen

  1. In the unlikely case an external program directly references track row ID numbers, or if it uses the newly-added track MBID field (the ‘gid’ column), these will not be correct if they were taken from any server but the production server. If an application stores either of these identifiers in any way, that data should be rebuilt.
  2. Due to the compounding problems from replication, some tracklists will have incorrect information — missing tracks, misnumbered tracks, links to the wrong recordings, wrong durations, and/or wrong track artist credits. Information of this sort that was derived from replicated slaves during the affected period should be regenerated after upgrading.

FAQ about this update

Q: We don’t use the track table, we use recordings. Am I affected?
A: You are not affected if you use recordings directly, i.e., looking up recording information by a stored recording MBID, except if you use track information linked to those recordings (for example, if you create a list of releases a given recording appears on). Since the link between the recording and the release tables is via the ‘track’ table, anything that connects these two entities is likely to be affected.
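As a rough illustration of why this kind of lookup is affected, here is a small Python sketch using an in-memory SQLite database. The tables are heavily simplified stand-ins for the real schema and the sample rows are made up; only the shape of the join matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "release" is quoted because RELEASE is also an SQLite keyword.
    CREATE TABLE recording (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE "release" (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE medium    (id INTEGER PRIMARY KEY, "release" INTEGER);
    CREATE TABLE track     (id INTEGER PRIMARY KEY, recording INTEGER,
                            medium INTEGER, position INTEGER);

    INSERT INTO recording VALUES (10, 'Example Song');
    INSERT INTO "release" VALUES (100, 'Studio Album'), (101, 'Best Of');
    INSERT INTO medium    VALUES (1000, 100), (1001, 101);
    INSERT INTO track     VALUES (1, 10, 1000, 3), (2, 10, 1001, 7);
""")

# Releases a recording appears on: the only path from recording to release
# runs through the track table (and then medium).
RELEASES_FOR_RECORDING = """
    SELECT DISTINCT "release".name
    FROM recording
    JOIN track     ON track.recording = recording.id
    JOIN medium    ON medium.id = track.medium
    JOIN "release" ON "release".id = medium."release"
    WHERE recording.id = ?
    ORDER BY "release".name
"""

releases = [row[0] for row in conn.execute(RELEASES_FOR_RECORDING, (10,))]
print(releases)
```

If the rows in ‘track’ are permuted, the two joins on ‘track’ connect the recording to the wrong mediums, so the list of releases comes out wrong even though the recording row itself is intact.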

Q: How can I tell if any of the tracklists I am using are affected?
A: Due to the random permutation issue, it’s not possible to be 100% sure. However, tracklists that definitely have problems can be identified in two ways: track counts and sequence issues. The former can be tested with a fairly simple query:

SELECT medium.id, medium.track_count, count(track.id) AS track_track_count,
     medium.track_count <> count(track.id) AS counts_differ
FROM medium JOIN track ON track.medium = medium.id
GROUP BY medium.id, medium.track_count
HAVING count(track.id) <> medium.track_count;

Any medium that appears in that query has been affected and its tracklist should not be trusted (select ‘medium.release’ to get release IDs, if that’s your jam). Sequence issues are a more complex query:

SELECT distinct m.id FROM
  (SELECT DISTINCT medium.* FROM
    ( SELECT track.medium, min(track.position) AS first_track, max(track.position)
      AS last_track, count(track.position) AS track_count, sum(track.position)
      AS track_pos_acc
      FROM track
      GROUP BY track.medium) s
    JOIN medium ON medium.id = s.medium
    WHERE first_track != 1 OR last_track != s.track_count OR
        (s.track_count * (1 + s.track_count)) / 2 <> track_pos_acc
    ) m;

(Note: if you only get 10 rows for this query, you’re fine. Those ten are known problems.)
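For anyone who wants to see the two checks behave on known data first, here is a small Python sketch that runs simplified versions of both against an in-memory SQLite database. The tables are cut down to the columns the checks need, the sample rows are invented, and the sequence check drops the join back to ‘medium’, so treat it as an illustration and not a replacement for the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE medium (id INTEGER PRIMARY KEY, track_count INTEGER);
    CREATE TABLE track  (id INTEGER PRIMARY KEY, medium INTEGER, position INTEGER);

    -- Medium 1 is healthy: three tracks numbered 1, 2, 3.
    -- Medium 2 claims three tracks but only two rows survive.
    -- Medium 3 has the right number of rows but a duplicated position.
    INSERT INTO medium VALUES (1, 3), (2, 3), (3, 3);
    INSERT INTO track (medium, position) VALUES
        (1, 1), (1, 2), (1, 3),
        (2, 1), (2, 2),
        (3, 1), (3, 2), (3, 2);
""")

# Track count check: stored track_count versus actual number of track rows.
COUNT_CHECK = """
    SELECT medium.id
    FROM medium JOIN track ON track.medium = medium.id
    GROUP BY medium.id, medium.track_count
    HAVING count(track.id) <> medium.track_count
    ORDER BY medium.id
"""

# Sequence check: positions should run 1..n, so they must start at 1,
# end at n, and sum to n * (n + 1) / 2.
SEQUENCE_CHECK = """
    SELECT s.medium
    FROM (SELECT medium,
                 min(position)   AS first_track,
                 max(position)   AS last_track,
                 count(position) AS track_count,
                 sum(position)   AS track_pos_acc
          FROM track
          GROUP BY medium) s
    WHERE s.first_track <> 1
       OR s.last_track <> s.track_count
       OR (s.track_count * (1 + s.track_count)) / 2 <> s.track_pos_acc
    ORDER BY s.medium
"""

count_mismatches = [row[0] for row in conn.execute(COUNT_CHECK)]
sequence_problems = [row[0] for row in conn.execute(SEQUENCE_CHECK)]
print(count_mismatches, sequence_problems)
```

On this sample data the count check flags only medium 2 and the sequence check flags only medium 3, which is why it is worth running both: the sequence check looks only at the rows present in ‘track’, so a medium whose last tracks have vanished can still look perfectly sequential.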

For more safety, don’t trust anything in the track table that’s been updated since the schema change:

SELECT distinct medium
FROM track
WHERE last_updated > '2013-05-15';

If it’s possible in your application, it’s probably best to throw out any updates to tracklists since 2013-05-15.

Again, we’re sorry for the trouble this update may have caused you!

Search server regressions fixed

Yesterday we pushed out a new version of our search servers to fix some regressions introduced last week. Thanks to Paul Taylor for fixing these bugs so quickly.

Release Notes – MusicBrainz Search Server – Version 2013-05-20

Bug

  • [SEARCH-290] – REGRESSION WS2 RECORDING query returns cropped artist-credit
  • [SEARCH-294] – REGRESSION:Search results no longer include medium-list count attribute
  • [SEARCH-298] – REGRESSION:ws/1 release search seems broken

Improvement

  • [SEARCH-296] – Update README to point to up-to-date mmd-schema repository

Search server release: 2013-05-15

Coinciding with our main server release, we’ve updated our search servers. This version fixes some bugs from the last release and adds support for countries and track ids.

Thanks for your hard work on this release, Paul!

Release Notes – MusicBrainz Search Server – Version 2013-05-15

Bug

  • [SEARCH-236] – Incomplete VA artist credit included for releases in recording search
  • [SEARCH-282] – REGRESSION:Johanne Sebastian Bach is not the first result when search for artist Bach
  • [SEARCH-283] – REGRESSION:"-" is returned instead of an empty list when there are no ISWCs for a work
  • [SEARCH-284] – REGRESSION:"-" is returned instead of an empty list when there are no ISRCs for a recording

Improvement

  • [SEARCH-219] – Include alias sortnames when searching labels or artists
  • [SEARCH-257] – entity search : entity name should have more weight than aliases and artist credits
  • [SEARCH-268] – Add extended alias info to the ws search results
  • [SEARCH-269] – WS searches don’t return aliases that match the artist name

Task

  • [SEARCH-274] – Support for changes to Countries in forthcoming Schema release
  • [SEARCH-285] – Support for TrackIds in forthcoming Schema Release

Schema change release tomorrow, 15 May, 13:00 UTC

Things have been quiet at MusicBrainz in the past few weeks, but don’t let this fool you! We’ve been working hard on getting the schema change release put together. We’re on schedule for releasing this tomorrow at 13:00 UTC, 15 May, 2013.

We’re going to start the release process at 13:00 UTC, but the site may not go down right at that time. We’ll get started once we have all of our ducks in a row. For more updates from us before we start, please follow @musicbrainz on Twitter or join us on IRC in #musicbrainz on irc.freenode.net.

Thanks, and wish us luck for a smooth schema change tomorrow!

Search server fixes released

Last week’s search server release had some bugs that we decided should be fixed sooner rather than later. Paul Taylor rose to the challenge and fixed 4 important bugs and we just finished releasing the updated code. Thanks for your efforts, Paul!

Release Notes – MusicBrainz Search Server – Version 2013-04-04

Bug

  • [SEARCH-279] – Search server returning wrong results
  • [SEARCH-280] – Artist search DAVID BOWIE → FRANZ SCHUBERT (score 100) !? Bowie (score 0)
  • [SEARCH-281] – If set explain=true option with dismax search it actually does a non-indexed search

Improvement

  • [SEARCH-267] – Create new rewrite method for Dismax FuzzySearch

Updated search server now live

We’ve just updated our search servers to the latest version. Thanks to Paul Taylor for his long hard work porting our code to Lucene 4.1. Big thanks also to Murdos and Nikki for helping with this release!

Also, we’ve installed a small gigabit ethernet switch for our search server cluster so that we can move new search indexes around much faster. Hopefully we will see indexes updating in just over 2 hours from now on.

Read on for all the details of this release:

Release Notes – MusicBrainz Search Server – Version 2013-03-29

Bug

  • [SEARCH-239] – Search updater doesn’t update the index last-updated value returned by SEARCH-232
  • [SEARCH-244] – Since the October Schema Change Release, search server is now returning empty join phrases when once doesnt exist , whereas before it didn’t display it all
  • [SEARCH-247] – Runtime Exception: The property or field count on the class org.musicbrainz.mmd2.Medium$TrackList is required to be included in the propOrder element of the XmlType annotation.
  • [SEARCH-263] – Search server applications do not gracefully disconnect from PostgreSQL on termination

Improvement

  • [SEARCH-217] – Allow searching and displaying of folksnomy tags for the release entity
  • [SEARCH-246] – Extend support for searching for blank parameters to ISWC and ISRC

New Feature

  • [SEARCH-228] – Let Dismax Search for Labels search Label Code

Task