MetaBrainz Blog – Page 68 – MetaBrainz Foundation Community Blog

Important Update on Replication

We’ve been informed that some people have noticed their MusicBrainz slave servers have stopped applying replication packets. We’ve tracked the problem down and now have a fix.

How Do I Tell If I’m Affected?

To determine if you are affected, connect to the MusicBrainz PostgreSQL database (we recommend using ./admin/psql from your musicbrainz-server checkout) and run the following:

SELECT now() - last_replication_date FROM replication_control;

If the result is an interval larger than a few days, there’s a good chance you are affected by this bug and should upgrade.
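If you monitor your mirror from a script, the same check can be sketched in Python. This is only an illustration: the function names and the three-day threshold are my own assumptions (the post only says "a few days"), and it assumes you have already fetched last_replication_date as a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_replication_date, now=None):
    """Return how long ago replication last ran.

    last_replication_date must be timezone-aware; `now` defaults to the
    current UTC time and exists mainly to make the function testable.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_replication_date


def looks_affected(lag, threshold=timedelta(days=3)):
    """True if the lag exceeds the (assumed) threshold of three days."""
    return lag > threshold
```

A lag of a few hours is normal for an hourly mirror; only a lag of several days suggests the bug described here.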

How Do I Fix This?

Fixing this is easy: you simply need to update your Git checkout. We suggest the following:

cd musicbrainz-server
git fetch --tags origin
git checkout v-2013-10-14-replication-fix-2

EDIT: formerly used a different (broken) tag and recommended a different method of fetching tags.

When replication next runs, the data that cannot be applied will be rewritten and replication will continue as normal.

Using cakes for social engineering

For the past few years we’ve had some accounting difficulties with one of our customers: Amazon. I have no idea how their accounting and vendor systems work, but apparently we ended up in their system 4 different ways. And payment methods were horribly confused — it was a mess all around.

Invoice #144, for our Live Data Feed service that Amazon subscribes to, has been outstanding for almost three years now (it will be 3 years in January, but I wanted to get this money to come in in 2013). To be honest, there could be some confusion on our or on Amazon’s part — in fact the invoice may not be outstanding anymore. The fact of the matter is that we can’t seem to figure out what exactly is going on, but no money is flowing from Amazon to us and we’re owed somewhere around $20,000. (Which is near 10% of our annual budget incidentally)

For the last 6 months I’ve stepped up my pestering to get this resolved. I’ve been assured of progress the whole time, but nothing has happened. Promises of progress, then nothing. Again and again. I finally had an idea how to make change happen: Send Amazon an anniversary cake and post a picture of it publicly!

I even told my Amazon contacts about this idea, but it didn’t really catalyze anything. Then I finally set a deadline of Dec 2nd. The deadline came and went with more unfulfilled promises, so on December 2nd I picked up the phone and ordered a cake. Larsen’s Danish Bakery in Seattle were quite lovely to work with and created this cake for us:

Invoice #144 cake

A friend of mine went to the bakery, snapped this picture and then delivered the cake to Amazon HQ. It was accepted at the reception with promises that it would be delivered to its recipient. Then we started tweeting and Cory posted an entry to BoingBoing “Charity sends Amazon a cake celebrating 3d anniversary of unpaid invoice“.

For almost 24 hours nothing happened, but then I got email from my contact at Amazon telling me that the Accounts Payable team found a problem that was blocking payments from being sent to us, that the problem was now fixed and that they were investigating means to prevent this from happening again.

My contact goes on to say that a check will be cut tomorrow and overnighted to us. And that I should expect one more email telling me who on Amazon will be managing our relationship going forward. And, I have a voicemail pending from a person at Amazon’s accounts payable team to finally resolve the matter of the 3-year-old invoice.

Sending this cake was quite effective! For $30 I managed to wake up Amazon, send a clear message that our account was not being managed well, that their AP team has some issues to address and that I wanted to fix our relationship. From where I stand, I see that these issues are on track for being resolved. Thanks for stepping up your game, Amazon!

Finally, I would like to say that all the people I’ve dealt with at Amazon have been polite and were honestly trying to help me. The real reason for the delays, from what I can tell, is that Amazon employees are constantly overworked and that MetaBrainz is such a small organization that it’s hard for them to really find the time to manage this relationship.

I’m glad that Amazon didn’t just cancel their contract with us and I’m looking forward to an improved working relationship going forward.

Musing: compacting replication packets

I’ve recently been working to write more about MusicBrainz internals, and thoughts about the project. Often this blog doesn’t see many posts, and most of them are on official topics like releases, so putting this here is an experiment. I hope you’ll all enjoy hearing about something a bit less concrete (and perhaps less dry, more technical) than usual!

How Replication Works

Replication is a pretty important part of MusicBrainz, though perhaps to the average user it’s a bit hidden. For those readers who aren’t familiar with it, replication is the mechanism behind the live data feed: users have a PostgreSQL database and using tools we provide (or some third-party alternatives) regularly download and apply so-called “replication packets” which describe the changes to the database in a specific period.

Replication packets are .tar.bz2 archives with a collection of files in them: COPYING with the license info, README with a very sparse description of replication, SCHEMA_SEQUENCE with the version of the database schema the replication packet applies to, REPLICATION_SEQUENCE with a sequence number that the code uses to apply replication packets in the correct order, TIMESTAMP with, well, a timestamp, and finally a folder mbdump, which contains two files: dbmirror_pending and dbmirror_pendingdata. Those of you who use the MusicBrainz data dumps may recognize this format: it’s the same as the data dumps. dbmirror_pending and dbmirror_pendingdata are two database tables that are used by replication to store the data about changes while those changes are being applied to the database.
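To make the layout concrete, here is a minimal sketch that reads the small metadata files out of a packet using Python’s standard tarfile module. The function name is made up for this example, and it is not how musicbrainz-server or mbslave read packets; it only relies on the file names listed above.

```python
import posixpath
import tarfile

# The small text files that describe a packet (names as listed above).
METADATA_FILES = ("SCHEMA_SEQUENCE", "REPLICATION_SEQUENCE", "TIMESTAMP")


def read_packet_metadata(fileobj):
    """Return {file name: stripped contents} for a packet's metadata files.

    `fileobj` is a binary file object containing a .tar.bz2 archive.
    """
    meta = {}
    with tarfile.open(fileobj=fileobj, mode="r:bz2") as tar:
        for member in tar:
            name = posixpath.basename(member.name)
            if name in METADATA_FILES and member.isfile():
                data = tar.extractfile(member).read()
                meta[name] = data.decode("utf-8").strip()
    return meta
```

You would call it with something like open("packet.tar.bz2", "rb"), where the path is whatever packet you have downloaded.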

Let’s look more closely at what those two tables contain. dbmirror_pending has columns for a sequence ID, a fully-qualified table name, an operation, and a transaction ID. dbmirror_pendingdata has columns for a sequence ID, a boolean for if data specifies keys only, and finally for the change data itself. Jointly, these two tables combine, conceptually, into an ordered list of operations to perform on the database. Since I tend to think in JSON, here’s a way you could imagine a single operation looking:

{"table": "musicbrainz.release",
"operation": "update",
"existing_row": "\"id\"='1290306' \"name\"='620308' \"artist_credit\"='234861' \"release_group\"='1269028' \"status\"='1' \"packaging\"='1' \"country\"='150' \"language\"='120' \"script\"=",
"update_row": "\"id\"='1290306' \"gid\"='e37dfeea-0f25-48fa-85c0-b4d174ff172d' \"name\"='620308' \"artist_credit\"='234861' \"release_group\"='1269028' \"status\"='1' \"packaging\"='1' \"country\"='150' \"language\"='120' \"script\"= \"date_year\"='2009' \"date_month\"= \"date_day\"= \"barcode\"='8715777007870' \"comment\"='' \"edits_pending\"='3' \"quality\"='-1' \"last_updated\"='2013-05-15 13:01:05.065623+00'"}

This operation would specify that it should update the table musicbrainz.release by taking the row whose id is 1290306, gid is e37dfeea-0f25-48fa-85c0-b4d174ff172d, name is 620308, etc. (as listed in ‘existing_row’) and change it to have id 1290306, gid e37dfeea-0f25-48fa-85c0-b4d174ff172d, name 620308, etc. (as listed in update_row).
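If you want to poke at rows like these yourself, the key/value strings are easy to pull apart. Below is a rough Python sketch based only on the format visible in the example above: column names in double quotes, values in single quotes, and an empty value after the equals sign standing for NULL. The backslash-escape handling is an assumption on my part, and parse_row is a made-up name, not something from musicbrainz-server.

```python
import re

# One "column"='value' pair; the quoted value is optional (missing = NULL).
_PAIR = re.compile(r'''"(?P<col>[^"]+)"=(?:'(?P<val>(?:[^'\\]|\\.)*)')?''')


def parse_row(data):
    """Turn a change-data string into a dict of column -> value (or None)."""
    row = {}
    for match in _PAIR.finditer(data):
        value = match.group("val")
        if value is not None:
            # Assumed escaping: a backslash protects the next character.
            value = re.sub(r"\\(.)", r"\1", value)
        row[match.group("col")] = value
    return row
```

Every value comes back as a string (or None), since the packet itself does not carry column types.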

Compacting replication packets

So there’s a summary of how replication works, in rough, abstract terms. Now on to the real topic of the post: making replication packets more compact. As the current system works, every change is put into the replication packet; that is, if in the course of an hour (one replication packet), a table changes twenty times, then twenty operations will end up in the replication packet. Sometimes this is useful: some data users use database-level triggers to update their own derived information, and sometimes that requires seeing every change, even if it changes again very soon thereafter. However, for most people, replication is just a way to get their database up to date every hour. For these people, having all twenty updates is wasteful — as far as they’re concerned, it could just be a simple update from how the row looked at the start of the hour to how it looked at the end. This becomes especially true for the (currently rather underused) larger replication packets (daily and weekly, specifically), which include more changes (and thus more changes to the same rows).

To formalize this idea a bit more, let’s make one more abstraction: a chain of operations. A chain is, informally, an ordered list of operations on the same data (or the same row). The simplest chain is just one operation, and from there we can work with chains in a variety of ways:

- Chains can be combined: if the final state of a chain corresponds to the initial state of another chain, and the first chain’s final operation is before the initial operation of the second (without any other chains whose initial state corresponds to the first chain’s final state in-between), then the two chains can be combined into one chain.
- Chains can be reordered: if there is no way to combine two chains by the above rule (including by way of intermediate operations or chains), then the two chains operate on completely separate data and thus can happen in any order. (For the database-savvy who might have noticed: slave databases, which use replication, don’t have foreign keys or other constraints that might make this untrue.)
- Chains can be combined even more: if the final operation of a chain is a deletion, and the initial operation of another is an insertion, the two chains can be combined by turning the deletion + insertion into a single update. (As you might imagine, the ability to change the order of chains helps a lot here!)
- Chains can be collapsed: perhaps most important! Any number of updates in a chain can be turned into a single update, from the initial state of the first update to the final state of the last update. An insertion followed by an update can be turned into a single insertion, directly to the final state of the update. Likewise, an update followed by a deletion can be turned into a single deletion, directly from the initial state of the update. Finally, an insertion followed by a deletion can be ignored entirely, since it has no lasting effect on the database.
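The collapse rules are simple enough to sketch in a few lines of Python. This is just an illustration of the four rules in the last item, not existing MusicBrainz code: the Op class and function names are invented, row states are plain strings, and the deletion + insertion merge from the third item is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str                  # "insert", "update" or "delete"
    old: Optional[str] = None  # row state before (None for insertions)
    new: Optional[str] = None  # row state after (None for deletions)


def collapse_pair(first, second):
    """Collapse two consecutive operations on one row; None means no-op."""
    if first.kind == "delete" or second.kind == "insert":
        raise ValueError("not covered by the collapse rules")
    if first.new != second.old:
        raise ValueError("second operation does not start where first ended")
    if second.kind == "delete":
        # insert + delete cancels out; update + delete is one deletion.
        return None if first.kind == "insert" else Op("delete", old=first.old)
    # insert + update stays an insertion, update + update stays an update.
    return Op(first.kind, old=first.old, new=second.new)


def collapse_chain(ops):
    """Fold a whole chain (ordered list of Ops) into at most one Op."""
    result = ops[0]
    for op in ops[1:]:
        if result is None:
            raise ValueError("chain continues after the row was removed")
        result = collapse_pair(result, op)
    return result
```

For example, an insertion followed by two updates folds into one insertion that goes straight to the final state.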

Thus, by creating, combining, reordering, and collapsing chains, we can make replication packets do many fewer operations (which, in turn, can make the packets smaller and have them apply faster). I’ve still glossed over the details of how this could be implemented, though, so to wrap things up, here’s a basic algorithm for optimizing packets:

Loop through operations in the order ProcessReplicationChanges (the script responsible for applying changes to the database from a replication packet) would: order the transactions in ascending order by the maximum sequence ID within the transaction, and within a transaction by ascending sequence ID.
Take action depending on the type of operation:
- If an insertion: see if the output already contains a deletion on the same table. If so, take the initial state of the deletion and the final state of the insertion and output an update instead of the deletion. If no such deletion exists, simply copy the insertion to the output.
- If an update: see if the output already contains an insertion or an update on the same data (that is, find an operation whose final state matches the initial state of the update). If you find one, replace it with an operation of the same type (and initial state, if applicable) but with the final state of the new update instead of what it previously included. If you don’t find one, just copy over the update to the output.
- If a deletion: see if the output already contains an insertion on the same data. If it does, remove it, add nothing to the output, and move on. If not, look for an update on the same data. If you find one, replace it with a deletion from the initial state of the update. If not, look for an insertion on the same table. If you find one, replace it with an update from the initial state of the deletion to the final state of the insertion. Finally, if you found nothing, copy the deletion to the output.
Dump the output as a replication packet. Transactions shouldn’t matter, so put them all in the same transaction.
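The ordering in the first step is worth spelling out, since it is the part that is easy to get wrong. Here is a small Python sketch of just that ordering, taking (sequence ID, transaction ID) pairs as they would come from dbmirror_pending. It is not the ProcessReplicationChanges code, and the function name is made up.

```python
from collections import defaultdict


def replication_order(pending):
    """Return sequence IDs in the order they should be processed.

    `pending` is an iterable of (seq_id, xid) pairs. Transactions are
    ordered by their highest sequence ID, and operations inside each
    transaction by ascending sequence ID.
    """
    by_xid = defaultdict(list)
    for seq_id, xid in pending:
        by_xid[xid].append(seq_id)
    transactions = sorted(by_xid.values(), key=max)
    return [seq_id for txn in transactions for seq_id in sorted(txn)]
```

Note that this keeps the whole packet in memory, which is the same trade-off as loading a packet and sorting it there rather than letting a database do the ordering.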

Final notes, FAQs

Why isn’t this already being done? Daily and weekly packets are relatively new, and since some data users do want to see every single operation, it doesn’t make sense to do this to the hourly packets. It’s also somewhat annoying to get the operations in the right order, because they aren’t in the replication packet dump, due to how transactions have to be processed. The musicbrainz-server code gets around this by importing the data to a PostgreSQL table and letting the database do the work of putting things in the right order. mbslave instead loads the entire packet into memory and sorts it there, which obviously is potentially dangerous with larger packets (which, as noted above, are more likely to benefit from this process). Altogether, the safest way to implement this would be to mimic musicbrainz-server’s process, but to do this on the production servers it would need to use different table names so as to not interfere with the normal replication packet creation process. But mostly, because nobody’s written it yet.
How much benefit would this really bring? Simple answer: I don’t really know, but I know it’s some. Autoedits (additions, especially) have a tendency to produce multiple operations where one would do fine, because they first increment the ‘edits_pending’ column of whatever they’re editing in one operation, then in the process of applying the edit (automatically, and immediately, in the same transaction) decrement it. Editors also tend to change a bunch of different things about the same entity all at once, but sometimes not in one edit but rather in several. Of course, any deletion is easily counterbalanced by the many insertions that happen all the time in MusicBrainz; perhaps most notably, daily scripts run to clean up unused entities, so any packet that includes those changes will probably be able to collapse some insertion + deletion pairs. So there are several cases where obvious chains would appear already.

Hopefully those of you who’ve made it down this far found this enlightening — or, at least, interesting. At some point in the future this might be something we do with the daily and weekly replication packets we’re already creating (currently just by concatenating together hourly packets), but for now there’s no solid plan to do so. Thus, just musing for now!

Thanks for reading!

Server update, 2013-11-25

Hello again! We’ve got another freshly-pressed release of musicbrainz-server, just sent out to our agents in the field. Thanks to JesseW, Freso, and nikki for supplementing my and ollie’s work this round.

Some things you might be excited about this release:

Artists with only standalone recordings will now have them shown on the overview tab, similar to artists with only VA releases (thanks JesseW!)
The newly-added Bandcamp relationship for artists and labels should now clean up and autoselect the type (thanks Freso)
Images in Wikimedia Commons will now appear on artist and place pages (thanks nikki!)
Place coordinates now support a few more formats, to support comma as a decimal separator and to support the format used on the Japanese Wikipedia.
You can now provide a list of statistics to the timeline graph by separating the raw statistic names with ‘+’. For example, showing the pace of addition of geonames URLs to areas: https://musicbrainz.org/statistics/timeline/count.area+count.ar.links.l_area_url.geonames#+r-
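If you build such links from a script, the URL is just the raw statistic names joined with ‘+’. A tiny Python sketch (the helper name is made up, and it does not reproduce the ‘#+r-’ fragment from the example link, whose meaning isn’t described here):

```python
TIMELINE_BASE = "https://musicbrainz.org/statistics/timeline/"


def timeline_url(statistic_names):
    """Build a timeline graph URL from raw statistic names."""
    if not statistic_names:
        raise ValueError("at least one statistic name is required")
    return TIMELINE_BASE + "+".join(statistic_names)
```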

Otherwise, a variety of bug fixes and small improvements. We hope you like it! The tag for this release is v-2013-11-25, which this week I’ve remembered to push to both GitHub and Bitbucket.

Bug

[MBS-6529] – “An error occured while loading this edit” for an old edit
[MBS-6888] – XML Web Service omits ASINs in output for Releases which have ASINs assigned
[MBS-6902] – ISE: Caught exception in MusicBrainz::Server::Controller::Artist->edits “Can’t call method “id” on an undefined value at lib/MusicBrainz/Server/Data/Utils.pm line 410, line 3.”
[MBS-6924] – Merging places does not copy the address to the target
[MBS-6925] – IMDb artist autoselect shouldn’t block company
[MBS-6935] – Cover art uploader incorrectly falls back to old uploader in Safari
[MBS-6947] – Release inline credits broken for recording-recording relationships when both recordings are in the release
[MBS-6951] – tracks with recording pending edits are not marked as edit pending any more in release pages
[MBS-6956] – Medium title(s) not shown on release pages
[MBS-6960] – AC bubble not linking to the artists
[MBS-6961] – AC not shown on tracks even though it differs from release AC
[MBS-6963] – Release editor incorrectly claims Various Artists has been used for tracks
[MBS-6966] – Edit/remove relationship links shown on release page when not logged in
[MBS-6980] – inline search : oversized width outside of screen
[MBS-6986] – Match recordings by MBID not working any more

Improvement

[MBS-1754] – Display standalone recordings on overview if artist has no release groups
[MBS-6400] – Display on release merge edit the same info displayed on merge preview page (release events/labels/catalog numbers etc)
[MBS-6755] – Create whitelist and display images in the sidebar
[MBS-6891] – “Add Label”/“Add Event” links/buttons need cursor:pointer
[MBS-6899] – Accept coordinates using a comma as a decimal point
[MBS-6929] – Support coordinates from the Japanese Wikipedia
[MBS-7011] – Allow specifying a list of arbitrary lines to graph in the timeline graph.

New Feature

[MBS-6998] – Add autoselect for Bandcamp URLs for Labels and Artists

Task

[MBS-6920] – Add muzikum.eu to the lyrics whitelist

Update: an earlier version of this post failed to include MBS-6755

Venue and Studio Support: Introducing Places

MusicBrainz now supports venues and studios via our new “place” entity!

This was one of our Google Summer of Code projects for this year and many thanks to Nicolás Tamargo for his work on it. We released his work a few weeks ago and after a few initial hiccups, it’s looking good and we want to let you all know about it. 🙂

So what can we do with places?

The most obvious thing we can do now is store information about recording, mixing and mastering locations.

For example, the studios listed in the credits for Universe by Kyoko Fukada:

and the venue for the recordings on Live in Cartoon Motion by Mika:

We can of course link the place to a variety of external sites, as can be seen in the list of URLs for Wembley Arena:

Some places are made up of several parts. In those cases, we can link one place as being part of another. For example the various studios at Abbey Road Studios:

or the hall and theatre of the Barbican Centre:

We were already able to add engineers to the database as artists; now we can also say which studio they work at, as seen here for the studio Railroad Tracks:

Many orchestras and sometimes other artists have a home venue where they perform on a regular basis. These can now be linked, as you can see for the Barbican Centre: Barbican Hall:

A premiere is sometimes held for a work and now we can link those works to where the premiere was held, e.g. the following works which were premiered at Carnegie Hall:

The place can also have coordinates, which make it possible to pinpoint the location on a map. The MusicBrainz website doesn’t show any maps at present, but here’s a map of all places with coordinates by Mineo:

Events?

No, we do not yet support events.

Thanks to nikki for writing this post.

Annual report for 2012 finally posted

I finally completed the 2012 annual report! This year has been busy, so I apologize for finishing it this late.

Our cost per 1M web hits dropped significantly, we finished our first year in the red and we created 1/4 of all of our edits to date in 2012! Go read the report to find out who was the top editor, the top voter and other interesting tidbits about MusicBrainz in 2012.

Thanks to Navap, Nikki and Reosarevok for helping in putting this report together!

Server update, 2013-11-11

Another fortnight, another release; thanks to Freso, warp, bitmap, reosarevok, and the MusicBrainz team for their work!

The tag for this release is v-2013-11-11. We had some small problems during release; sorry to anyone who ran into an error during the short period before we reverted things to get our server configurations in order!

Bug

[MBS-4438] – Release editor: Track durations are not loaded the first time you access the recordings tab
[MBS-5592] – Relationship editor permits multiple identical relationships
[MBS-6066] – Random internal server errors when searching
[MBS-6298] – ‘View all relationships’ links to a tab that’s not in the list of tabs
[MBS-6449] – Logic for showing “at least” in the edit search for the number of results is wrong
[MBS-6661] – work_attribute check is wrong
[MBS-6673] – Cover art uploading not working in IE
[MBS-6689] – Pasting an MBID initiates a search
[MBS-6769] – Inline search shows sort name even when identical to name
[MBS-6785] – Tagger button broken in Opera
[MBS-6851] – Can’t relate recordings to places in the relationship editor
[MBS-6858] – Relationship type documentation not accessible
[MBS-6872] – Tags page does not show places
[MBS-6873] – delete_unused_url and delete_orphaned_recordings don’t account for places
[MBS-6878] – Inline search check for non-latin characters treats Vietnamese characters as non-latin
[MBS-6883] – Relationship editor fails to load existing relationships
[MBS-6884] – Guess case treats “studio” in place names as extra title information
[MBS-6892] – Relationship editor needs “(more documentation)” link after relationship type description
[MBS-6900] – Cannot edit places with empty coordinates
[MBS-6901] – Places lat/long parser does not understand »+55° 54′ 14.49″, +8° 31′ 51.64″«
[MBS-6905] – beta: Comma shown in coordinates field when editing places with no coordinates
[MBS-6907] – beta: Coordinates parsing should not require seconds
[MBS-6908] – Internal server error searching for multiple editor flags in the edit search
[MBS-6909] – “Editor flag is not” in edit search does not work
[MBS-6914] – beta: Clicking result in inline search hides popup instead of selecting result
[MBS-6916] – beta: No dropdown in inline search when there are no results
[MBS-6921] – Artist Credit join phrase not displayed in tracklist

Improvement

[MBS-1421] – Require an edit note for all destructive edits
[MBS-2985] – Report: download relationships in non-digital releases.
[MBS-6239] – Use Wikidata URLs to fetch interwiki links
[MBS-6353] – Display on release group merge edit the same info displayed on merge preview page (mainly release group types)
[MBS-6394] – Automatically cut out hyphens during ISRC addition
[MBS-6456] – Show country and subdivision when displaying areas in sidebar and on profile pages
[MBS-6554] – Uppercase letters when entering ISRCs instead of rejecting lowercase ones
[MBS-6771] – Use localised aliases in the inline search where possible
[MBS-6824] – Coordinate fields should understand and convert degree minute second format
[MBS-6828] – Coordinates should be editable as one field
[MBS-6830] – Coordinate fields should strip degrees signs
[MBS-6831] – Coordinate fields should understand directions
[MBS-6832] – Coordinates should be presented better
[MBS-6838] – Release group dropdown in add release does not contain sufficient info
[MBS-6839] – Video attribute should be shown in merge recording edits
[MBS-6846] – Strip excessive digits in coordinates
[MBS-6859] – Inline search for places should show localised aliases
[MBS-6882] – Titles for video pages should say “Video”
[MBS-6911] – Make MB.Control.ArtistCredit a view model for knockout, use $.widget for MB.Control.Autocomplete

Task

[MBS-6729] – Add whosampled to the Other databases whitelist
[MBS-6893] – Add “Rockens Danmarkskort” to “Other databases” whitelist for Places

The BBC unveils a service that tweets to artists when their music is played…

… and it is built using MusicBrainz data! Michael Smethurst, a good friend of MusicBrainz, hacked up this service in the space of 2 days and writes:

The original idea came from a friend whose music occasionally gets played on Radio 1, 1xtra and 6Music. Almost always he missed this and either found out later from a friend or never found out at all. But he does use various bits of social media (including Twitter) to make contact with fans and promote his releases and live appearances.
. . .
To power online music services such as BBC iPlayer Radio, Playlister and /music the BBC uses metadata provided by MusicBrainz, a community maintained music encyclopedia. If you use Twitter and you’re a music artist or an agent or a publicist or similar and would like now playing notifications you need to check that your Twitter account handle is in MusicBrainz.

Thanks for creating such an awesome service, Michael. I know MusicBrainz contributors love how the BBC uses their data — I wish more people made such creative use of our data!

Fire damages the Internet Archive

A fire at the Internet Archive (our friends!) has caused $600,000 in damage. Fortunately no one was harmed and no data was lost:

A fire at the Internet Archive’s San Francisco scanning center has destroyed an estimated $600,000 of digitization and scanning equipment. Fortunately no one was injured in the blaze, but the property damage has ruined “some physical materials” that were yet to be digitized, and restricted the nonprofit organization’s ability to record the history of the web.

MusicBrainz just donated $50 to the Internet Archive and asks you to consider making a donation as well.

VMWare image of 2013-10-14 release available

I’ve released the VMWare version of the 2013-10-14 schema change release:

BitTorrent: musicbrainz-server-vmware-2013-10-14.ova.torrent
Direct download: musicbrainz-server-vmware-2013-10-14.ova

This VM is 8.7 GB in size and is built on the latest version of VMWare (Fusion in my case). If you’re using a VMWare product, then use this image.

The documentation for this VM has been updated to reflect the latest changes. Please make sure to read this page while you wait for your download!