December 2013 – MetaBrainz Blog

New editing feature: adding new entities from inline search fields

As mentioned in the blog post for the 2013-12-23 server release, a useful new feature for editors is the option to add new entities directly from inline search fields.

There’s an “Add a new [entity]” button at the bottom of (most) search result menus, which upon clicking will open up a dialog in the page. The dialog contains a form identical to the one you’re used to when creating entities the old-fashioned way — that is, from the Editing menu at the top of the website.

Once you’ve successfully added a new entity from the dialog, it’ll be automatically loaded into the search field you started from.

The visible exceptions to this feature are that you can’t add new areas (only location editors can add those), or releases (because those take a lot longer to add, and it wouldn’t be useful to do so in a small dialog that you can accidentally close), and finally, you can’t spawn an add-entity dialog within another add-entity dialog. 🙂

Another change that went along with this feature (that editors should be aware of) is that we now require artists and labels to be selected on the “Release Information” tab of the release editor.

Previously, you could enter plain text into these fields, proceed to the next tab without selecting a search result, and handle creation on the “Add Missing Entities” tab. Because “Add Missing Entities” has a very limited UI, in the future we’d like to remove it in favor of easier entity-creation on the other tabs. This is a small step towards that. Note that you can still use “Add Missing Entities” for track artists — this change only affects the release artist and release labels on the first tab.

Don’t hesitate to report any bugs or suggestions about this feature to our issue tracker: http://tickets.musicbrainz.org/

All MusicBrainz sites downtime

On Sunday, December 29th at 1pm PST (2pm AZ, 4pm EST, 9pm UK, 10pm CET), we’re going to swap out our network switch. During this time all MusicBrainz sites hosted in California will be unavailable (that is, all sites save for the primary and secondary FTP mirrors and the FreeDB gateway).

The outage itself will not start exactly at 1pm, but we’re going to start executing our plan at 1pm. The exact time for the outage will be announced via Twitter and via the banner on musicbrainz.org.

We hope that this outage will last only 10-15 minutes, but as these things typically go, you never know how long it will really take.

Sorry for the inconvenience.

Server update, 2013-12-23

Another two weeks and once again, a nice pile of fixed bugs! Thanks to uk/chirlu and reosarevok for fixing a few bugs along with the MetaBrainz team.

Some bits that might interest you:

It’s now possible to get an add-entity dialog right from the inline search fields. This should make it much easier to add relationships, especially, as well as creating new artists and labels where needed in the release editor. This also ends up improving the already-existing dialog for creating works in the relationship editor, to include the full functionality of the add work page. We’re hoping to get a fuller blog post on this feature up in the coming days.
The artist overview now filters out many release groups by default, for a less cluttered view that should more reliably include only the main discography of bigger artists, rather than including a lot of bootlegs, concert recordings, and other releases that aren’t always useful to see. Of course, there’s still a button to see all release groups.
Perhaps less obvious than other changes, but a lot of colors and some other minor styles have changed slightly on the site, as part of a reorganization of our CSS files.

… and, of course, the usual crop of bugs and small improvements. There’s a full list below. The tag for this release is v-2013-12-23.

Bug

[MBS-5876] – Internal server error requesting a non-existent collection via the webservice
[MBS-6547] – Relationship editor popups go out of the screen if titles are too long
[MBS-6556] – Pressing enter doesn’t submit create work dialog in relationship editor
[MBS-6572] – label/edit_form claims /Label_Comment is a guideline
[MBS-6772] – regression : can’t reorder more than one cover art in the same edit
[MBS-6867] – Incorrect help text ‘You must select an existing release group. If you wish to move this release, use the “change release group” action from the sidebar.’
[MBS-6928] – Creating release via CD TOC is broken (for Various Artists releases)
[MBS-6977] – Relationship editor regression: track artists no longer displayed
[MBS-6990] – beta : clicking in a release opens a recording in a new window
[MBS-7025] – Labels not being shown in merge releases edits
[MBS-7057] – Internal server error when attempting to do merges with different entities simultaneously
[MBS-7058] – /relationship-editor endpoint ISEs if you don’t include “ended” fields
[MBS-7061] – The tracklistings are gone in the release view.
[MBS-7069] – beta: Buttons wrongly positioned on edit note tab of release editor
[MBS-7077] – beta: sizing issues with dialogs/autocompletes in relationship editor

Improvement

[MBS-1467] – Allow new artists to be created directly from the “Relate to…” search box
[MBS-3420] – Allow creating new release groups when editing existing releases via the release editor
[MBS-5313] – Add official (default) and unofficial displays for Overview
[MBS-5446] – The work creation widget should have a “Guess Case” part
[MBS-5515] – Allow adding artists in the relationship editor
[MBS-6428] – Move to LESS for more structure in the stylesheets
[MBS-6437] – button :active state only fires for <a> elements
[MBS-6583] – Make cache prefix names for types more obvious
[MBS-6879] – Add reports of entities with deprecated relationships
[MBS-6926] – Remove %E2%80%8E from the end of URLs
[MBS-7052] – Set cover art page says “No releases have cover art” when it’s not true
[MBS-7066] – Load area codes in initial load, instead of separately.
[MBS-7067] – Use rel=nofollow on external links.

New Feature

[MBS-5442] – The work creation widget should allow to create a work with an ISWC

Task

[MBS-7055] – Fix the spelling of eligble_for_cleanup

MusicBrainz Meetup: Chicago, IL, USA, 25-26 January 2014

In case the name didn’t tip you off, this is rather more casual of a get-together than our usual summits, but for those of you with the inclination, a free weekend and a decent way of getting to Chicago: we discovered that our fearless leader Rob was going to be in the same city as one of our developers (bitmap) and figured we’d fly me (the other developer) in too and make a thing of it! We’ll be hanging out in-person through January 25-26, plus probably part of the evening of January 24th, and we’d love to have you join us.

We don’t have much by way of details, at present, in part because this is quite informal. However, if you’re interested, we have a wiki page with arrival times for those of us with plane tickets already, and which we’ll update with any other plans we end up making. If you’d like to come, please add yourself!

MusicBrainz Read Only Between 2PM – 3PM UTC

We need to do essential maintenance on the main MusicBrainz database server today (specifically upgrading to Ubuntu 12.04 LTS). We’re aiming to get this work completed within an hour, and will require MusicBrainz to be in read-only mode for the duration of the upgrade. We’re going to go read-only from 2PM UTC (6AM PST, 9AM EST, 3PM CET) to begin this work.

Sorry for the inconvenience!

Server update, 2013-12-09

Hello! It’s been another two weeks and it’s time for another release of musicbrainz-server. Thanks to intgr, Freso, and bitmap for their contributions to this release alongside ocharles and me. Congratulations also to bitmap, who becomes part of the official MetaBrainz team today.

This release is primarily bug fixes and small tweaks, though perhaps the most visible change is that Places and Works now have custom entity icons like most of our other entities already did.

The git tag for this release is v-2013-12-09.

Bug

[MBS-6932] – Wikipedia extracts lack licensing information
[MBS-6949] – Empty edits being submitted
[MBS-6954] – Artist disambiguation comments not always shown within tracklists
[MBS-6970] – Internal Server Error (ISE) on overview with wikipedia/wikidata ARs
[MBS-6978] – can’t close inline search
[MBS-6983] – Merge release edit produces incorrect UI display
[MBS-6984] – “Need an edit note” on merge deselects target entity
[MBS-6987] – regression: <span class=”name-variation”> has disappeared from tracklists
[MBS-6996] – RG merge edit shows disambiguation twice
[MBS-7000] – On /edits page, “merge release groups” table displays incorrectly
[MBS-7005] – Place disambiguation not being shown in the inline search
[MBS-7024] – Some recent replication packet breaks slave servers with collate indexes
[MBS-7026] – Wikipedia abstract is incorrectly in Chinese

Improvement

[MBS-2713] – Work has no musicbrainz style icon
[MBS-6710] – Match Mora and Recochoku URLs with “purchase for download” relationship type
[MBS-7036] – Improve alignment of old/new fields in some edit diffs

Amazon cake update

Yesterday evening I had a call with my contacts at Amazon and a person from the Accounts Payable department. Over the last two days they were able to work out the kinks in their accounting with regard to the MetaBrainz Foundation.

They verified that the outstanding invoices from our perspective were correct, including the now infamous invoice #144. Shortly after the call, a check for $22,500 was cut and will arrive in California by 10:30am today. I’ve also received a complete history of all of the payments made to the MetaBrainz Foundation and I’m happy to say that everything looks great to me.

Also, an issue surrounding this invoice was pointed out to me: If invoice #144 wasn’t outstanding, all of this would’ve been a bit drastic. I agree with that, but what I failed to mention in my last blog post was that if invoice #144 wasn’t paid, then invoice #200 (which is 3 months younger than #144) was outstanding. The matter at hand was that there was a 3 year, or a nearly 3 year, old invoice outstanding. Personally, I really wanted to gain clarity around what happened nearly 3 years ago and finally put the issue to bed.

Finally, I would like to commend Amazon on how they handled matters in the last week. When prodded hard enough, Amazon got their act together, got to work and figured out this mess and then swiftly cut a check. I’ll keep posting updates about this until the check is in the bank and has cleared, but it certainly looks like we’re heading to a complete resolution of this matter early next week.

Thanks Amazon!

Important Update on Replication

We’ve been informed that some people have noticed their MusicBrainz slave servers have stopped applying replication packets. We’ve tracked the problem down and now have a fix for it.

How Do I Tell If I’m Affected?

To determine if you are affected, connect to the MusicBrainz PostgreSQL database (we recommend using ./admin/psql from your musicbrainz-server checkout) and run the following:

SELECT now() - last_replication_date FROM replication_control;

If you see an interval that is larger than a few days, there’s a good chance you are affected by this bug and should upgrade.

How Do I Fix This?

Fixing this is easy: you simply need to update your Git checkout. We suggest the following:

cd musicbrainz-server
git fetch --tags origin
git checkout v-2013-10-14-replication-fix-2

EDIT: formerly used a different (broken) tag and recommended a different method of fetching tags.

When replication next runs, the data that cannot be applied will be rewritten and replication will continue as normal.

Using cakes for social engineering

For the past few years we’ve had some accounting difficulties with one of our customers: Amazon. I have no idea how their accounting and vendor systems work, but apparently we ended up in their system 4 different ways. And payment methods were horribly confused — it was a mess all around.

Invoice #144, for our Live Data Feed service that Amazon subscribes to, has been outstanding for almost three years now (it will be 3 years in January, but I wanted to get this money to come in in 2013). To be honest, there could be some confusion on our or on Amazon’s part; in fact, the invoice may not be outstanding anymore. The fact of the matter is that we can’t seem to figure out what exactly is going on, but no money is flowing from Amazon to us and we’re owed somewhere around $20,000 (which, incidentally, is nearly 10% of our annual budget).

For the last 6 months I’ve stepped up my pestering to get this resolved. I’ve been assured progress for the past 6 months, but nothing has happened. Promises of progress, then nothing. Again and again. I finally had an idea how to make change happen: Send Amazon an anniversary cake and post a picture of it publicly!

I even told my Amazon contacts about this idea, but it didn’t really catalyze anything. Then I finally set a deadline of Dec 2nd. The deadline came and went with more unfulfilled promises, so on December 2nd I picked up the phone and ordered a cake. Larsen’s Danish Bakery in Seattle were quite lovely to work with and created this cake for us:

Invoice #144 cake

A friend of mine went to the bakery, snapped this picture and then delivered the cake to Amazon HQ. It was accepted at the reception with promises that it would be delivered to its recipient. Then we started tweeting and Cory posted an entry to BoingBoing: “Charity sends Amazon a cake celebrating 3d anniversary of unpaid invoice”.

For almost 24 hours nothing happened, but then I got email from my contact at Amazon telling me that the Accounts Payable team found a problem that was blocking payments from being sent to us, that the problem was now fixed and that they were investigating means to prevent this from happening again.

My contact goes on to say that a check will be cut tomorrow and overnighted to us, and that I should expect one more email telling me who at Amazon will be managing our relationship going forward. I also have a voicemail pending from a person on Amazon’s accounts payable team to finally resolve the matter of the 3 year old invoice.

Sending this cake was quite effective! For $30 I managed to wake up Amazon, send a clear message that our account was not being managed well, that their AP team has some issues to address and that I wanted to fix our relationship. From where I stand, I see that these issues are on track for being resolved. Thanks for stepping up your game, Amazon!

Finally, I would like to say that all the people I’ve dealt with at Amazon have been polite and were honestly trying to help me. The real reason for these problems, from what I can tell, is that Amazon employees are constantly overworked and that MetaBrainz is such a small organization that it’s hard for them to really find the time to manage this relationship.

I’m glad that Amazon didn’t just cancel their contract with us and I’m looking forward to an improved working relationship going forward.

Musing: compacting replication packets

I’ve recently been working to write more about MusicBrainz internals, and thoughts about the project. Often this blog doesn’t see many posts, and most of them are on official topics like releases, so putting this here is an experiment. I hope you’ll all enjoy hearing about something a bit less concrete (and perhaps less dry, more technical) than usual!

How Replication Works

Replication is a pretty important part of MusicBrainz, though perhaps to the average user it’s a bit hidden. For those readers who aren’t familiar with it, replication is the mechanism behind the live data feed: users have a PostgreSQL database and, using tools we provide (or some third-party alternatives), regularly download and apply so-called “replication packets”, which describe the changes to the database in a specific period.

Replication packets are .tar.bz2 archives with a collection of files in them: COPYING with the license info, README with a very sparse description of replication, SCHEMA_SEQUENCE with the version of the database schema the replication packet applies to, REPLICATION_SEQUENCE with a sequence number that the code uses to apply replication packets in the correct order, TIMESTAMP with, well, a timestamp, and finally a folder mbdump, which contains two files: dbmirror_pending and dbmirror_pendingdata. Those of you who use the MusicBrainz data dumps may recognize this format: it’s the same as the data dumps. dbmirror_pending and dbmirror_pendingdata are two database tables that are used by replication to store the data about changes while those changes are being applied to the database.
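To make that layout concrete, here is a minimal Python sketch that opens a packet and reads the files named above. The function name read_packet_metadata is made up for illustration, and it assumes each metadata file holds a short piece of text; it is not the code musicbrainz-server itself uses.

```python
import tarfile

# Metadata files named in the post; each is assumed to hold a short text value.
METADATA_FILES = ("SCHEMA_SEQUENCE", "REPLICATION_SEQUENCE", "TIMESTAMP")


def read_packet_metadata(fileobj):
    """Return (metadata dict, sorted names of the tables under mbdump/)."""
    metadata = {}
    tables = []
    with tarfile.open(fileobj=fileobj, mode="r:bz2") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # Some archives prefix member names with "./"; drop it if present.
            name = member.name[2:] if member.name.startswith("./") else member.name
            if name in METADATA_FILES:
                raw = archive.extractfile(member).read()
                metadata[name] = raw.decode("utf-8").strip()
            elif name.startswith("mbdump/"):
                tables.append(name[len("mbdump/"):])
    return metadata, sorted(tables)
```

Called with a packet opened in binary mode, this should report the two tables described above, dbmirror_pending and dbmirror_pendingdata, alongside the sequence numbers and timestamp.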

Let’s look more closely at what those two tables contain. dbmirror_pending has columns for a sequence ID, a fully-qualified table name, an operation, and a transaction ID. dbmirror_pendingdata has columns for a sequence ID, a boolean for if data specifies keys only, and finally for the change data itself. Jointly, these two tables combine, conceptually, into an ordered list of operations to perform on the database. Since I tend to think in JSON, here’s a way you could imagine a single operation looking:

{"table": "musicbrainz.release",
"operation": "update",
"existing_row": "\"id\"='1290306' \"name\"='620308' \"artist_credit\"='234861' \"release_group\"='1269028' \"status\"='1' \"packaging\"='1' \"country\"='150' \"language\"='120' \"script\"=",
"update_row": "\"id\"='1290306' \"gid\"='e37dfeea-0f25-48fa-85c0-b4d174ff172d' \"name\"='620308' \"artist_credit\"='234861' \"release_group\"='1269028' \"status\"='1' \"packaging\"='1' \"country\"='150' \"language\"='120' \"script\"= \"date_year\"='2009' \"date_month\"= \"date_day\"= \"barcode\"='8715777007870' \"comment\"='' \"edits_pending\"='3' \"quality\"='-1' \"last_updated\"='2013-05-15 13:01:05.065623+00'"}

This operation would specify that it should update the table musicbrainz.release by taking the row whose id is 1290306, gid is e37dfeea-0f25-48fa-85c0-b4d174ff172d, name is 620308, etc. (as listed in ‘existing_row’) and change it to have id 1290306, gid e37dfeea-0f25-48fa-85c0-b4d174ff172d, name 620308, etc. (as listed in update_row).
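If you want to work with those row strings programmatically, a small parser is enough. The following Python sketch is only an illustration: parse_row is a hypothetical helper, and it assumes that a column with nothing after the equals sign means NULL and that quotes or backslashes inside a value are backslash-escaped.

```python
import re

# One column: "name"= followed by an optional single-quoted value.
# Assumption: a quote or backslash inside a value is backslash-escaped.
COLUMN = re.compile(r'''"([^"]+)"=(?:'((?:[^'\\]|\\.)*)')?''')


def parse_row(serialized):
    """Turn a serialized row into a dict; a column with no quoted value becomes None."""
    row = {}
    for match in COLUMN.finditer(serialized):
        name, value = match.group(1), match.group(2)
        if value is not None:
            # Undo the assumed backslash escaping.
            value = re.sub(r"\\(.)", r"\1", value)
        row[name] = value
    return row
```

Parsing both existing_row and update_row this way gives two dictionaries that can be compared column by column, which is handy for seeing what an update actually changed.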

Compacting replication packets

So there’s a summary of how replication works, in rough, abstract terms. Now on to the real topic of the post: making replication packets more compact. As the current system works, every change is put into the replication packet; that is, if in the course of an hour (one replication packet), a table changes twenty times, then twenty operations will end up in the replication packet. Sometimes this is useful: some data users use database-level triggers to update their own derived information, and sometimes that requires seeing every change, even if it changes again very soon thereafter. However, for most people, replication is just a way to get their database up to date every hour. For these people, having all twenty updates is wasteful — as far as they’re concerned, it could just be a simple update from how the row looked at the start of the hour to how it looked at the end. This becomes especially true for the (currently rather underused) larger replication packets (daily and weekly, specifically), which include more changes (and thus more changes to the same rows).

To formalize this idea a bit more, let’s make one more abstraction: a chain of operations. A chain is, informally, an ordered list of operations on the same data (or the same row). The simplest chain is just one operation, and from there we can work with chains in a variety of ways:

Chains can be combined: if the final state of a chain corresponds to the initial state of another chain, and the first chain’s final operation is before the initial operation of the second (without any other chains whose initial state corresponds to the first chain’s final state in-between), then the two chains can be combined into one chain.
Chains can be reordered: if there is no way to combine two chains by the above rule (including by way of intermediate operations or chains), then the two chains operate on completely separate data and thus can happen in any order. (For the database-savvy who might have noticed: slave databases, that use replication, don’t have foreign keys or other constraints which might make this not true).
Chains can be combined even more: If the final operation of a chain is a deletion, and the initial operation of another is an insertion, the two chains can be combined by turning the deletion + insertion into a single update. (As you might imagine, the ability to change the order of chains helps a lot here!)
Chains can be collapsed: perhaps most important! Any number of updates in a chain can be turned into a single update, from the initial state of the first update to the final state of the last update. An insertion followed by an update can be turned into a single insertion, directly to the final state of the update. Likewise, an update followed by a deletion can be turned into a single deletion, directly from the initial state of the update. Finally, an insertion followed by a deletion can be ignored entirely, since it has no lasting effect on the database.
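One way to see why these rules hang together: the net effect of a chain depends only on the state it starts from and the state it ends in. Here is a minimal Python sketch of that idea. The Op type and the collapse function are invented for illustration, and row states are simplified to plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str               # "insert", "update" or "delete"
    before: Optional[str]   # row state before the operation (None for an insert)
    after: Optional[str]    # row state after the operation (None for a delete)


def collapse(chain):
    """Reduce a chain (an ordered list of Ops) to at most one operation."""
    if not chain:
        return None
    for prev, nxt in zip(chain, chain[1:]):
        # Each operation must start where the previous one ended; a delete
        # ends in "no row" (None), which is where an insert starts.
        if prev.after != nxt.before:
            raise ValueError("operations do not form a chain")
    before, after = chain[0].before, chain[-1].after
    if before is None and after is None:
        return None                        # insert ... delete: no lasting effect
    if before is None:
        return Op("insert", None, after)   # insert + updates
    if after is None:
        return Op("delete", before, None)  # updates + delete
    return Op("update", before, after)     # updates, or delete + insert
```

For example, collapsing an insertion followed by an update yields a single insertion of the final state, and a deletion followed by an insertion yields a single update.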

Thus, by creating, combining, reordering, and collapsing chains, we can make replication packets do many fewer operations (which, in turn, can make the packets smaller and have them apply faster). I’ve still glossed over the details of how this could be implemented, though, so to wrap things up, here’s a basic algorithm for optimizing packets:

Loop through operations in the order ProcessReplicationChanges (the script responsible for applying changes to the database from a replication packet) would: order the transactions in ascending order by the maximum sequence ID within the transaction, and within a transaction by ascending sequence ID.
Take action depending on the type of operation:
- If an insertion: see if the output already contains a deletion on the same table. If so, take the initial state of the deletion and the final state of the insertion and output an update instead of the deletion. If no such deletion exists, simply copy the insertion to the output.
- If an update: see if the output already contains an insertion or an update on the same data (that is, find an operation whose final state matches the initial state of the update). If you find one, replace it with an operation of the same type (and initial state, if applicable) but with the final state of the new update instead of what it previously included. If you don’t find one, just copy over the update to the output.
- If a deletion: see if the output already contains an insertion on the same data. If it does, remove it, add nothing to the output, and move on. If not, look for an update on the same data. If you find one, replace it with a deletion from the initial state of the update. If not, look for an insertion on the same table. If you find one, replace it with an update from the initial state of the deletion to the final state of the insertion. Finally, if you found nothing, copy the deletion to the output.
Dump the output as a replication packet. Transactions shouldn’t matter, so put them all in the same transaction.
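The steps above can be sketched in Python roughly as follows. This is an untested-in-production illustration, not anything musicbrainz-server does today: the Op type and the replay_order, find and compact names are all made up, row states are simplified to full-row strings (so the "keys only" case is ignored), and the final dump step, which would also need to renumber sequence IDs, is left out.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Op:
    seq_id: int
    xid: int                # transaction ID
    table: str
    kind: str               # "insert", "update" or "delete"
    before: Optional[str]   # None for an insert
    after: Optional[str]    # None for a delete


def replay_order(ops):
    """Transactions by their highest sequence ID, then by sequence ID within each."""
    last_seq = {}
    for op in ops:
        last_seq[op.xid] = max(last_seq.get(op.xid, op.seq_id), op.seq_id)
    return sorted(ops, key=lambda op: (last_seq[op.xid], op.seq_id))


def find(output, table, kinds, ends_in=None):
    """Index of the first output op on `table` with a kind in `kinds` (and, if
    `ends_in` is given, whose final state equals it), or None if there is none."""
    for i, op in enumerate(output):
        if op.table == table and op.kind in kinds and ends_in in (None, op.after):
            return i
    return None


def compact(ops):
    output = []
    for op in replay_order(ops):
        if op.kind == "insert":
            # An earlier deletion on the same table plus this insertion = update.
            i = find(output, op.table, {"delete"})
            if i is None:
                output.append(op)
            else:
                output[i] = replace(output[i], kind="update", after=op.after)
        elif op.kind == "update":
            # Extend an earlier insertion or update that ends where this one starts.
            i = find(output, op.table, {"insert", "update"}, ends_in=op.before)
            if i is None:
                output.append(op)
            else:
                output[i] = replace(output[i], after=op.after)
        else:  # deletion
            i = find(output, op.table, {"insert"}, ends_in=op.before)
            if i is not None:
                del output[i]  # inserted then deleted: no lasting effect
                continue
            i = find(output, op.table, {"update"}, ends_in=op.before)
            if i is not None:
                output[i] = replace(output[i], kind="delete", after=None)
                continue
            i = find(output, op.table, {"insert"})
            if i is not None:
                # Unrelated insertion on the same table: merge into one update.
                output[i] = replace(output[i], kind="update", before=op.before)
            else:
                output.append(op)
    return output
```

The linear scan in find keeps the sketch short; for daily or weekly packets you would presumably want an index keyed on table and final state instead.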

Final notes, FAQs

Why isn’t this already being done? Daily and weekly packets are relatively new, and since some data users do want to see every single operation, it doesn’t make sense to do this to the hourly packets. It’s also somewhat annoying to get the operations in the right order, because they aren’t stored in that order in the replication packet dump, due to how transactions have to be processed. The musicbrainz-server code gets around this by importing the data to a PostgreSQL table and letting the database do the work of putting things in the right order. mbslave instead loads the entire packet into memory and sorts it there, which obviously is potentially dangerous with a larger packet (which, as noted above, is more likely to derive value from this process). Altogether, the safest way to implement this would be to mimic musicbrainz-server’s process, but to do this on the production servers it would need to use different table names so as to not interfere with the normal replication packet creation process. But mostly, because nobody’s written it yet.
How much benefit would this really bring? Simple answer: I don’t really know, but I know it’s some. Autoedits (additions, especially) have a tendency to produce multiple operations where one would do fine, because they first increment the ‘edits_pending’ column of whatever they’re editing in one operation, then in the process of applying the edit (automatically, and immediately, in the same transaction) decrement it. Editors also tend to change a bunch of different things about the same entity all at once, but sometimes not in one edit but rather in several. Of course, any deletion is easily counterbalanced by the many insertions that happen all the time in MusicBrainz; perhaps most notably, daily scripts run to clean up unused entities, so any packet that includes those changes will probably be able to collapse some insertion + deletion pairs. So there are several cases where obvious chains would appear already.

Hopefully those of you who’ve made it down this far found this enlightening — or, at least, interesting. At some point in the future this might be something we do with the daily and weekly replication packets we’re already creating (currently just by concatenating together hourly packets), but for now there’s no solid plan to do so. Thus, just musing for now!

Thanks for reading!