Leaked email address incident: 2020-11-23

We’re saddened to write that we’ve let some of our users down by accidentally leaking their email addresses and birth dates via a bug in the web pages of musicbrainz.org. This caused some users to receive unwanted spam emails.

However, we would like to emphasize that no passwords, password hashes or any other bits of private user information other than email addresses and birth dates were leaked.

If you have never added or edited an annotation on MusicBrainz, then your email address and birth date were never leaked and you can ignore this notice.

What happened

About two weeks ago a MusicBrainz editor contacted us to say that an email address they used only at MusicBrainz had received spam. The user then changed it to a very distinct address in order to rule out a spammer guessing the updated one. But it happened again: the user received spam at the unguessable address.

At this point we began an audit of the MusicBrainz server codebase in an attempt to find out where the leak was, patch it as soon as possible, and discover who was affected by it.

What we found

On 2019-04-26 we released a new version of the MusicBrainz server and in this version we added email addresses to the list of editor data we pass to our server to build MusicBrainz pages. The goal of this was to display them in admin-facing pages to, ironically, be able to fight spammers who were using MusicBrainz as a spamming tool. We also added the editor’s birth date, to be able to congratulate them on their birthday. Neither of these cases should have ever been a problem, since the private data should only be used on pages built and sent from our own server (where the data cannot be seen by anyone else), and any editor info sent to the users’ browser goes through a “sanitizing” process eliminating all this private information.

After some digging, we discovered that due to a bug we had overlooked in the code that stripped this data, the addresses and dates had started being sent to the browser whenever an entity page with an annotation was requested. The email address and birth date of the last person to have edited an annotation in MusicBrainz (any annotation, attached to any of our entities) were leaked on the page for the entity in question. This data was contained in a massive block of JSON data in the page source and was never shown on the web page for humans to see, which is why this issue went undetected for so long.
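To illustrate the kind of sanitizing step described above, here is a minimal Python sketch, assuming an allow-list approach applied recursively so that editor records nested inside other data (such as an annotation) are also stripped. It is not the actual MusicBrainz code: the field names, the "entityType" marker and the sanitize function are all hypothetical.

```python
import json

# Hypothetical allow-list: only these editor fields may reach the browser.
PUBLIC_EDITOR_FIELDS = frozenset({"entityType", "id", "name"})


def sanitize(value):
    """Recursively copy `value`, stripping private fields from editor records."""
    if isinstance(value, dict):
        if value.get("entityType") == "editor":
            # Keep only allow-listed keys, so a newly added private field
            # is dropped by default instead of leaking by default.
            value = {k: v for k, v in value.items() if k in PUBLIC_EDITOR_FIELDS}
        return {k: sanitize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value


# Made-up page data: the editor record is nested inside an annotation.
page_props = {
    "entity": {
        "entityType": "artist",
        "name": "Example Artist",
        "latest_annotation": {
            "text": "Some notes",
            "editor": {
                "entityType": "editor",
                "id": 1,
                "name": "example_editor",
                "email": "editor@example.org",
                "birth_date": "1990-01-01",
            },
        },
    }
}

browser_json = json.dumps(sanitize(page_props))
```

An allow-list is used here rather than a deny-list because it fails safe: a field that nobody remembered to classify is simply never sent.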

Who was affected

We looked at all editors who wrote any annotations that were displayed between the date the problematic code was released and the date the bug was fixed. This can mean either the annotation was written during this time period, or it was written before that but (being the latest version of the annotation for the entity) it was still displayed during this time period. This gave us a total of 17,644 editors whose data was at some point visible from the JSON block in at least one entity’s source code. Sadly, we have no way of knowing for sure how many of the affected editors’ details were actually found and stored by spammers, since we attempt to block botnets as much as possible. As such, we simply cannot tell who was really affected by this leak, only who might have been.
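The selection rule above can be sketched in Python as an interval-overlap check. This is an illustration only, not the query we actually ran: the input format (entity, editor, creation date) and the function name are assumptions, and the sample data is made up. Each annotation version is treated as displayed from its creation until the next version for the same entity replaces it.

```python
from datetime import date

# Dates taken from this post: release of the problematic code and the hotfix.
LEAK_START = date(2019, 4, 26)
LEAK_END = date(2020, 11, 22)


def affected_editors(annotations):
    """Return the editors whose annotation was displayed during the leak window.

    `annotations` is an iterable of (entity_id, editor_id, created) tuples
    (a hypothetical format chosen for this sketch).
    """
    by_entity = {}
    for entity_id, editor_id, created in annotations:
        by_entity.setdefault(entity_id, []).append((created, editor_id))

    affected = set()
    for versions in by_entity.values():
        versions.sort()  # oldest first
        for i, (created, editor_id) in enumerate(versions):
            # A version stops being displayed when the next one is created.
            superseded = versions[i + 1][0] if i + 1 < len(versions) else date.max
            if created <= LEAK_END and superseded >= LEAK_START:
                affected.add(editor_id)
    return affected


sample = [
    ("artist-1", 10, date(2015, 1, 1)),   # replaced in 2018, before the leak
    ("artist-1", 11, date(2018, 6, 1)),   # older, but still displayed during the leak
    ("artist-2", 12, date(2020, 1, 1)),   # written during the leak
    ("artist-3", 13, date(2020, 12, 1)),  # written after the fix
]
result = affected_editors(sample)
```

With this sample data, editors 11 and 12 are selected, while 10 and 13 are not.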

What we’ve done

Once we detected the issue on November 22, we immediately put out a hotfix to all production (and beta) servers plugging the leak. The hotfix acted to sanitize the editor data by removing email addresses and birth dates from the JSON. We also deployed two additional changes that should help prevent similar issues from occurring, by avoiding sending sensitive editor data to our template renderer altogether. See all changes from the git tag v-2020-11-22-hotfix.

We are planning to improve our testing infrastructure to detect exposure of editor data — this will become a routine part of our continuous integration process. We are also going to ensure that any pull request dealing with editor data goes through a strict testing checklist.
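As a rough idea of what such an automated check could look like, here is a minimal Python sketch that scans a rendered page’s source for anything resembling an email address, plus any known private values of a test editor. It is an assumption about one possible approach, not our actual test suite; the function name, the simplified regular expression and the sample pages are hypothetical.

```python
import re

# Deliberately simple pattern; good enough to flag obvious leaks in a test.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_private_data(page_source, private_values=()):
    """Return the set of suspicious strings found in `page_source`."""
    hits = set(EMAIL_RE.findall(page_source))
    # Also look for exact private values of the test editor (e.g. birth date).
    hits.update(v for v in private_values if v in page_source)
    return hits


# Made-up page sources for a test editor with known private data.
leaky_page = (
    '<script>{"editor": {"name": "example_editor", '
    '"email": "editor@example.org", "birth_date": "1990-01-01"}}</script>'
)
clean_page = '<script>{"editor": {"name": "example_editor"}}</script>'

leaky_hits = find_private_data(leaky_page, ["1990-01-01"])
clean_hits = find_private_data(clean_page, ["1990-01-01"])
```

A continuous integration job could render a set of entity pages for a test editor and fail the build whenever the returned set is not empty. Pages that legitimately show a contact address would need to be excluded from such a check.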

How did spammers get these email addresses?

You might be wondering how such an obscure leak in a web page can result in spammers finding and using your email address. You’re not alone.

Our sites are under near constant traffic from seemingly random internet bots fetching thousands of our pages in a day, with no apparent goal. All of our metadata is available for download, so why would someone download pages from us at random?

Well, we now know — web pages can contain a whole host of random data that shouldn’t be there. Email addresses, birth dates and such are just the starting point — there have been websites that have leaked credit card numbers and even login passwords, possibly compromising the integrity of user accounts.

In this case it appears that a botnet kept downloading pages from musicbrainz.org and driving the load on our servers up. We’ve been trying to block botnets ever since they came into existence, but this is a laborious task that is never complete.

It appears that spammers used the botnet to scour the internet for private data such as emails to then send out lovely spam emails to all compromised users.

Summary

We would like to wholeheartedly apologize for this data leak. We take data privacy seriously and we aim to have high standards about privacy and data security. We find ourselves frustrated by the endless data leaks that happen on the Internet on a seemingly continuous basis and work hard to avoid committing these mistakes in our domain. However, we’re also human and we do make mistakes periodically. As explained above, we’re working to improve our systems and processes in order to prevent this from happening again.

We hope that you accept our most sincere apologies for this leak.

Robert Kaye, Michael Wiencek, Nicolás Tamargo and Yvan Rivierre

Reminder: Upgrading to PostgreSQL 12 on May 18, 2020

As we announced in February, in two weeks’ time (May 18, 2020) we’ll be upgrading our production database server to PostgreSQL v12 (from v9.5). At the same time, v12 will become the minimum supported version for MusicBrainz Server, so we ask that you upgrade afterwards as soon as possible! If you’re still unsure, a Q&A is below.

When do I need to upgrade my postgres by?

As soon as possible after May 18 if you’d like to keep your musicbrainz-server code up to date.

How do I perform the upgrade?

We’ll provide instructions closer to May 18. It’s recommended that you don’t upgrade until then, since we’ll be providing scripts to resolve some issues.

Will the live data feed (replication packets) stop working right away if I don’t upgrade?

No, as long as you keep your musicbrainz-server code checkout on the v-2020-05-11 tag (which will be the final release before May 18) or earlier. Future releases may work for a while too.

This is not a schema change release, so replication will continue to work smoothly until you upgrade. No tables or views will change.

However, to make the upgrade process smoother we’ll be dropping the musicbrainz-collate and musicbrainz-unaccent extensions, instead using PG’s builtin collation support for the former and replacing the latter with the unaccent extension from postgresql-contrib. A few SQL functions are being added to enable this, and some indexes need to be rebuilt. This will all happen as part of upgrade scripts we provide (or you can import from scratch). Some features of musicbrainz-server that use these old extensions may cease to work if you don’t apply them.

The extension changes above don’t actually make use of any new PG 12 features. We’ll avoid using such features for at least 1 month.

If I’m already running PostgreSQL 12, do I need to do anything?

Yes, but things will be easier for you. As mentioned in the previous answer, we’ll be dropping the musicbrainz-collate and musicbrainz-unaccent extensions to make the upgrade process smoother for pre-v12 instances. So you’ll only have to run some upgrade scripts we provide to replace those extensions and rebuild some indexes.

My host/distribution doesn’t have PostgreSQL 12 yet!

If you’re running Debian or Ubuntu, the PGDG maintains an APT repository with the latest versions. These are the same packages MetaBrainz uses in production.

Amazon RDS has supported PostgreSQL 12 since March 31.

I absolutely cannot upgrade yet! What should I do?

You can stay on the v-2020-05-11 release of musicbrainz-server (or earlier) until you are able to upgrade. Replication packets (i.e. the live data feed) will continue to work on that tag until the next schema change, but you’ll have upgraded to v12 by then, right?

Instead of performing a pg_upgrade and running these upgrade scripts you mentioned, can I just import fresh data dumps into a new v12 cluster?

Of course. Just make sure your musicbrainz-server git checkout is on the v-2020-05-18 tag (once that’s released) or later before performing the import. And keep in mind it may be slower than a direct upgrade.

Upgrading Postgres instead of schema change: 18 May, 2020

Hello!

We’ve long procrastinated upgrading our production Postgres installation, so we’ve decided to forgo a schema change upgrade and instead upgrade Postgres to version 12.x. (We will migrate to whatever the latest stable version in the 12.x series is at the time.)

This means that on 18 May we will not make any changes to the MusicBrainz schema, but we will have some amount of downtime and/or read-only time while we upgrade Postgres on our production servers. We haven’t sorted out all of the exact details of how we will carry out this database upgrade, but the date is now confirmed.

If you operate a replicated instance of the MusicBrainz database we STRONGLY urge you to upgrade your installation shortly after we upgrade the production servers. After this release our team may start using Postgres features not available in Postgres 9.5.x, which is our current production version.

As usual for our releases that impact our downstream users, we will post many more details closer to the date and once the migration is complete, we will post detailed instructions on how you can upgrade your own installation.

Please post any questions you may have!

Thanks!

Please nominate us for the Open Publishing Awards!

We’ve recently found out about the Open Publishing Awards:

The goal of the inaugural Open Publishing Awards is to promote and celebrate a wide variety of open projects in Publishing.

All content types emanating from the Publishing sector are eligible including Open Access articles, open monographs, Open Educational Resource Materials, open data, open textbooks etc.

Open data? That’s us! We’ve got a pile of it and if you like the work we do, why not nominate us for an award?

Thanks!

MusicBrainz Schema change upgrade downtime: 17:00 UTC (10am PDT, 1pm EDT, 19:00 CEST)

Hi!

At 17:00 UTC (10am PDT, 1pm EDT, 19:00 CEST) we will start the process of our schema change release. The exact time that we plan to start the change will depend on how long it takes to finish our preparations, but we expect it to be shortly after 17:00 UTC.

Once we start the process we will put a banner notification on musicbrainz.org and we will also post updates to the @MusicBrainz twitter account, so follow us there for more details.

After the release is complete, we will post instructions here on how to upgrade your replicated MusicBrainz instances.

FreeDB gateway: End of life notice, March 18, 2019

Many moons ago people clamoured for a way to use MusicBrainz via their old players that supported FreeDB (and similar services). The hope was that this would be a short-term solution while more players picked up MusicBrainz support, so we created the FreeDB gateway, which allowed old clients to use an ancient interface to look up CDs with MusicBrainz.

We’ve been maintaining this gateway for over 11 years now, and recently a user asked us questions about their new music player, which they were having a hard time getting to work with our FreeDB gateway.

Wait, what? Someone is developing specifically for a stop-gap measure? Clearly the goal of FreeDB gateway has been misconceived and people are not treating it as a gateway to using the proper MusicBrainz API endpoints.

We’re no longer keen on supporting this gateway and have been having trouble finding volunteers to maintain it. Our internal staff have enough to do with their own duties and have no interest in further maintaining it, and neither do I.

For these reasons the FreeDB gateway is going away in six months’ time: March 18 will be the absolute last day that the gateway will be functioning. Should something crash and the gateway experience problems before then, we’ll just kill the VM that the gateway is running on and call it a day.

11 years of temporary is enough — if you use this service, migrate to a proper MusicBrainz endpoint right now!

AcousticBrainz migration: We’re on!

We had to postpone the migration of AcousticBrainz last week since we ran out of time (our database is getting to be sizable!). We’ve migrated the bulk of our data and are now ready to move the last bits and call the move complete.

Downtime will start very soon — follow us on twitter for more detailed updates.

AcousticBrainz downtime: Migrating hosting to our other servers

Today we’re going to migrate the AcousticBrainz service from the standalone server that we’ve rented for the past few years to our shared infrastructure at Hetzner. We’ve been prepping for this move for a few weeks now and the actual process to follow has been used before, so we don’t expect the downtime to be more than 1 hour.

We’re sorry for the downtime that will be coming — to keep up with what we’re doing, please follow our progress on Twitter. We hope to start the migration in the next hour or two from when this blog post goes up.

 

GDPR compliance

The General Data Protection Regulation is a complex EU regulation that stipulates many points for protecting private data of users on the Internet. Even though this is an EU regulation, it has a worldwide impact due to the nature of the Internet. This regulation comes into effect today, May 25, 2018 and is the reason why so many companies have sent you mail in the past few weeks about updating their privacy policies.

The MetaBrainz Foundation with its collection of projects is also affected by this regulation. We’ve been learning and adapting our sites to be compliant with the regulation – sadly this regulation isn’t entirely black and white and there is an incredible amount of room left for interpretation of these rules.

The good news is that this regulation is roughly in line with our established practices: we’ve always held private information in high regard and applied to ourselves the same sort of rules that we wish to have applied to our own private data. Luckily, this makes our compliance effort considerably easier. We’ve made two significant changes to how we treat your data and also adopted the terminology used in the GDPR in order to use the same language that many other sites are now adopting. Please keep reading to find out the exact details of what we are doing to comply.

However, we do ask for your compassion and help in our process of complying with the GDPR. As we already mentioned, the GDPR is a complex set of rules that are not fully clarified yet. We’ve taken action on the steps that are clear to us and we’re following ongoing conversations on points that are in gray zones or unclear to us. We’ve made our best initial effort on compliance and promise to keep working on it as the picture becomes more clear. If you believe that we could improve our compliance, please contact us and let us know what we can do to improve. It would also help us if you could provide concrete discussion or examples to help us understand and take action on your suggestion.

Finally, below is the link to our GDPR compliance statement, implementing the regulations as we understand them and how they affect your data in our ecosystem. Where possible, we provide links for deeper understanding, links for you to examine our relevant code and links to tickets to follow the process of improving our compliance.

MetaBrainz’ GDPR Compliance Statement