Incident report: January 27th service outage

On January 27th, starting at 4:31UTC we were hit with increasing amounts of traffic from what appeared to be hundreds of different IP addresses mostly belonging to Amazon Web Service IP addresses. At 8:46UTC the inbound traffic overwhelmed our systems and brought nearly all of our services to a standstill.

After investigating the situation and receiving no meaningful assistance from Hetzner (our ISP who advertises DDoS mitigation services as part of their offerings) we blocked three subnets of IP addresses in order to restore our services. At 13:14UTC we put the block in place and our services started recovering.

We reported the issue to Hetzner and to AWS shortly after restoring our service. The next morning we received a friendly email from Andy, who works for one of our supporters at Plex, stating that they received a complaint from AWS. What happened next and how this matter was resolved is told by Andy himself:

Overnight on Wednesday, first thing Thursday morning, we received an abuse report that our servers were flooding an IP that corresponded to musicbrainz.org. We scrambled to investigate, as we are happy MusicBrainz partners, but it was a strange report because we run our MusicBrainz server and hit that instance rather than communicating with musicbrainz.org directly. And the IPs mentioned were specifically related to our metadata servers, not the IPs that would be receiving data updates from upstream. Just as some of our key engineering team members were starting to wake up and scrub in, it started to seem that it was a coincidence and we weren’t the actual source of the traffic and had simply been caught up in an overeager blocking of a large IP range to get the services back up. Never trust a coincidence. We continued to stay in touch with the team at MusicBrainz and within a few hours we had clear evidence that our IPs were the source of the traffic. We got the whole engineering team involved again to do some investigation, and we still couldn’t figure much out since we never make requests to musicbrainz.org and we had already worked to rule out the potential of any rogue access to our servers. By isolating our services and using our monitoring tools, we finally discovered that the issue was actually our traffic to coverartarchive.org, not musicbrainz.org, as they happen to be serviced by the same IP address. And this made much more sense, as we do depend on some API calls to the Cover Art Archive.

The root cause was an update to the Plex Media Server which had been released earlier in the week. There was a bug in that update that caused extra metadata requests to our own infrastructure. We had noticed the spike in our autoscaling to accommodate the extra traffic and already put together another update to fix the bug. That extra traffic on our infrastructure also translated to a more modest increase in requests to some of our metadata partners, including CAA. While the fix was already rolling out to Plex Media Servers, this provided a good opportunity to evaluate the CAA traffic and put our own rate limit in place to protect against future issues. We wrapped up that change in the afternoon on Thursday.

Throughout the ordeal, we appreciated the communication back and forth with our partners at MusicBrainz so that we could work together to investigate and follow leads to find a timely resolution.

Andy from Plex

While this whole situation was very stressful and frustrating to us, in the end it was resolved by a very friendly and technical detective game to identify and resolve the issue. It is always nice when geeks talk to geeks to resolve issues and get services working again. Thank you to Andy and his team — let’s hope we can avoid an issue like this in the future.

We’d apologize for the trouble caused by our IP address block and for our services being unavailable for several hours.

EDIT: We should also mention that all of our services are served from one single gateway IP address, so coverartarchive.org and musicbrainz.org have the same IP address.

Reminder: Upgrading to PostgreSQL 12 on May 18, 2020

As we announced in February, in two weeks time (May 18, 2020) we’ll be upgrading our production database server to PostgreSQL v12 (from v9.5). At the same time, v12 will become the minimum supported version for MusicBrainz Server, so we ask that you upgrade afterwards as soon as possible! If you’re still unsure, a Q&A is below.

When do I need to upgrade my postgres by?

As soon as possible after May 18 if you’d like to keep your musicbrainz-server code up to date.

How do I perform the upgrade?

We’ll provide instructions closer to May 18. It’s recommended that you don’t upgrade until then, since we’ll be providing scripts to resolve some issues.

Will the live data feed (replication packets) stop working right away if I don’t upgrade?

No, as long as you keep your musicbrainz-server code checkout on the v-2020-05-11 tag (which will be the final release before May 18) or earlier. Future releases may work for a while too.

This is not a schema change release, so replication will continue to work smoothly until you upgrade. No tables or views will change.

However, to make the upgrade process smoother we’ll be dropping the musicbrainz-collate and musicbrainz-unaccent extensions, instead using PG’s builtin collation support for the former and replacing the latter with the unaccent extension from postgresql-contrib. A few SQL functions are being added to enable this, and some indexes need to be rebuilt. This will all happen as part of upgrade scripts we provide (or you can import from scratch). Some features of musicbrainz-server that use these old extensions may cease to work if you don’t apply them.

The extension changes above don’t actually make use of any new PG 12 features. We’ll avoid using such features for at least 1 month.

If I’m already running PostgreSQL 12, do I need to do anything?

Yes, but things will be easier for you. As mentioned in the previous answer, we’ll be dropping the musicbrainz-collate and musicbrainz-unaccent extensions to make the upgrade process smoother for pre-v12 instances. So you’ll only have to run some upgrade scripts we provide to replace those extensions and rebuild some indexes.

My host/distribution doesn’t have PostgreSQL 12 yet!

If you’re running Debian or Ubuntu, the PGDG maintains an APT repository with the latest versions. These are the same packages MetaBrainz uses in production.

Amazon RDS supports PostgreSQL 12 since March 31.

I absolutely cannot upgrade yet! What should I do?

You can stay on the v-2020-05-11 release of musicbrainz-server or earlier until then. Replication packets (i.e. the live data feed) will continue to work until the next schema change on that tag, but you’ll have upgraded to v12 by then, right?

Instead of performing a pg_upgrade and running these upgrade scripts you mentioned, can I just import fresh data dumps into a new v12 cluster?

Of course. Just make sure your musicbrainz-server git checkout is on the v-2020-05-18 tag (once that’s released) or later before performing the import. And keep in mind it may be slower than a direct upgrade.

Upgrading Postgres instead of schema change: 18 May, 2020

Hello!

We’ve long procrastinated upgrading our production Postgres installation and we’ve decided to forego a schema change upgrade and instead upgrade Postgres to version 12.x. (We will migrate to whatever the latest stable version in the 12.x series will be).

This means that on 18 May we will not make any changes to the MusicBrainz schema, but  we will have some amount of down-time and/or read-only time while we upgrade Postgres on our production servers. We haven’t sorted out all of the exact details of how we will carry out this database upgrade, but the date is now confirmed.

If you operate a replicated instance of the MusicBrainz database we STRONGLY urge you to upgrade your installation shortly after we upgrade the production servers. After this release our team may start using Postgres features not available in Postgres 9.5.x, which is our current production version.

As usual for our releases that impact our downstream users, we will post many more details closer to the date and once the migration is complete, we will post detailed instructions on how you can upgrade your own installation.

Please post any questions you may have!

Thanks!

MusicBrainz Schema change upgrade downtime: 17:00 UTC (10am PST, 1pm EST, 19:00CEST)

Hi!

At 17:00 UTC (10am PST, 1pm EST, 19:00CEST) we will start the process of our schema change release. The exact time that we plan to start the change will depend on how long it takes to finish our preparations, but we expect it to be shortly after 17:00UTC.

Once we start the process we will put a banner notification on musicbrainz.org and we will also post updates to the @MusicBrainz twitter account, so follow us there for more details.

After the release is complete, we will post instructions here on how to upgrade your replicated MusicBrainz instances.

Freedb gateway: End of life notice, March 18, 2019

Many moons ago people clamoured for a way for them to use MusicBrainz via their old FreeDB (and others) enabled players. The hope was that this would be a short term solution as more players picked up MusicBrainz support so we created the FreeDB gateway that allowed old clients to use an ancient interface to look up CDs with MusicBrainz.

We’ve been maintaining this gateway for over 11 years now and recently we had a user asking questions about their new music player and that they were having a hard time getting it to work with our FreeDB gateway.

Wait, what? Someone is developing specifically for a stop-gap measure? Clearly the goal of FreeDB gateway has been misconceived and people are not treating it as a gateway to using the proper MusicBrainz API endpoints.

We’re no longer keen on supporting this gateway and have been having trouble finding volunteers to maintain it. Our internal staff has enough to do with our own duties and have no interest in further maintaining this and neither do I.

For these reasons the FreeDB gateway is going away in 6 months time; March 18 will be the absolute last day that the gateway will be functioning. Should something crash and the gateway experience problems before then, we’ll just kill the VM that the gateway is running on and call it a day.

11 years of temporary is enough — if you use this service, migrate to a proper MusicBrainz endpoint right now!

AcousticBrainz migration: We’re on!

We had to postpone the migration of AcousticBrainz last week since we ran out of time (our database is getting to be sizable!). We’ve migrated the bulk of our data and are now ready to move the last bits and call the move complete.

Downtime will start very soon — follow us on twitter for more detailed updates.

AcousticBrainz downtime: Migrating hosting to our other servers

Today we’re going to migrate the AcousticBrainz service from its standalone server that we’ve rented in the past few years to our shared infrastructure at Hetzner. We’ve been prepping for this move for a few weeks now and the actual process to follow has been used before, so we don’t expect the downtime to be more than 1 hour.

We’re sorry for the downtime that will be coming — to keep up with what we’re doing, please follow our progress on Twitter. We hope to start the migration in the next hour or two from when this blog post goes up.

 

The end of the replication nightmare!

I’m pleased to report that our nightmare of finding/reconstructing the missing replication packets is finally over!

Through many heroic hours of work, Bitmap and Chirlu have reconstructed the missing replication packets. All clients should now be on their way to being up to date. We’ve learned a number of lessons (some good, some bad — that’s life, right?) in this ordeal and we hope to avoid these issues in the future.

An integral part of this recovery process were a number of people from our community who helped us: Users mbcz, rembo10 and xeam sent us their complete DB dumps! Bitmap used these to sanity check and diff several other database to finally extract the missing packets. Thank you for dropping what you were doing and sending us a few GB of data over blazingly fast connections. Without you this would not have been possible; and this is not an exaggeration. Thank you!

After some more rest we’re going to continue to put out smaller fires that remain from the move to NewHost, but for now, the big fires are put out. Just in time for the weekend!

In the 11 year history of the replication stream we’ve had to have users restart their stream about 3-4 times because of problems on our end. Zero would’ve been nicer, but I’m proud that we’ve been able to make this system work for so long. On a daily basis we seem to have about 400 replicated copies of MusicBrainz running all over the world. Clearly this part of our service is well used and I sleep a little better at night knowing that our most critical data is backed up across the globe.

Just for fun, here is a graph of the replication API usage over the last 6 months:

hourly-api-usage.png

Towards the end the graph shows the week plus long break, then a small blip as some of our replicas got unstuck yesterday and the much larger spike shows the rest of the replicas getting unstuck. Now, as to what caused the blip in mid-October — I have no idea.

Anyways, please accept my apologies for the replication stream outage and keep replicating!

Thanks!