Downtime today for PostgreSQL / MusicBrainz schema change upgrade: 17:00 UTC (10am PT, 1pm ET, 7pm CEST)

Today (Monday, May 13) at 17:00 UTC (10am PT, 1pm ET, 7:00pm CEST), we’ll be:

  • Upgrading our production database server to PostgreSQL v16.
  • Performing the MusicBrainz schema change upgrade.

See the previous announcement on this topic for more information.

Expect MusicBrainz and the services that depend on its database (MetaBrainz, ListenBrainz, the Cover Art Archive, CritiqueBrainz, BookBrainz) to be down for about an hour, but we’ll be working to restore services as quickly as possible.

Afterward, we’ll post instructions here on how to upgrade your MusicBrainz mirror server (whether using musicbrainz-docker or otherwise).

P.S. The previously announced upgrades to the MusicBrainz search engine are only just about to reach our beta website, and thus are postponed for mirrors too.

Incident report: January 27th service outage

On January 27th, starting at 4:31 UTC, we were hit with increasing amounts of traffic from what appeared to be hundreds of different IP addresses, most of them belonging to Amazon Web Services (AWS). At 8:46 UTC the inbound traffic overwhelmed our systems and brought nearly all of our services to a standstill.

After investigating the situation and receiving no meaningful assistance from Hetzner (our ISP, which advertises DDoS mitigation as part of its offerings), we blocked three subnets of IP addresses in order to restore our services. At 13:14 UTC we put the block in place and our services started recovering.
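
To illustrate why a subnet-level block is such a blunt instrument, here is a minimal sketch of how such a block catches clients: any address inside the listed ranges is refused, whether or not it was involved in the attack. The CIDR ranges below are reserved documentation networks, not the actual subnets we blocked.

    import ipaddress

    # Placeholder CIDR blocks (reserved documentation ranges), NOT the real
    # subnets we blocked.
    BLOCKED_SUBNETS = [
        ipaddress.ip_network("192.0.2.0/24"),     # TEST-NET-1
        ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2
        ipaddress.ip_network("203.0.113.0/24"),   # TEST-NET-3
    ]

    def is_blocked(client_ip: str) -> bool:
        """Return True if client_ip falls inside any blocked subnet."""
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED_SUBNETS)

    print(is_blocked("203.0.113.42"))  # True: swept up by the /24 block
    print(is_blocked("198.18.0.1"))    # False: outside every blocked range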

We reported the issue to Hetzner and to AWS shortly after restoring our services. The next morning we received a friendly email from Andy, who works at Plex, one of our supporters, saying that they had received a complaint from AWS. What happened next, and how this matter was resolved, is best told by Andy himself:

Overnight on Wednesday, first thing Thursday morning, we received an abuse report that our servers were flooding an IP that corresponded to musicbrainz.org. We scrambled to investigate, as we are happy MusicBrainz partners, but it was a strange report because we run our MusicBrainz server and hit that instance rather than communicating with musicbrainz.org directly. And the IPs mentioned were specifically related to our metadata servers, not the IPs that would be receiving data updates from upstream.

Just as some of our key engineering team members were starting to wake up and scrub in, it started to seem that it was a coincidence and we weren’t the actual source of the traffic and had simply been caught up in an overeager blocking of a large IP range to get the services back up. Never trust a coincidence.

We continued to stay in touch with the team at MusicBrainz and within a few hours we had clear evidence that our IPs were the source of the traffic. We got the whole engineering team involved again to do some investigation, and we still couldn’t figure much out since we never make requests to musicbrainz.org and we had already worked to rule out the potential of any rogue access to our servers. By isolating our services and using our monitoring tools, we finally discovered that the issue was actually our traffic to coverartarchive.org, not musicbrainz.org, as they happen to be serviced by the same IP address. And this made much more sense, as we do depend on some API calls to the Cover Art Archive.

The root cause was an update to the Plex Media Server which had been released earlier in the week. There was a bug in that update that caused extra metadata requests to our own infrastructure. We had noticed the spike in our autoscaling to accommodate the extra traffic and already put together another update to fix the bug. That extra traffic on our infrastructure also translated to a more modest increase in requests to some of our metadata partners, including CAA. While the fix was already rolling out to Plex Media Servers, this provided a good opportunity to evaluate the CAA traffic and put our own rate limit in place to protect against future issues. We wrapped up that change in the afternoon on Thursday.

Throughout the ordeal, we appreciated the communication back and forth with our partners at MusicBrainz so that we could work together to investigate and follow leads to find a timely resolution.

Andy from Plex
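
Andy mentions that Plex put their own rate limit in place for Cover Art Archive traffic. Their actual implementation is not public; as a rough illustration of the idea, a client-side token bucket for outbound API calls might look like the sketch below (all names and numbers are ours, not Plex’s):

    import time

    class TokenBucket:
        """Client-side limiter: refills `rate` tokens/second up to `burst`."""

        def __init__(self, rate: float, burst: int):
            self.rate = rate
            self.capacity = burst
            self.tokens = float(burst)
            self.last = time.monotonic()

        def acquire(self) -> None:
            """Block until a token is available, then consume it."""
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                time.sleep((1 - self.tokens) / self.rate)

    # Hypothetical policy: at most 5 requests/second to coverartarchive.org,
    # with bursts of up to 10.
    caa_limiter = TokenBucket(rate=5, burst=10)
    # caa_limiter.acquire()  # call before each outbound request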

While this whole situation was very stressful and frustrating for us, in the end it was resolved through a very friendly game of technical detective work to identify and fix the issue. It is always nice when geeks talk to geeks to resolve issues and get services working again. Thank you to Andy and his team; let’s hope we can avoid an issue like this in the future.

We apologize for the trouble caused by our IP address block and for our services being unavailable for several hours.

EDIT: We should also mention that all of our services are served from a single gateway IP address, so coverartarchive.org and musicbrainz.org share the same IP address.
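
You can check this shared-gateway setup yourself by resolving both hostnames, for instance with a few lines of standard-library Python (the result reflects our hosting setup at the time you run it):

    import socket

    mb = socket.gethostbyname("musicbrainz.org")
    caa = socket.gethostbyname("coverartarchive.org")
    # At the time of this incident, both names pointed at one gateway IP.
    print(mb, caa, "same IP" if mb == caa else "different IPs")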

State of the Brainz: 2019 MetaBrainz Summit highlights

The 2019 MetaBrainz Summit took place on 27–29 September 2019 in Barcelona, Spain, at the MetaBrainz HQ. The Summit is a chance for MetaBrainz staff and the community to gather and plan ahead for the next year. This report is a recap of what was discussed and what lies ahead for the community.

Google donates $10,000 in cloud computing credits. Thank you!

The Google Open Source Programs Office continues to support MetaBrainz in a number of ways; most recently, they donated $10,000 in credit toward their cloud services. Thank you, Google!

This credit allows us to run some services in the cloud to round out our primary hosting setup, giving us some redundancy and letting us avoid keeping all of our critical eggs in one basket. We can also give our open source developers virtual machines from time to time, since a lot of our projects are very data heavy. Having access to a fat VM can sometimes turn a really frustrating project that makes your laptop melt into one that is satisfying to watch chug along.

Thank you again to Google, the Open Source Programs Office, and in particular Cat Allman!

AcousticBrainz migration: We’re on!

We had to postpone the migration of AcousticBrainz last week since we ran out of time (our database is getting to be sizable!). We’ve migrated the bulk of our data and are now ready to move the last bits and call the move complete.

Downtime will start very soon; follow us on Twitter for more detailed updates.

AcousticBrainz downtime: Migrating hosting to our other servers

Today we’re going to migrate the AcousticBrainz service from the standalone server we’ve rented for the past few years to our shared infrastructure at Hetzner. We’ve been prepping for this move for a few weeks now, and the process we’ll follow has been used before, so we don’t expect the downtime to exceed 1 hour.

We’re sorry for the downtime that will be coming — to keep up with what we’re doing, please follow our progress on Twitter. We hope to start the migration in the next hour or two from when this blog post goes up.


Move to NewHost and Replication Update

It has been a long week since our move to the new hosting provider in Germany. Our move across the Atlantic worked out fairly well in the grand scheme of things. The new servers are performing well, the site is more stable and we have a modern infrastructure for most of our projects.

However, such moves are not without problems. While we didn’t encounter many, the most significant was our failure to copy two small replication packets off the old servers. We didn’t notice this problem until after the server in question had been decommissioned. Oops.

And thus began a recovery effort almost worthy of a bad Hollywood B-movie plot. Between my traveling and the team finishing the most critical migration bits, it took 2 days for us to realize the problem and find a volunteer to fetch the drives from the broken server. Only in a small, wealthy place such as San Luis Obispo could a stack of recycled servers sit in an open container for 2 days and not be touched at all. My friend collected the drives and immediately noticed that they had been damaged in the recycling process, which isn’t surprising. We can consider ourselves really lucky that this drive didn’t contain private data; drives that do are physically destroyed!

Since then, my friend has been working with Linux disk recovery tools to try and recover the two replication packets off the drive. Given that he is working with a 1TB drive, this recovery process takes a while and must be fully completed before attempting to pull data off the drive. For now we wait.

At the same time, we’re actively cobbling together a method to regenerate the lost packets. In theory it is possible, but it involves heroic efforts of stupidity. We’re expending that effort, but so far it has borne no fruit.

In the meantime, for all of the people who use our replicated (Live Data) feed — you have the following choices:

  1. If you need data updates flowing again as soon as possible, we strongly recommend importing a new data set. We have a new data dump and fresh replication packets being put out, so you can do this whenever you’re ready (see the verification sketch after this list).
  2. If the need for updates is not urgent yet and you’d rather not reload the data, sit tight. We’re continuing our stupidly heroic efforts to recover the replication packets.
  3. Chocolate: It really makes everything better. It may not help with your data problems, but at least it takes the edge off.
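
For option 1, it is worth verifying the dump you download before importing it. Here is a rough sketch of such a check, assuming you have fetched a dump file and its accompanying MD5SUMS file into the current directory; the filenames are illustrative, and the two-column MD5SUMS layout is our assumption about what you will see.

    import hashlib

    DUMP_FILE = "mbdump.tar.bz2"  # illustrative name; use the file you fetched
    SUMS_FILE = "MD5SUMS"

    def md5_of(path: str) -> str:
        """Stream the file through MD5 so even large dumps fit in memory."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Each MD5SUMS line is expected to look like: "<hexdigest>  <filename>"
    expected = {}
    with open(SUMS_FILE) as f:
        for line in f:
            digest, name = line.split()
            expected[name] = digest

    actual = md5_of(DUMP_FILE)
    assert actual == expected[DUMP_FILE], f"checksum mismatch for {DUMP_FILE}"
    print(f"{DUMP_FILE} OK: {actual}")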

We’re terribly sorry for the hassle in all of this! Our geek pride has been sufficiently dinged that our chocolate coping mechanisms will surely cause us to put on a pound or two.

Stay tuned!

UPDATE 1: The first recovery examination did not locate the files, but my friend will do a second pass tomorrow and turn over any file fragments that might allow us to reconstruct them. That won’t be for another 8 hours or so.

We’re actually really going to take the HTTPS plunge!

Nearly three years after stating that “We’re going to take the HTTPS plunge!”, we’re actually, really going to do it now. 🙂

Most of our sites have forced HTTPS for some time (metabrainz.org, critiquebrainz.org, bookbrainz.org, listenbrainz.org), but there are still a couple of stragglers, notably musicbrainz.org and acousticbrainz.org.

For MusicBrainz, our beta site is now all HTTPS, web service and all. The main, non-beta musicbrainz.org will be going HTTPS-only, except for what’s under /ws/ (i.e., the web service), to give taggers and other programs not currently using HTTPS some transition time. We do not currently have an ETA for the final jump to HTTPS-only on the MusicBrainz web service, as that partly depends on feedback from our web service users, which leads me to:

If you’re currently using the MusicBrainz web service, please try switching your program to beta.musicbrainz.org and let us know whether it still works. We are aware that some Python versions and MusicBrainz libraries do not support our setup, so if your program fails now, it may simply be because its dependencies haven’t been updated yet, and you might not need to change anything on your end. However, some programs and libraries might need updates, so the more people test and report back, the better we’ll be able to judge when we can go all-HTTPS-only on musicbrainz.org.
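
As a starting point, a minimal standard-library check that your Python environment can complete the TLS handshake with the beta web service might look like this (the artist MBID is just an example lookup, and remember that MusicBrainz asks clients to send an identifying User-Agent):

    import json
    import urllib.request

    # Example lookup; any valid MBID will do.
    URL = ("https://beta.musicbrainz.org/ws/2/artist/"
           "5b11f4ce-a62d-471e-81fc-a69a8278c7da?fmt=json")

    req = urllib.request.Request(
        URL,
        headers={"User-Agent": "https-transition-test/0.1 (you@example.com)"},
    )
    # urlopen raises an ssl error here if certificate validation fails.
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["name"])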

For AcousticBrainz, we now have a shiny new Let’s Encrypt certificate on https://acousticbrainz.org thanks to our systems administrator Zas! As a result, we are going to start redirecting all HTTP traffic to HTTPS on the AcousticBrainz website, including API queries.

In order to give everyone time to verify that their scripts correctly recognise and validate our Let’s Encrypt certificate, we are going to delay the redirect until July 1, 2016. On this date, any HTTP query will automatically be redirected to HTTPS. We will also enable HSTS, so that compliant browsers will redirect to HTTPS on the client-side.
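
If you want to confirm before July 1 that your environment trusts the new certificate, a small standard-library probe like the following will fail loudly if validation breaks (a sketch; the exact fields returned vary by Python version):

    import socket
    import ssl

    HOST = "acousticbrainz.org"

    # create_default_context() verifies the chain against the system trust
    # store and checks the hostname; wrap_socket raises if either fails.
    context = ssl.create_default_context()
    with socket.create_connection((HOST, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            cert = tls.getpeercert()
            issuer = dict(item[0] for item in cert["issuer"])
            print("issuer:", issuer.get("organizationName"))
            print("expires:", cert["notAfter"])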

If you have any questions about either the MusicBrainz or the AcousticBrainz transition, please ask.

Laurent Monin joins the team as a part-time sysadmin

For the first time in a number of years, we have a person responsible for system administration! Over the past few years we’ve been trying to spread server maintenance duties among our developers. That only worked so well, and the duties have been piling up and going unattended.

With the introduction of our new MetaBrainz site in May, we finally have a growing revenue stream, which allows us to hire a paid sysadmin. Hopefully we can now work through our backlog of tasks.

Laurent Monin (aka Zas) is no stranger to our project — he has been hacking on Picard for a number of years and he attended last year’s summit in Copenhagen. I’m quite happy to have found a community member and long-standing contributor to take on this task.

Some of the first tasks Laurent will take on come from direct feedback on our blog series about community improvements. We’re hoping to consolidate our mailing lists and forums into a Discourse instance and then provide single sign-on for Discourse, our wiki, and Jira: things we’ve talked about for years but never made any progress on.

I’m quite excited to have Zas on board! Welcome!