Incident report: January 27th service outage

On January 27th, starting at 4:31 UTC, we were hit with increasing amounts of traffic from what appeared to be hundreds of different IP addresses, most of them belonging to Amazon Web Services (AWS). At 8:46 UTC the inbound traffic overwhelmed our systems and brought nearly all of our services to a standstill.

After investigating the situation and receiving no meaningful assistance from Hetzner (our ISP, which advertises DDoS mitigation as part of its offerings), we blocked three subnets of IP addresses in order to restore our services. We put the block in place at 13:14 UTC and our services started recovering.
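We won’t publish the exact subnets here, but as an illustration of the kind of check a block like this boils down to, here is a minimal Python sketch; the CIDR ranges below are placeholder documentation networks, not the addresses we actually blocked:

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation networks), not the real blocked subnets.
BLOCKED_SUBNETS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def is_blocked(addr: str) -> bool:
    """Return True if the address falls inside any blocked subnet."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_SUBNETS)

print(is_blocked("203.0.113.42"))  # True
print(is_blocked("127.0.0.1"))     # False
```

In practice a block like this lives in the firewall rather than in application code, but the membership test is the same idea.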

We reported the issue to Hetzner and to AWS shortly after restoring our services. The next morning we received a friendly email from Andy, who works at Plex, one of our supporters, stating that they had received a complaint from AWS. What happened next, and how the matter was resolved, is best told by Andy himself:

Overnight on Wednesday, first thing Thursday morning, we received an abuse report that our servers were flooding an IP that corresponded to musicbrainz.org. We scrambled to investigate, as we are happy MusicBrainz partners, but it was a strange report because we run our MusicBrainz server and hit that instance rather than communicating with musicbrainz.org directly. And the IPs mentioned were specifically related to our metadata servers, not the IPs that would be receiving data updates from upstream.

Just as some of our key engineering team members were starting to wake up and scrub in, it started to seem that it was a coincidence and we weren’t the actual source of the traffic and had simply been caught up in an overeager blocking of a large IP range to get the services back up. Never trust a coincidence.

We continued to stay in touch with the team at MusicBrainz and within a few hours we had clear evidence that our IPs were the source of the traffic. We got the whole engineering team involved again to do some investigation, and we still couldn’t figure much out since we never make requests to musicbrainz.org and we had already worked to rule out the potential of any rogue access to our servers. By isolating our services and using our monitoring tools, we finally discovered that the issue was actually our traffic to coverartarchive.org, not musicbrainz.org, as they happen to be serviced by the same IP address. And this made much more sense, as we do depend on some API calls to the Cover Art Archive.

The root cause was an update to the Plex Media Server that had been released earlier in the week. There was a bug in that update that caused extra metadata requests to our own infrastructure. We had noticed the spike in our autoscaling to accommodate the extra traffic and had already put together another update to fix the bug. That extra traffic on our infrastructure also translated into a more modest increase in requests to some of our metadata partners, including the Cover Art Archive (CAA). While the fix was already rolling out to Plex Media Servers, this provided a good opportunity to evaluate the CAA traffic and put our own rate limit in place to protect against future issues. We wrapped up that change on Thursday afternoon.

Throughout the ordeal, we appreciated the communication back and forth with our partners at MusicBrainz so that we could work together to investigate and follow leads to find a timely resolution.

Andy from Plex
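Plex hasn’t shared the implementation details of the rate limit Andy mentions above, but for readers curious what a simple client-side limit can look like, here is a minimal token-bucket sketch in Python; the rate and burst numbers are made up for illustration:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: roughly `rate` requests per second,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(rate=5, capacity=10)  # illustrative numbers only
if limiter.allow():
    pass  # safe to make the upstream API call
else:
    pass  # back off and retry later
```

A limiter like this caps how fast a client can hit an upstream service even when a bug causes it to generate far more requests than intended.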

While this whole situation was very stressful and frustrating for us, in the end it was resolved through a friendly bit of technical detective work to identify and fix the issue. It is always nice when geeks talk to geeks to resolve problems and get services working again. Thank you to Andy and his team; let’s hope we can avoid an issue like this in the future.

We’d like to apologize for the trouble caused by our IP address block and for our services being unavailable for several hours.

EDIT: We should also mention that all of our services are served from a single gateway IP address, so coverartarchive.org and musicbrainz.org share the same IP address.
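If you’re curious, you can confirm the shared address yourself with a couple of DNS lookups, for example in Python (the exact address returned may of course change over time):

```python
import socket

# Both hostnames resolve to the same gateway address.
mb = socket.gethostbyname("musicbrainz.org")
caa = socket.gethostbyname("coverartarchive.org")
print(mb, caa, "same IP" if mb == caa else "different IPs")
```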

AcousticBrainz downtime: Migrating hosting to our other servers

Today we’re going to migrate the AcousticBrainz service from the standalone server we’ve rented for the past few years to our shared infrastructure at Hetzner. We’ve been preparing for this move for a few weeks now, and the migration process itself has been used before, so we don’t expect the downtime to last more than an hour.

We’re sorry for the upcoming downtime. To keep up with what we’re doing, please follow our progress on Twitter. We hope to start the migration within an hour or two of this blog post going up.