On January 27th, starting at 04:31 UTC, we were hit with increasing amounts of traffic from what appeared to be hundreds of different IP addresses, most of them belonging to Amazon Web Services (AWS). At 08:46 UTC the inbound traffic overwhelmed our systems and brought nearly all of our services to a standstill.
After investigating the situation and receiving no meaningful assistance from Hetzner (our ISP, which advertises DDoS mitigation services as part of its offerings), we blocked three subnets of IP addresses in order to restore our services. We put the block in place at 13:14 UTC and our services started recovering.
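The post doesn't say how the block was applied; on a Linux gateway, an emergency block of this kind is commonly done with iptables. A hedged sketch of what such a block might look like (the subnets below are RFC 5737 documentation placeholders, not the actual ranges that were blocked):

```shell
# Illustrative only: drop inbound traffic from three placeholder subnets.
# The real subnets blocked during the incident were not disclosed.
iptables -I INPUT -s 192.0.2.0/24 -j DROP
iptables -I INPUT -s 198.51.100.0/24 -j DROP
iptables -I INPUT -s 203.0.113.0/24 -j DROP

# Once the source of the traffic is fixed, remove the same rules:
iptables -D INPUT -s 192.0.2.0/24 -j DROP
iptables -D INPUT -s 198.51.100.0/24 -j DROP
iptables -D INPUT -s 203.0.113.0/24 -j DROP
```

A blunt block like this is a last resort: as this incident showed, dropping a large range can cut off legitimate partners who happen to share infrastructure with the offending traffic.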
We reported the issue to Hetzner and to AWS shortly after restoring our services. The next morning we received a friendly email from Andy, who works at Plex, one of our supporters, stating that they had received a complaint from AWS. What happened next, and how the matter was resolved, is told by Andy himself:
Overnight on Wednesday, first thing Thursday morning, we received an abuse report that our servers were flooding an IP address that corresponded to musicbrainz.org. We scrambled to investigate, as we are happy MusicBrainz partners, but it was a strange report: we run our own MusicBrainz server and hit that instance rather than communicating with musicbrainz.org directly, and the IPs mentioned were specifically related to our metadata servers, not the IPs that would be receiving data updates from upstream.

Just as some of our key engineering team members were starting to wake up and scrub in, it started to seem that it was a coincidence, that we weren't the actual source of the traffic and had simply been caught up in an overeager blocking of a large IP range to get the services back up. Never trust a coincidence. We continued to stay in touch with the team at MusicBrainz, and within a few hours we had clear evidence that our IPs were the source of the traffic.

We got the whole engineering team involved again to do some investigation, and we still couldn't figure much out, since we never make requests to musicbrainz.org and we had already worked to rule out the possibility of any rogue access to our servers. By isolating our services and using our monitoring tools, we finally discovered that the issue was actually our traffic to coverartarchive.org, not musicbrainz.org, as the two happen to be served by the same IP address. That made much more sense, as we do depend on some API calls to the Cover Art Archive.
The root cause was an update to the Plex Media Server that had been released earlier in the week. A bug in that update caused extra metadata requests to our own infrastructure. We had noticed the spike in our autoscaling as it accommodated the extra traffic and had already put together another update to fix the bug. That extra traffic on our infrastructure also translated into a more modest increase in requests to some of our metadata partners, including the CAA. While the fix was rolling out to Plex Media Servers, this provided a good opportunity to evaluate our CAA traffic and put our own rate limit in place to protect against future issues. We wrapped up that change on Thursday afternoon.
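Andy doesn't describe the rate limiter Plex deployed, but a client-side limit on outbound API calls is commonly implemented as a token bucket, which allows short bursts while capping the sustained request rate. A minimal sketch (class name and parameters are illustrative, not Plex's actual code):

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allows `rate` requests per second on
    average, with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens refilled per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed now, consuming one token."""
        now = self.clock()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `allow()` before each Cover Art Archive request and back off (or serve a cached result) when it returns False, so a client-side bug can no longer translate into unbounded upstream traffic.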
Throughout the ordeal, we appreciated the communication back and forth with our partners at MusicBrainz so that we could work together to investigate, follow leads and find a timely resolution.

Andy from Plex
While this whole situation was very stressful and frustrating for us, in the end it was resolved through a friendly game of technical detective work to identify and fix the issue. It is always nice when geeks talk to geeks to resolve issues and get services working again. Thank you to Andy and his team — let's hope we can avoid an issue like this in the future.
We apologize for the trouble caused by our IP address block and for our services being unavailable for several hours.
EDIT: We should also mention that all of our services are served from a single gateway IP address, so coverartarchive.org and musicbrainz.org share the same IP address.
5 thoughts on “Incident report: January 27th service outage”
Interesting story, thanks for sharing.
What I don’t get:
If “the issue was actually our [PLEX] traffic to coverartarchive.org, not musicbrainz.org”, why should musicbrainz.org “be hit with increasing amounts of traffic from what appeared to be hundreds of different IP addresses mostly belonging to Amazon Web Service IP addresses”?
Because all of our traffic goes through a single gateway IP address, coverartarchive.org and musicbrainz.org share the same IP address. I’ll see about making the post clearer on that point.
> let’s hope we can avoid an issue like this in the future
Are you basically saying that you don’t have any DoS (not to mention DDoS) protection in place and your solution is hope?
That’s not what I said. The DDoS protection we have did not trigger because, we surmise, the traffic didn’t fit the pattern of typical attacks and its magnitude was not sufficient for Hetzner’s system to catch it.
A good philosophy with serious outages is to
(1) make sure the circumstances don’t happen again and
(2) ensure better handling if they do (or something similar does) happen again!
So, if such an occurrence happens tomorrow at 04:31, what will happen to resolve it, and at what time?
This logic should then be followed through assuming each supposed resolution proves unsuccessful in itself.
Also, at some stage, e.g. heavy users need to be automatically throttled or blocked from using the service.
If the initial resolution(s) is/are unsuccessful (for whatever reason), what will happen by 08:46, when all access is lost?
i.e. (from my user point of view) at some stage MB should have ensured there’s e.g. some kind of IP/server switch to a static (enquiry-only) copy (yesterday’s?) of the database elsewhere.
Amazon must have expertise in this area, from when they’re down.
“The gateway page of Amazon.com was offline to some customers for approximately 49 minutes. Other pages of the site were accessible and AWS was not impacted.” Amazon’s Ty Rogers