I’m pleased to report that our nightmare of finding/reconstructing the missing replication packets is finally over!
Through many heroic hours of work, Bitmap and Chirlu have reconstructed the missing replication packets. All clients should now be on their way to being up to date. We’ve learned a number of lessons (some good, some bad — that’s life, right?) in this ordeal and we hope to avoid these issues in the future.
An integral part of this recovery process was the help of a number of people from our community: users mbcz, rembo10 and xeam sent us their complete DB dumps! Bitmap used these to sanity-check and diff several other databases to finally extract the missing packets. Thank you for dropping what you were doing and sending us a few GB of data over blazingly fast connections. Without you this would not have been possible, and this is not an exaggeration. Thank you!
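For readers curious what diffing database dumps to recover changes can look like in principle, here is a minimal sketch. It is not the procedure Bitmap actually used: it assumes, purely for illustration, that two snapshots of one table have already been loaded into dictionaries keyed by primary key, and the name `diff_snapshots` is hypothetical.

```python
def diff_snapshots(before, after):
    """Return the row-level changes that turn `before` into `after`.

    Both arguments are hypothetical in-memory snapshots of one table,
    mapping primary key -> row tuple. Real dumps would need to be
    loaded and compared table by table.
    """
    # Keys present only in the newer snapshot are inserted rows.
    inserts = {k: after[k] for k in after.keys() - before.keys()}
    # Keys present only in the older snapshot are deleted rows.
    deletes = {k: before[k] for k in before.keys() - after.keys()}
    # Keys present in both, but with different contents, are updates.
    updates = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return inserts, updates, deletes


# Illustrative data only, not real MusicBrainz rows.
old = {1: ("a",), 2: ("b",), 3: ("c",)}
new = {1: ("a",), 2: ("B",), 4: ("d",)}
inserts, updates, deletes = diff_snapshots(old, new)
```

The three result sets together describe what changed between the two snapshots, which is roughly the kind of information a replication packet has to carry.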
After some more rest we’re going to continue to put out smaller fires that remain from the move to NewHost, but for now, the big fires are put out. Just in time for the weekend!
In the 11-year history of the replication stream we’ve had to have users restart their stream about 3-4 times because of problems on our end. Zero would’ve been nicer, but I’m proud that we’ve been able to make this system work for so long. On a daily basis we seem to have about 400 replicated copies of MusicBrainz running all over the world. Clearly this part of our service is well used and I sleep a little better at night knowing that our most critical data is backed up across the globe.
Just for fun, here is a graph of the replication API usage over the last 6 months:
Towards the end, the graph shows the week-plus-long break, then a small blip as some of our replicas got unstuck yesterday, and then a much larger spike as the rest of the replicas got unstuck. Now, as to what caused the blip in mid-October: I have no idea.
Anyways, please accept my apologies for the replication stream outage and keep replicating!
Thanks!
Keep up the great work! Thank you.
Would you mind sharing your “lessons learned”?
Mine personally:
1. Review every step of the game plan with every team member.
2. Identify very important steps.
3. Add steps to verify the important steps.
4. I should’ve kept at least one drive for safe-keeping off each server, rather than immediately recycling them.
Quite possibly #4 may never ever apply again. But still…
alexandria.io or ipfs could help resolve the problems