Final sprint towards NewHost

So tomorrow is the day when the old servers at Digital West will be put to rest. Our developers and system administrator have been hard at work over the last few weeks (and especially the last few days!) getting everything ready for a hopefully smooth transition to hosting everything at Hetzner in Germany.

The vast majority of our sites and services are in fact already being served from Germany, but the largest one—musicbrainz.org—remains in the US for another night.

Our transition of the final services, such as musicbrainz.org, has already started, but the very last sprint of moving the remaining bits over will start tomorrow, November 8th, at around 6 AM EST / noon UTC / 13:00 CET and will follow the plan laid out in this Google document:
https://docs.google.com/document/d/1MgqZ4hiKC0MZJ400ZCOD8JlcjBjh4GNpakauf51YKQQ/edit?usp=sharing

Expect some downtime, plenty of read-only time, and general wonkiness as we get the last gears set in place to hopefully keep us running smoothly for several years to come!

Massive connectivity issues

As you are probably aware, we’ve been having lots of network connectivity issues with all services hosted at Digital West in California (all of our projects, except ListenBrainz and AcousticBrainz).

Today we spent all morning trying to replace what we thought to be a faulty switch. That process didn’t go very well at all – we hit every conceivable issue that we could’ve hit. And a few more.

But, in this process we connected our gateway machines directly to our uplink (not through our switch) and the network issues persisted! After testing this setup with both of our machines, we’ve now conclusively eliminated all of our equipment as the possible source of trouble.

At this point our troubles lie in the hands of Digital West to fix. Thankfully the day staff will return to work in a few hours and hopefully we will make some progress on this issue then.

Sorry for all of this hassle. 🙁

Server capacity update

Zas and I have been working hard to improve the capacity and stability of the site. In the last week, we’ve identified and fixed at least 3 problems with the search servers and we’ve added a timeout that aborts queries that take longer than 3 seconds. We think the main cause of trouble was that queries piled up behind a slow query that ran too long; the servers never recovered from the backlog and consequently crashed.
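
To make the timeout idea concrete, here is a minimal Python sketch. It is not the code running on our search servers: the helper names (`run_with_timeout`, `slow_query`) and the pool size are made up for illustration, and only the 3 second limit comes from the text above. The point is that a caller stops waiting after the limit instead of queueing up behind a slow query.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

QUERY_TIMEOUT_SECONDS = 3.0  # the limit mentioned in the post

# Hypothetical worker pool; the real search servers are not Python threads.
_executor = ThreadPoolExecutor(max_workers=4)


def slow_query(seconds):
    """Stand-in for a search query that takes `seconds` to finish."""
    time.sleep(seconds)
    return "results"


def run_with_timeout(query, timeout=QUERY_TIMEOUT_SECONDS):
    """Run query() and return (True, result), or (False, None) on timeout."""
    future = _executor.submit(query)
    try:
        return True, future.result(timeout=timeout)
    except FutureTimeout:
        # cancel() only stops a job that has not started yet; a running
        # thread cannot be killed, so a real server needs its own abort hook.
        future.cancel()
        return False, None
```

Note the limitation in the comment: this sketch only stops the caller from waiting. Actually freeing the resources held by a runaway query has to happen inside the search server itself.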

We won’t go as far as saying that the search servers are fixed — every time we have a smidgen of hope that things are improving, they crash again. Seemingly out of spite! So, the search servers are better. 😉

Zas has also made a number of changes to the gateways and how we rate limit our incoming traffic. The rate limiting is now being done in a smarter way that reduces the overall traffic on our web servers. Well done!
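
This post does not go into how the gateways limit traffic, so the following is only a generic illustration of rate limiting, not a description of our setup. It sketches a token bucket, a common technique; the class name and the numbers are assumptions chosen for the example.

```python
class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`.

    Hypothetical illustration; not the gateway configuration.
    """

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate                # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)   # start with a full bucket
        self.last = now                 # time of the previous check

    def allow(self, now):
        """Return True if a request arriving at time `now` may pass."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `TokenBucket(rate=1, capacity=2)`, two requests at the same instant pass, a third is rejected, and one more passes a second later. Rejecting excess requests at the gateway is what keeps them from ever reaching the web servers.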

We’ve also increased our bandwidth budget by 4 Mbit/s, which makes the site feel considerably more responsive.

Let me put these improvements into numbers: About a week ago we were struggling to keep up with 250 requests per second and the site felt very sluggish. Now we can handle 500 requests a second and the site feels considerably faster. For large chunks of the day we are managing to handle all the traffic we should handle. And, the search servers haven’t crashed in 4 days!

We hope that this will give us a solid base from which to release the schema upgrade tomorrow. Then once that is complete, we will start work on moving to the new hosting company.

Thanks for being patient with us!

We’re actually really going to take the HTTPS plunge!

Closing in on three years after stating that “We’re going to take the HTTPS plunge!”, we’re actually really going to do it now. 🙂

Most of our sites have forced HTTPS for some time (metabrainz.org, critiquebrainz.org, bookbrainz.org, listenbrainz.org), but there are still a couple of stragglers, notably musicbrainz.org and acousticbrainz.org.

For MusicBrainz, our beta site is now all HTTPS, web service and all. The main, non-beta musicbrainz.org will be going HTTPS-only except for what’s under /ws/ (i.e., the web service) to allow taggers and other programs not currently using HTTPS some transition time. We do not currently have an ETA for when we will make the final jump to HTTPS-only on the MusicBrainz web service, as that partly depends on feedback from our web service users, which leads me to:

If you’re currently using the MusicBrainz web service, please try switching your program to beta.musicbrainz.org, see whether it breaks, and let us know how it goes. We are aware that some Python versions and MusicBrainz libraries do not support our setup. If your program fails now, it might simply be because its dependencies have not been updated yet, and you might not need to do anything on your end. However, some programs and libraries might need updates, so the more people test and report back, the better we’ll be able to judge when we can go HTTPS-only on musicbrainz.org.
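
If your program builds its web service URLs in one place, the switch for testing can be as small as the sketch below. The function name `to_beta_https` is made up for this example, and the `/ws/2/` path in the usage note is only an illustrative request; adapt both to whatever your program actually calls.

```python
from urllib.parse import urlsplit, urlunsplit

BETA_HOST = "beta.musicbrainz.org"


def to_beta_https(url):
    """Point a musicbrainz.org URL at the HTTPS beta site (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.hostname not in ("musicbrainz.org", BETA_HOST):
        raise ValueError(f"not a MusicBrainz URL: {url}")
    # Keep path, query and fragment; swap only the scheme and host.
    return urlunsplit(("https", BETA_HOST, parts.path, parts.query, parts.fragment))
```

For example, `to_beta_https("http://musicbrainz.org/ws/2/artist/?query=nirvana")` gives `https://beta.musicbrainz.org/ws/2/artist/?query=nirvana`. If requests to the rewritten URL fail with a certificate or TLS error, that is exactly the kind of report we are looking for.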

For AcousticBrainz, we now have a shiny new Let’s Encrypt certificate on https://acousticbrainz.org thanks to our systems administrator Zas! As a result, we are going to start redirecting all HTTP traffic to HTTPS on the AcousticBrainz website, including API queries.

In order to give everyone time to verify that their scripts correctly recognise and validate our Let’s Encrypt certificate, we are going to delay the redirect until July 1, 2016. On this date, any HTTP query will automatically be redirected to HTTPS. We will also enable HSTS, so that compliant browsers will redirect to HTTPS on the client-side.
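
For script authors who want to check what they get back once the redirect is live, here is a small sketch of the two things to look for: a redirect whose target is HTTPS, and a `Strict-Transport-Security` header. The helper names are invented for this example, and the header value in the usage note is hypothetical; we have not announced which `max-age` we will send.

```python
def is_https_redirect(status, location):
    """True if a response with this status and Location header redirects to HTTPS."""
    return status in (301, 302, 303, 307, 308) and location.lower().startswith("https://")


def parse_hsts(header_value):
    """Parse a Strict-Transport-Security header value into a small dict."""
    policy = {"max_age": None, "include_subdomains": False}
    for directive in header_value.split(";"):
        name, _, value = directive.strip().partition("=")
        name = name.strip().lower()
        if name == "max-age":
            # The value may be quoted, e.g. max-age="300"
            policy["max_age"] = int(value.strip().strip('"'))
        elif name == "includesubdomains":
            policy["include_subdomains"] = True
    return policy
```

For instance, `parse_hsts("max-age=31536000; includeSubDomains")` returns `{"max_age": 31536000, "include_subdomains": True}`. Most HTTP libraries follow the redirect for you, but they can only do so if they accept our certificate, so that is the part worth testing before the cut-over date.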

If you have any questions about either the MusicBrainz or the AcousticBrainz transition, please ask.

State of the Onion: MetaBrainz

In the past few weeks we’ve been hit with several traffic increases to MusicBrainz which are putting considerably more strain on our aging infrastructure than we’re happy with. If it seems that we’re not doing anything about it, that is because we’ve been busy behind the scenes trying to keep things moving forward. This sometimes doesn’t leave us a lot of time to keep the public informed on our work. Hopefully this blog post will fix this in the short term:

In 2011 we started to make plans to move MusicBrainz hosting into the cloud, but then out of the blue we were donated a pile of machines. There were so many machines that I postponed the cloud plans and prepared the donated machines for service. That has carried us for 4+ years with almost no hardware cost, which was really great. The plan was to move to the cloud sometime around 2015, but then I spent most of 2014/2015 dealing with conflicts in the team, putting us seriously behind schedule while our hardware decayed.

On top of that, we’ve recently had some “bad luck”. We have had some disrespectful commercial customers hit us really hard and we had to find and block them. We have had unexpected traffic spikes and when trying to address these unexpected traffic spikes, we had two more machines fail on us. These were the donated machines that we kept in reserve for just this moment. The loss of two machines caught us short on capacity to handle the increased demands on our servers.

So, now we face the tough question: Do we buy expensive hardware that we might use for 6 months (~$5000) or do we try and save the money and tough it out? I’d rather not spend so much money on such short term use if we can avoid it. We’re going to try and move to a new hosting facility somewhere in the EU, since that is where most of our users are.

Moving to a new hosting facility has an incredible number of dependencies that Christina (our Biz Dev manager), Zas and I have been working through. It may not seem like we have a plan, but we do, and we’re incredibly busy trying to make the plan happen. To give you a taste of what we’re up against:

  1. We want to move our hosting to Europe and have a business presence in Europe in order to reduce the costs and inefficiencies of being a solely US based business. A lot of our traffic, customers and contractors are in the EU and it simply makes sense to have a presence here.
  2. To establish a presence in the EU I needed local help with the business matters as well as with researching and establishing an EU organization. So I needed to find a Biz Dev manager and that person is Christina.
  3. Once Christina was on board she researched which options suited us. Getting that process moving involved getting certified documents from California, board approval for spending funds to establish the organization, EU labor law research (and we needed to swap a board member, too!), hiring help to establish the org., and generally navigating the Spanish bureaucracy. (See this only slightly exaggerated short film for some clues of our ordeal.)
  4. Once the org. had been established we needed to convince the bank to open a bank account for us. The draconian US banking laws extend worldwide and the local bank had to ensure that they were not opening themselves up to thousands of $$$ in accounting hassles just to allow a tiny non-profit to open a bank account. We finally have a bank account and have started paying our contractors with it!
  5. At the same time we’re also working to set up an office for the growing team here in Barcelona. That required a byzantine process that had barely started when we signed the lease. Getting power, internet and water set up has taken a frustratingly long time. Had I known how long, I would have stayed at my co-working space for a while longer while addressing hosting issues.
  6. While Christina has been focused on the hardcore paperwork, Zas is keeping the site running, which itself requires many heroics. Zas and I have started planning the move to the EU hosting provider. We’ve got a 5-page document that collects some of the open questions and requirements around this process: https://docs.google.com/document/d/16KNm4KksNwz29Opk1aILOMtCmPIeXFuxxUjMoPT3th0/edit#heading=h.dpfvoz1idcro. Right now Zas and Bitmap are here in Barcelona and we’re going to work on establishing a formal plan for moving to the new hosting company. We’re currently comparing hosting company offerings – see what we’ve collected so far if you care to follow along. The amount of work required to make this happen is making my head hurt. (A special shoutout to KodeStar, lead developer of FanArt.tv, for providing a lot of useful feedback about our various options.)
  7. While Christina, Zas and I have our hands full, Bitmap and Gentlecat continue to release new features and work on the schema change. Not to mention all the contributions from Freso and Reosarevok to keep the community happy and polite while we deal with less than optimal site conditions. That said, I am really happy and proud of my team, trying to keep things running in sub-optimal conditions.

This is just a snapshot of everything that is happening behind the scenes that will culminate with the goal of moving to a new hosting company and being set up in the EU. And mind you, we’re doing this with a minuscule budget, trying to be careful about how we spend our money.

6 degrees of Vince Gill

I’m not sure that we’ve talked about this cool project yet, so I’ll catch up on that now. The new site Six Degrees of Vince Gill allows you to enter an artist name and see how many degrees of separation there are between your artist and Vince Gill. This project comes from Universal Music’s Nashville group. I’m happy to see our data get used in interesting ways like this!

Now, if you want to see someone relate to Vince Gill in seven degrees, have a look at how I relate to him. 🙂
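
I don’t know how the site computes its answer, but counting degrees of separation is a classic shortest-path problem, and a breadth-first search is one straightforward way to sketch it. Everything below is an assumption for illustration: the function name, the tiny graph, and the placeholder artist names are made up and do not reflect real relationships in our data.

```python
from collections import deque

# Hypothetical relationship graph: artist -> artists they are linked to.
EXAMPLE_GRAPH = {
    "Vince Gill": ["Artist A"],
    "Artist A": ["Vince Gill", "Artist B"],
    "Artist B": ["Artist A"],
    "Artist C": [],
}


def degrees_of_separation(graph, start, target):
    """Return the smallest number of links between start and target, or None."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        artist, distance = queue.popleft()
        for neighbour in graph.get(artist, ()):
            if neighbour == target:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # no chain of relationships connects the two
```

Because breadth-first search visits artists in order of distance, the first time it reaches the target it has found the shortest chain. In the toy graph, "Artist B" is two degrees from Vince Gill and "Artist C" is not connected at all.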

Laurent Monin joins the team as a part-time sysadmin

For the first time in a number of years we have a person responsible for system administration! Over the past few years we’ve been trying to spread the duties to maintain our servers among our developers. This only worked so well, and duties have been piling up without being attended to.

With the introduction of our new MetaBrainz site in May, we finally have an increasing revenue stream, which allows us to hire a paid sysadmin. Hopefully we can work on our backlog of tasks now.

Laurent Monin (aka Zas) is no stranger to our project — he has been hacking on Picard for a number of years and he attended last year’s summit in Copenhagen. I’m quite happy to have found a community member and long-standing contributor to take on this task.

Some of the first tasks that Laurent will take on are from direct feedback from our blog series about community improvements. We’re hoping to consolidate our mailing lists and forums into a Discourse instance and then provide single sign-on for Discourse, our Wiki and Jira. Stuff we’ve talked about for years, but have never made any progress on.

I’m quite excited to have Zas on board! Welcome!

A positive outlook going forward

My next installment of MusicBrainz management changes focuses on how we should frame our discussions going forward. Currently there is a lot of animosity in our community and a lot of finger pointing. Neither of these is constructive for moving forward, so I will aim to cut these short and focus on fixing rather than blaming.

I’d like to offer an analogy to start this discussion: When two people are in a personal relationship and when that relationship starts falling apart, a lot of negative feelings come up. The two people will often blame each other and be convinced that the other person is the reason for all of their troubles. If you’ve ever had an opportunity to talk to two people in a failing relationship, you’ll probably have seen that failing relationships are usually the fault of both people. I’ve yet to find a relationship that failed solely because of the actions of one person. Both people are involved, both people had a hand in it.

That said, I’ll step forward and say it: I am guilty. I am partially to blame for what is going on. Go ahead, feel free to blame me for the troubles we’re facing.

But, that is it. Basta! We’re not going to engage in finding every little thing that was done wrong, by whom, and work hard to lay blame. That is pointless and it brings up unnecessary emotions. Instead of finding blame we’re going to find solutions to our problems and we’re going to move forward.

As part of me restructuring MusicBrainz, I’m going to be asking everyone what problems they perceive with the project right now. I will listen to the problems, catalog them and attempt to build a plan for tackling these problems in the future. However, I will insist that problems are stated without aggressive communication (e.g. passive aggressive communication) and without value judgements. If you cannot state your issue without being aggressive or disrespectful, you can count on me calling you on your behaviour. I will not address problems that are stated in an aggressive or disrespectful manner.

For instance, it is not acceptable to say: “I don’t think that anyone is going to listen to me anyway, but I think that because of Joe’s idiotic decision to not allow white space in code, all of our code is a freaking mess — this was the worst idea ever!” This statement has passive aggressive communication, it lays blame and contains a value judgement. One way to express the same concern in a constructive manner could be: “The decision to exclude whitespace from our code has created a number of difficulties for people to follow our code. We should re-consider this decision.”

This means of expressing problems, ideas and solutions allows us to focus our energy on moving forward and improving the project. It avoids painful discussions that won’t give us much insight on moving forward. As we work to mend our community, I will be relying on these communication tools heavily. If you run afoul of these new communication guidelines, expect me to remind you of this blog post. 🙂

Postgres troubles resolved

I am glad to report that our problems are fixed and that our server is back to humming along nicely. The following is posted here so that other souls who find themselves in our situation may learn from our experience:

What we changed:

  1. It was pointed out that max_connections of 500 was in fact insanely high, especially in light of using PGbouncer. Before we used PGbouncer we needed a lot more connections and when we started using PGbouncer, we never reduced this number.
  2. Our server_lifetime was set far too high (1 hour). Josh Berkus suggested lowering that to 5 minutes.
  3. We reduced the number of PGbouncer active connections to the DB.
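
As a rough illustration of the PGbouncer side of these changes, the sketch below parses an ini fragment and reports the settings in friendlier units. Treat the fragment as an example, not our real configuration: `server_lifetime = 300` is the 5 minutes suggested above, but the pool size is a made-up value, and using `default_pool_size` to cap connections to the DB is an assumption about which setting applies.

```python
import configparser

# Example fragment only; not our production pgbouncer.ini.
EXAMPLE_INI = """
[pgbouncer]
; recycle server connections after 5 minutes instead of 1 hour
server_lifetime = 300
; hypothetical value: the post does not say what the pool was reduced to
default_pool_size = 20
"""


def load_pgbouncer_settings(ini_text):
    """Read the two settings discussed above from a pgbouncer-style ini string."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["pgbouncer"]
    return {
        # server_lifetime is given in seconds; convert for readability
        "server_lifetime_minutes": section.getint("server_lifetime") / 60,
        "default_pool_size": section.getint("default_pool_size"),
    }
```

Running `load_pgbouncer_settings(EXAMPLE_INI)` on the fragment reports a server lifetime of 5 minutes. The `max_connections` change from item 1 lives in the Postgres configuration rather than here, and we are not quoting a new value for it.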

What we learned:

  1. We had too many backends.
  2. The backends were being kept around for too long by PGbouncer.
  3. This caused too many idle backends to kick around. Once we exhausted physical ram, we started swapping.
  4. Linux 3.2 apparently has some less than desirable swap behaviours. Once we started swapping, everything went nuts.
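
The chain of events above comes down to simple arithmetic: backends times memory per backend versus physical RAM. The sketch below makes that explicit. Only the figure of 500 connections comes from this post; the per-backend memory and the RAM size are hypothetical numbers picked for the example, and real backend memory use varies a lot with workload.

```python
def estimated_backend_memory_mb(backends, mb_per_backend):
    """Very rough estimate: every backend is assumed to use the same amount."""
    return backends * mb_per_backend


def will_swap(backends, mb_per_backend, ram_mb, reserved_mb=0):
    """True if the backends need more memory than RAM minus what is reserved."""
    return estimated_backend_memory_mb(backends, mb_per_backend) > ram_mb - reserved_mb
```

With the hypothetical figures of 40 MB per backend and 16384 MB of RAM, `will_swap(500, 40, 16384)` is true while `will_swap(100, 40, 16384)` is false. That is the whole story in miniature: fewer and shorter-lived backends keep the total under physical memory.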

Going forward we’re going to upgrade our kernel the next time we have downtime for our site; the rest should be sorted now.

Finally a word about Postgres itself:

Postgres rocks our world. I’m immensely pleased that once again the problems were our own stupidity and not Postgres’ fault. In over 10 years of using Postgres, problems with our site have never been Postgres’ fault. Not once.

Thanks to everyone who helped us through this tough time!