I am glad to report that our problems are fixed and that our server is back to humming along nicely. The following is posted here so that if some other souls find themselves in our situation, they may learn from our experience:
What we changed:
- It was pointed out that max_connections of 500 was in fact insanely high, especially in light of using PGbouncer. Before we used PGbouncer we needed a lot more connections and when we started using PGbouncer, we never reduced this number.
- Our PGbouncer server_lifetime was set far too high (1 hour). Josh Berkus suggested lowering it to 5 minutes.
- We reduced the number of PGbouncer active connections to the DB (a sketch of the relevant settings follows this list).
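For anyone who wants the concrete numbers, here is a rough sketch of what those settings look like after the change, using the values we mention further down in the comments (the right numbers for your own setup will depend on your hardware and workload):

# postgresql.conf (was 500)
max_connections = 100

; pgbouncer.ini (server_lifetime was previously 1 hour; both values are in seconds)
server_lifetime=300
server_idle_timeout=300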
What we learned:
- We had too many backends.
- The backends were being kept around for too long by PGbouncer.
- This caused too many idle backends to kick around (see the query after this list). Once we exhausted physical RAM, we started swapping.
- Linux 3.2 apparently has some less than desirable swap behaviours. Once we started swapping, everything went nuts.
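If you want to check whether the same thing is happening to you, the backend counts are visible in pg_stat_activity. On PostgreSQL 9.2 and later (which exposes a state column) something like this shows how many backends are sitting idle:

-- count backends by state; a large 'idle' count means connections are being held open
SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;

On older releases the same information is in the current_query column, where idle backends show up as '<IDLE>'. Every backend holds real memory, which is how an idle pile-up eventually pushes a box into swap.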
Going forward, we’re going to upgrade our kernel the next time we have downtime for our site; the rest should be sorted now.
Finally, a word about Postgres itself:
Postgres rocks our world. I’m immensely pleased that once again the problems were our own stupidity and not Postgres’ fault. In over 10 years of using Postgres, problems with our site have never been Postgres’ fault. Not once.
Thanks to everyone who helped us through this tough time!
Congrats!
Just for my records:
Who is Josh Berkus?
He is one of the driving forces behind Postgres and a fellow GSoC mentor.
I am on the PG list that was used to help diagnose the issue. In addition to a great Postgres product, there is a wonderful safety net in place: selfless contributors giving their time and knowledge. It’s comforting to see the commitment this community has to Open Source principles and to helping others achieve greatness. PG does indeed rock! Kudos as well to the MusicBrainz team for transparency and for setting an example of how to be “helpable”.
Glad to read this, and very happy to see that you can relax now.
Thanks for your hard work! 🙂
Thanks for this post. It’s very insightful. There isn’t a lot of high-scalability info out there for pgbouncer, so your findings are very useful.
> Before we used PGbouncer we needed a lot more connections and when we started using PGbouncer, we never reduced this number.
What did you reduce that number to?
> We reduced the number of PGbouncer active connections to the DB.
What does your number of pgbouncer active connections look like now?
Thanks!
We reduced max_connections to 100, which is where we had it before we used pgbouncer. I believe the recommendation is 2 * number of cores, which for us ought to be 48, but 100 seems to be fine for us now.
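If anyone wants to sanity-check their own server, you can confirm the live value from psql with:

SHOW max_connections;

The 2 * cores figure is just a rule of thumb (2 * 24 cores = 48 in our case), and with pgbouncer sitting in front of the database, 100 leaves us plenty of headroom.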
We reduced the number of active connections by adding these timeout statements to our pgbouncer.ini:
server_lifetime=300
server_idle_timeout=300
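Both of those values are in seconds (300 = 5 minutes). If you want to confirm they took effect, you can connect to pgbouncer’s admin console (the special pgbouncer database, on port 6432 by default) and run:

SHOW CONFIG;
SHOW SERVERS;

SHOW CONFIG lists the settings pgbouncer is actually running with, and SHOW SERVERS shows how many server connections are currently open to Postgres.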
I hope this helps.
Very helpful!
Thanks.