Postgres troubles

(Regular readers of this blog, please ignore this post. We’re casting a wide net to try and find help for our problems.)

UPDATE: This problem has been resolved and all of our services are returning to their normally dubious service levels. For a technical explanation of what went wrong, see here.

Dear Postgres gurus:

We at MusicBrainz have been very happy postgres users for over a decade now and Postgres is something that gives is very few headaches compared to all the other things that we run. But last week we started having some really vexing issues with our server. Here is some back-story:

http://blog.musicbrainz.org/2015/03/14/hosting-issues-downtime-tonight/

When our load spiked, we did the normal set of things that you do:

  • Check for missing indexes, made some new ones, no change. (see details below)
  • Examined for new traffic; none of our web front end servers showed an increase in traffic.
  • Eliminated non-mission-critical uses of the DB server: stop building indexes for search, turn off lower priority sites. No change.
  • Review the performance settings of the server. Debate each setting as a team and tune. shared_buffers and work_mem tuning has made the server more resilient to recover from spikes, but still, we get massive periodic spikes.

From a restart, everything is happy and working well. Postgres will use all available ram for a while, but stay out of swap, exactly what we want it to do. But then, it tips the scales and digs into swap and everything goes to hell. We’ve studied this post for quite some time and ran queries to understand how Posgres manages its ram:

http://www.depesz.com/2012/06/09/how-much-ram-is-postgresql-using/

And sure enough ram usage just keeps increasing and once we go beyond physical ram, it goes into swap. Not rocket science. We’ve noticed that our back ends keep growing in size. According to top, once we start having processes that are 10+% of ram, we’re nearly on the cusp of entering swap. It happens predictably time and time again. Selective use of pg_terminate_backend() of these large back ends can keep us out of swap. A new, smaller backend gets created, RAM usage goes down. However, this is hardly a viable solution.

We’re now on Postgres 9.1.15, and we have a lot of downstream users who also need to upgrade when we do, so this is something that we need to coordinate months in advance. Going to 9.4 is out in the short term. 🙁 Ideally we can figure out what might be going wrong so we can fix it post-haste. MusicBrainz has been barely usable for the past few days. 🙁

One final thought: We have an several tables from a previous version of the DB sitting in the public schema not being used at all. We keep meaning to drop those tables, but haven’t gotten around to it yet. They tables are not being used at all, so we assume that they should not impact the performance of Postgres. Might this be a problem?

So, any tips or words of advice you have for us, would be deeply appreciated. And now for way too much information about our setup:

Postgres:

9.1.15 (from ubuntu packages)

Host:

  • Linux totoro 3.2.0-57-generic #87-Ubuntu SMP Tue Nov 12 21:35:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
  • 48GB ram
  • Raid 1,0 disks
  • PgBouncer in use.
  • Running postgres is its only task

postgresql.conf:

archive_command = '/bin/true'
archive_mode = 'on'
autovacuum = 'on'
checkpoint_segments = '128'
datestyle = 'iso, mdy'
default_statistics_target = '300'
default_text_search_config = 'pg_catalog.english'
data_directory = '/home/postgres/postgres9'
effective_cache_size = '30GB'
hot_standby = 'on'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
listen_addresses = '*'
log_destination = 'syslog'
log_line_prefix = '<%r %a %p>'
log_lock_waits = 'on'
log_min_duration_statement = '1000'
maintenance_work_mem = '64MB'
max_connections = '500'
max_prepared_transactions = '25'
max_wal_senders = '3'
custom_variable_classes = 'pg_stat_statements'
pg_stat_statements.max = '1000'
pg_stat_statements.save = 'off'
pg_stat_statements.track = 'top'
pg_stat_statements.track_utility = 'off'
shared_preload_libraries = 'pg_stat_statements,pg_amqp'
shared_buffers = '12GB'
silent_mode = 'on'
temp_buffers = '8MB'
track_activities = 'on'
track_counts = 'on'
wal_buffers = '16MB'
wal_keep_segments = '128'
wal_level = 'hot_standby'
wal_sync_method = 'fdatasync'
work_mem = '64MB'

pgbouncer.ini:

[databases]
musicbrainz_db_20110516 = host=127.0.0.1 dbname=musicbrainz_db_20110516

[pgbouncer]
pidfile=/home/postgres/postgres9/pgbouncer.pid
listen_addr=*
listen_port=6899
user=postgres
auth_file=/etc/pgbouncer/userlist.txt
auth_type=trust
pool_mode=session
min_pool_size=10
default_pool_size=320
reserve_pool_size=10
reserve_pool_timeout=1.0
idle_transaction_timeout=0
max_client_conn=400
log_connections=0
log_disconnections=0
stats_period=3600
stats_users=postgres
admin_users=musicbrainz_user

Monitoring:

To see these, enter these anti-spam passwords: User: “musicbrainz” passwd: “musicbrainz”

Load: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196205794.8081;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_load.rrd

Disk IO: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196376086.1393;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_diskstats-sda-count.rrd

RAM Use: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196204920.6439;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_disk-physicalmemory.rrd

Swap use: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196204920.6439;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_disk-swapspace.rrd

Processes: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196376477.1968;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_processes.rrd

Indexes:

We ran the query from this suggestion to identify possible missing indexes:

http://stackoverflow.com/questions/3318727/postgresql-index-usage-analysis

this is our result:

https://gist.github.com/mayhem/423b084043235fb78642

Most of these tables are tiny and kept in ram. Postgres opts to not use any indexes we create on the DB, so no change.

UPDATES:

  • Five months ago we double the RAM from 24GB to 48GB, but our traffic has not increased.
  • We’ve set kernel.swapiness to 0 with no real change.
  • free -m:
             total       used       free     shared    buffers     cached
Mem:         48295      31673      16622          0          5      12670
-/+ buffers/cache:      18997      29298
Swap:        22852       2382      20470

Service downtime to fix some database issues

This Friday we’re going to need to take a 15-20 minute downtime to fix a few leftover issues from our recent schema change. We tried to do this without downtime, but the service got progressively slower, so we’re electing to take some downtime.

We’ll be down shortly after Noon PST, 3PM EST, 20:00 UK, 21:00 CET for about 15-20 minutes.

Sorry for the hassles this causes.

Downtime for fall schema change

Our next schema change version will be released on Monday, 17 November, 2014 around Noon PST/3pm EST/20:00 GMT/21:00 CET. We expect that MusicBrainz will be unavailable for 30 – 60 minutes during this time. We will put up the downtime notification on the site and tweet from @musicbrainz right before the release.

Sadly, our backup database server suffered a hardware failure and we ran out of time to get a replicated database setup after the hardware was fixed. This means that we won’t be able to put the site into read-only mode and will require us to take a full-downtime.

It sucks and we’re not happy about it either, but there is only so much we can accomplish with our limited resources. 🙁

Sorry for any troubles this may cause you.

MusicBrainz main site downtime

On October 30, 2014 at 2pm Pacific Time (5pm Eastern, 21:00 UK/GMT, 22:00 CET) we’re going to upgrade the RAM in our database server. This means that we’ll have a complete site-wide outage for about 15 minutes.

We’ll tweet details from @musicbrainz as the time approaches and we’ll have a notice on the site shortly before the down time.

Sorry for the inconvenience.

Phew: MusicBrainz is unaffected by the Heartbleed fiasco

As you’ve probably heard all over the net, a vulnerability was found in a very popular and critical piece of software that a lot of sites on the net use. While we also use said piece of software, our version is a bit older and therefore we’re not affected by this bug. There is no need to change your password, unless you use a common password on any of these affected sites listed in the link above.

To get an idea of how serious this issue is, take a look at this stunning list of affected sites!

But, don’t take our word for it. Use this link to check to see if musicbrainz.org is affected.

Now go and change your passwords. NOW!

New editing feature: adding new entities from inline search fields

As mentioned in the blog post for the 2013-12-23 server release, a useful new feature for editors is the option to add new entities directly from inline search fields:

Image

There’s an “Add a new [entity]” button at the bottom of (most) search result menus, which upon clicking will open up a dialog in the page. The dialog contains a form identical to the one you’re used to when creating entities the old-fashioned way — that is, from the Editing menu at the top of the website.

Once you’ve successfully added a new entity from the dialog, it’ll be automatically loaded into the search field you started from.

The visible exceptions to this feature are that you can’t add new areas (only location editors can add those), or releases (because those take a lot longer to add, and it wouldn’t be useful to do so in a small dialog that you can accidentally close), and finally, you can’t spawn an add-entity dialog within another add-entity dialog. 🙂

Another change that went along with this feature (that editors should be aware of) is that we now require artists and labels to be selected on the “Release Information” tab of the release editor.

Image

Previously, you could enter plain text into these fields, proceed to the next tab without selecting a search result, and handle creation on the “Add Missing Entities” tab. Because “Add Missing Entities” has a very limited UI, in the future we’d like to remove it in favor of easier entity-creation on the other tabs. This is a small step towards that. Note that you can still use “Add Missing Entities” for track artists — this change only affects the release artist and release labels on the first tab.

dcYWRqQ

Don’t hesitate to report any bugs or suggestions about this feature to our issue tracker: http://tickets.musicbrainz.org/

All MusicBrainz sites downtime

On Sunday, December 29th at 1pm PST, (2pm AZ, 4pm EST, 9pm UK, 10pm CET) we’re going to swap out our network switch. During this time all MusicBrainz sites hosted in California will be unavailable. (that is all sites, save for the primary and secondary FTP mirrors and the FreeDB gateway).

The work will not start exactly at 1pm, but we’re doing to start executing our plan at 1pm. The exact time for the outage will be announced via Twitter and via the banner on musicbrainz.org

We hope that this outage will last only 10-15 minutes, but as these things typically go, you’ll never know how long it will really take.

Sorry for the inconvenience.

MusicBrainz Read Only Between 2PM – 3PM UTC

We need to do essential maintainance on the main MusicBrainz database server today (specifically upgrading to Ubuntu 12.04 LTS). We’re aiming to get this work completed within an hour, and will require MusicBrainz to be in read-only mode for the duration of the upgrade. We’re going to go read-only from 2PM UTC (6AM PST, 9AM EST, 3PM CEST) to begin this work.

Sorry for the inconvenience!

Venue and Studio Support: Introducing Places

MusicBrainz now supports venues and studios via our new “place” entity!

This was one of our Google Summer of Code projects for this year and many thanks to Nicolás Tamargo for his work on it. We released his work a few weeks ago and after a few initial hiccups, it’s looking good and we want to let you all know about it. 🙂

So what can we do with places?

The most obvious thing we can do now is store information about recording, mixing and mastering locations.

For example, the studios listed in the credits for Universe by Kyoko Fukada:

places-releasecredits

and the venue for the recordings on Live in Cartoon Motion by Mika:

places-recordingcredits

We can of course link the place to a variety of external sites, as can be seen in the list of URLs for Wembley Arena:

places-urls

Some places are made up of several parts. In those cases, we can link one place as being part of another. For example the various studios at Abbey Road Studios:

places-parts

or the hall and theatre of the Barbican Centre:

places-parts2

We were already able to add engineers to the database as artists, now we can also say which studio they work at, as seen here for the studio Railroad Tracks:

places-engineers

Many orchestras and sometimes other artists have a home venue where they perform on a regular basis. These can now be linked, like in you can see for the Barbican Centre: Barbican Hall:

places-primaryvenue

A premiere is sometimes held for a work and now we can link those works to where the premiere was held, e.g. the following works which were premiered at Carnegie Hall:

places-premiere

The place can also have coordinates, which make it possible to pinpoint the location on a map. The MusicBrainz website doesn’t show any maps at present, but here’s a map of all places with coordinates by Mineo:

places-map50

Events?

No, we do not yet support events.

Thanks to nikki for writing this post.

We’re going to take the HTTPS plunge!

Yesterday in our dev meeting we agreed to take the HTTPS plunge for all of our web site traffic in as little as 2 weeks time. This means that all web site traffic (not the web service) will be served over HTTPS; if you visit any MusicBrainz HTTP URL (e.g. http://musicbrainz.org ) you will be redirected to the equivalent HTTPS URL (e.g. https://musicbrainz.org ). This will not be applied to our web services, you’ll still be able to access those with HTTP. However, we do encourage all of our web service users to make use of HTTPS when possible.

We have one bug to address before we make this switch. And if we can find a sufficient fix for this in time, we’re going to make the HTTPS switch on 16 September 2013. If we can’t find an acceptable fix, we’ll have to postpone this switchover.

If for some reason you can see that switching all web site traffic to HTTPS is a bad idea, please leave us a comment ASAP.