Postgres troubles

(Regular readers of this blog, please ignore this post. We’re casting a wide net to try and find help for our problems.)

UPDATE: This problem has been resolved and all of our services are returning to their normally dubious service levels. For a technical explanation of what went wrong, see here.

Dear Postgres gurus:

We at MusicBrainz have been very happy postgres users for over a decade now and Postgres is something that gives is very few headaches compared to all the other things that we run. But last week we started having some really vexing issues with our server. Here is some back-story:

http://blog.musicbrainz.org/2015/03/14/hosting-issues-downtime-tonight/

When our load spiked, we did the normal set of things that you do:

  • Check for missing indexes, made some new ones, no change. (see details below)
  • Examined for new traffic; none of our web front end servers showed an increase in traffic.
  • Eliminated non-mission-critical uses of the DB server: stop building indexes for search, turn off lower priority sites. No change.
  • Review the performance settings of the server. Debate each setting as a team and tune. shared_buffers and work_mem tuning has made the server more resilient to recover from spikes, but still, we get massive periodic spikes.

From a restart, everything is happy and working well. Postgres will use all available ram for a while, but stay out of swap, exactly what we want it to do. But then, it tips the scales and digs into swap and everything goes to hell. We’ve studied this post for quite some time and ran queries to understand how Posgres manages its ram:

http://www.depesz.com/2012/06/09/how-much-ram-is-postgresql-using/

And sure enough ram usage just keeps increasing and once we go beyond physical ram, it goes into swap. Not rocket science. We’ve noticed that our back ends keep growing in size. According to top, once we start having processes that are 10+% of ram, we’re nearly on the cusp of entering swap. It happens predictably time and time again. Selective use of pg_terminate_backend() of these large back ends can keep us out of swap. A new, smaller backend gets created, RAM usage goes down. However, this is hardly a viable solution.

We’re now on Postgres 9.1.15, and we have a lot of downstream users who also need to upgrade when we do, so this is something that we need to coordinate months in advance. Going to 9.4 is out in the short term. 🙁 Ideally we can figure out what might be going wrong so we can fix it post-haste. MusicBrainz has been barely usable for the past few days. 🙁

One final thought: We have an several tables from a previous version of the DB sitting in the public schema not being used at all. We keep meaning to drop those tables, but haven’t gotten around to it yet. They tables are not being used at all, so we assume that they should not impact the performance of Postgres. Might this be a problem?

So, any tips or words of advice you have for us, would be deeply appreciated. And now for way too much information about our setup:

Postgres:

9.1.15 (from ubuntu packages)

Host:

  • Linux totoro 3.2.0-57-generic #87-Ubuntu SMP Tue Nov 12 21:35:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
  • 48GB ram
  • Raid 1,0 disks
  • PgBouncer in use.
  • Running postgres is its only task

postgresql.conf:

archive_command = '/bin/true'
archive_mode = 'on'
autovacuum = 'on'
checkpoint_segments = '128'
datestyle = 'iso, mdy'
default_statistics_target = '300'
default_text_search_config = 'pg_catalog.english'
data_directory = '/home/postgres/postgres9'
effective_cache_size = '30GB'
hot_standby = 'on'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
listen_addresses = '*'
log_destination = 'syslog'
log_line_prefix = '<%r %a %p>'
log_lock_waits = 'on'
log_min_duration_statement = '1000'
maintenance_work_mem = '64MB'
max_connections = '500'
max_prepared_transactions = '25'
max_wal_senders = '3'
custom_variable_classes = 'pg_stat_statements'
pg_stat_statements.max = '1000'
pg_stat_statements.save = 'off'
pg_stat_statements.track = 'top'
pg_stat_statements.track_utility = 'off'
shared_preload_libraries = 'pg_stat_statements,pg_amqp'
shared_buffers = '12GB'
silent_mode = 'on'
temp_buffers = '8MB'
track_activities = 'on'
track_counts = 'on'
wal_buffers = '16MB'
wal_keep_segments = '128'
wal_level = 'hot_standby'
wal_sync_method = 'fdatasync'
work_mem = '64MB'

pgbouncer.ini:

[databases]
musicbrainz_db_20110516 = host=127.0.0.1 dbname=musicbrainz_db_20110516

[pgbouncer]
pidfile=/home/postgres/postgres9/pgbouncer.pid
listen_addr=*
listen_port=6899
user=postgres
auth_file=/etc/pgbouncer/userlist.txt
auth_type=trust
pool_mode=session
min_pool_size=10
default_pool_size=320
reserve_pool_size=10
reserve_pool_timeout=1.0
idle_transaction_timeout=0
max_client_conn=400
log_connections=0
log_disconnections=0
stats_period=3600
stats_users=postgres
admin_users=musicbrainz_user

Monitoring:

To see these, enter these anti-spam passwords: User: “musicbrainz” passwd: “musicbrainz”

Load: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196205794.8081;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_load.rrd

Disk IO: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196376086.1393;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_diskstats-sda-count.rrd

RAM Use: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196204920.6439;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_disk-physicalmemory.rrd

Swap use: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196204920.6439;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_disk-swapspace.rrd

Processes: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1196376477.1968;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Ftotoro_processes.rrd

Indexes:

We ran the query from this suggestion to identify possible missing indexes:

http://stackoverflow.com/questions/3318727/postgresql-index-usage-analysis

this is our result:

https://gist.github.com/mayhem/423b084043235fb78642

Most of these tables are tiny and kept in ram. Postgres opts to not use any indexes we create on the DB, so no change.

UPDATES:

  • Five months ago we double the RAM from 24GB to 48GB, but our traffic has not increased.
  • We’ve set kernel.swapiness to 0 with no real change.
  • free -m:
             total       used       free     shared    buffers     cached
Mem:         48295      31673      16622          0          5      12670
-/+ buffers/cache:      18997      29298
Swap:        22852       2382      20470

Server update, 2015-03-09

This one’s mostly a small bug-fix release, since improvements to sitemaps and other big projects have been concurrently in the works. One thing of note is that the “external links editor” on entity edit pages was rewritten to be more maintainable in the long run; hopefully no bugs have creeped in with the new code, but if any have then please report them on our issue tracker!

Thanks to reosarevok and the MetaBrainz team for working on the changes below. The git tag is v-2015-03-09.

Bug

  • [MBS-8055] – Series-Series relationships are blocked by JS
  • [MBS-8151] – Filename not being shown in remove cover art edits
  • [MBS-8221] – Work pages don’t show event rels
  • [MBS-8270] – Bottom display of work relationships is inaccurate for works appearing across multiple tracks

Improvement

  • [MBS-7913] – Allow seeding of non-URL ARs when creating non-Release entities via URL parameters
  • [MBS-8258] – Rewrite the external links editor in React

Server update, 2015-02-23

We have another server release out today (a little over a day late, as given away by the title). The most exciting change this time around is CritiqueBrainz integration: on release group pages, the most popular and most recent reviews from CritiqueBrainz are now displayed, along with links to write your own. Hopefully people find these reviews useful and interesting and are encouraged to start contributing to CritiqueBrainz themselves!

As usual, the release contains a variety of other bug fixes and improvements, detailed in the changelog below. Thanks very much to chirlu, Freso, reosarevok, and the MetaBrainz team for their work on today’s release. The git tag is v-2015-02-23.

Bug

  • [MBS-4928] – It is possible for non-auto-editor users to submit ‘approve’ votes
  • [MBS-5959] – _uniq indices for editor_subscribe_{artist,editor,label} not unique
  • [MBS-6763] – Entity merges don’t save Ended if there’s no date
  • [MBS-8200] – ws/js/edit does not validate release event dates
  • [MBS-8201] – Dates with year == 0 are invisible
  • [MBS-8203] – Spurious spaces in relationship target lists
  • [MBS-8224] – Events can’t be rated via /ws/2/ratings

Improvement

  • [MBS-6164] – Filter out duplicate annotations when merging
  • [MBS-6885] – hard limit height of wikipedia excerpts the same as annotations
  • [MBS-7477] – Add a unique constraint on track (medium, position)
  • [MBS-7872] – Show CritiqueBrainz reviews on the release group pages
  • [MBS-8205] – Remove entity browse pages
  • [MBS-8220] – Align Artist columns in discographies

Task

  • [MBS-3998] – Editor may vote / may add edit note checks should also be done at the Data level.
  • [MBS-6451] – Update cleanup code for Amazon India and Brazil
  • [MBS-8107] – Add autoselect for Sina Weibo URLs
  • [MBS-8176] – Add autoselect for CD Japan URLs
  • [MBS-8192] – Add autoselect for work-level IMDb rels
  • [MBS-8199] – Unset the date for release events that have 0000-00-00
  • [MBS-8226] – Add ClassicalArchives.com to Other DBs whitelist

Server update, 2015-02-09

We’re back with another release! We’ve got a good amount in this release, with patches from the MetaBrainz team as well as chirlu, nikki, reosarevok, Freso, JesseW, and Johan Hattne. A short preview of changes:

  • Releases with many mediums (more than 10) now don’t load medium data by default — with JavaScript on, they can be expanded in place, and without javascript there are newly-minted pages per medium. This also removes the regrettable requirement for JavaScript being on to display release data, which we’d had as a temporary improvement here.
  • Table of Contents offsets for CDs are now returned when the discid endpoints are used.
  • The subscription emails were running into an issue with memory usage, which means some users were not getting emails (or only getting emails sporadically). The script has been reworked to have predictable and lower memory usage so that it’s not killed. Good job on our users for providing it a hard enough task it finally fell over though!
  • Importing and replication should now work with the newest version of DBD::Pg, which made a change that broke our code.
  • A variety of new improvements (many style-related) and bugfixes.

The git tag for this release is v-2015-02-09.

The full list of tickets fixed:

Bug

  • [MBS-2948] – Set track lengths from disc id should indicate the mediums that use the tracklist
  • [MBS-3452] – Merged release annotation is in wrong order
  • [MBS-3841] – Loading releases or release groups with lots of discs often gives 502
  • [MBS-5692] – Internal server error when trying to create a new user with a blank database
  • [MBS-7238] – span.name-variation disappeared from release pages
  • [MBS-7402] – release page regression. inline external URL relationships link to MB only
  • [MBS-7442] – User asked to confirm unchanged “strange” barcode
  • [MBS-7900] – Duplicated tags in tag-list element from ws/2 XML request
  • [MBS-7920] – Nonfunctional rating on release group merge page
  • [MBS-7968] – Inline search triggers searches when the field only contains whitespace
  • [MBS-7985] – Attributes help link does nothing
  • [MBS-8010] – Entity tabs are broken when adding a release group to an existing artist
  • [MBS-8019] – Release tabs broken on the remove disc ID page
  • [MBS-8056] – Loading large-ish releases is very slow in Opera 12
  • [MBS-8058] – Documentation says viewing data doesn’t require JS, which is wrong for releases
  • [MBS-8084] – lightbox html escaping
  • [MBS-8095] – “Edits for your subscriptions” emails have not been sent for at least 3 days
  • [MBS-8104] – Tracklist times entered without colon doesn’t work for times that are not three digits
  • [MBS-8117] – 504 Gateway Time-out when trying to view edits for a Release Collection
  • [MBS-8118] – “Editor flag” edit search criterion fails if given multiple values
  • [MBS-8126] – Import scripts fail with newest DBD::Pg
  • [MBS-8156] – Timeline graph checkboxes are broken in Opera
  • [MBS-8157] – “Script too large” error prevents releases from rendering
  • [MBS-8159] – No secondary release group types in recording search results
  • [MBS-8185] – Series and Events’ “other db” URL rels are not currently being validated against the whitelist
  • [MBS-8191] – Collection edit lists are not made sufficiently private

Improvement

  • [MBS-2656] – Sorting in subscription emails should use collation
  • [MBS-2737] – Limit size of release page
  • [MBS-5752] – Make sure Trove URLs are autocorrected depending on the entity to which they’re linked
  • [MBS-7943] – Add support for TheSession.org events
  • [MBS-8040] – Add admin interface for editing attribute tables
  • [MBS-8164] – Add tests for event collections to Data::Collection

New Feature

  • [MBS-8180] – Return TOC offsets with disc ID in web service

Task

  • [MBS-8106] – Add Instagram URLs to the sidebar
  • [MBS-8138] – Add a bunch of sites to the whitelist for DBs (II)
  • [MBS-8189] – Move last.fm autoselect to the new last.fm relationship

Server update, 2015-01-26

A small release this fortnight as our various developers have been working on bigger projects. However, a few fixes made it in, including making edit searches which time out a lot nicer to use/reword, some better documentation, and slightly better coordinate-parsing.

Thanks to bitmap, chirlu, and the rest of the usual team for this release!

Bug

  • [MBS-7396] – Move Disc ID box out of place
  • [MBS-7984] – Edit searches which time out prevent the user from improving their search
  • [MBS-8002] – Width missing on checkbox column of events list
  • [MBS-8081] – “Remove selected releases from collection” no longer works
  • [MBS-8097] – “Add relationship type” edits don’t store the ID of the new type
  • [MBS-8113] – Strip whitespace when parsing coordinates

Improvement

  • [MBS-6439] – Pasting MBIDs into search fields should not be a hidden feature
  • [MBS-7565] – manual→automatic series edit leads to potentially great data loss
  • [MBS-8101] – Don’t immediately start the edit search when following a “Refine” link
  • [MBS-8122] – Allow switching to direct search when inline (indexed) search errors
  • [MBS-8131] – Remove the “CD in hand” suggestion from the no edit note message

Server update, 2015-01-12

Our first release of the new year is out today, containing the usual set of editing bug fixes and improvements. There’s also a small new feature in the release editor that allows propagating the release title and artist credit to the release group.

Anyone experiencing failures in InitDb.pl after upgrading DBD::Pg to version 3.5.0 will want to get this update, since it contains a workaround.

Thanks very much to chirlu, navap, nikki, and the MetaBrainz team for their work on today’s release.

The git tag is v-2015-01-12 and the full changelog is below.

Bug

  • [MBS-4904] – Edit search automatically focuses text fields on page load
  • [MBS-8025] – Spurious "performance" entries in work search results
  • [MBS-8032] – Internal server error applying problematic add ISRC edits
  • [MBS-8047] – Applied merge release groups link has broken links
  • [MBS-8099] – Edit search automatically focuses text fields when changing dropdowns
  • [MBS-8108] – Edit removing track with track mbid redirect can’t close
  • [MBS-8116] – Place edit form acts on outdated parse-coordinates responses
  • [MBS-8126] – InitDb.pl fails with newest DBD::Pg

Improvement

  • [MBS-2895] – Make it possible to propagate release changes to the release group
  • [MBS-6162] – Mention mbslave in INSTALL
  • [MBS-7498] – Recognise Deezer URLs
  • [MBS-7733] – Recognise and clean up 7digital.com URLs
  • [MBS-7924] – Match e-onkyo, hd-music and ototoy URLs with "purchase for download" relationship type
  • [MBS-8077] – Clean up mobile Soundcloud and ReverbNation URLs
  • [MBS-8078] – Match and clean up Google Play URLs
  • [MBS-8080] – Show CAA-down warning also when the Archive is down completely

New Feature

  • [MBS-7688] – Add autoselect for Vine URLs

Task

  • [MBS-5671] – MBServer internal code using old search json format should use new format

Server update, 2014-12-29

This release is quite small because it overlapped with the holidays, but contains a fix for an annoying guess-case issue in the release editor that several people reported. Thanks to chirlu for the rest of the bug fixes listed here!

The git tag is v-2014-12-29.

Bug

  • [MBS-4668] – Edit search loses selections when the page number is too high
  • [MBS-8064] – Can’t add documentation to migration-created relationship types
  • [MBS-8086] – Guess case for release title does not trigger an edit
  • [MBS-8091] – Can’t fix non-normalized artist credits

Server update, 2014-12-15

We’ve released another server update today, though a fairly modest one, since bigger changes have been happening in the background with embedding JSON-LD into our pages and switching our JavaScript over to browserify. A few more edit types have been made auto-edits for everyone, as detailed in the changelog below. There’s additionally been some misc. UI changes, and a fix to output artist genders in the JSON webservice.

Thanks to chirlu, nikki, reosarevok, and the MetaBrainz team for their hard work on today’s release.

The git tag is v-2014-12-15.

Bug

  • [MBS-7916] – “Set track durations” preview is broken for single-track mediums
  • [MBS-8008] – “Add Event” (entity) and “Add Event” (release event) share the same translation string
  • [MBS-8044] – Map doesn’t zoom when pasting coordinates
  • [MBS-8066] – “Date” field wraps inconsistently in different browsers
  • [MBS-8082] – Gender missing on artist lookup (JSON)

Improvement

  • [MBS-7478] – Make more use of HTML5 form field types
  • [MBS-7970] – Replace guess case bubbles with icons next to the fields
  • [MBS-7973] – Make remove alias/ISRC/ISWC edits auto-edits for auto-editors
  • [MBS-7975] – Make move disc ID edits auto-edits
  • [MBS-8045] – Update link_event to include dates

Server update, 2014-12-01

A day late, but hopefully no dollars short, we’re back with another release. This release is mostly bug fixes, as you’d expect right after a schema change release. nikki has done some work with our CSS, however, and chirlu did some work updating and clarifying our INSTALL.md. Thanks to them, mineo, and the MetaBrainz team for their work this release!

The git tag for this release is v-2014-12-01.

The usual list of bugs fixed:

Bug

  • [MBS-4232] – Edit artist shows artists credits section when it doesn’t apply
  • [MBS-4622] – Improve handling of cover art when JS is off
  • [MBS-6971] – No controls for uploading cover artwork in IE8
  • [MBS-7497] – cpanm install fails with Can’t locate File/Copy/Recursive.pm in @INC
  • [MBS-7952] – Release editor does not handle removed mediums correctly when reordering mediums
  • [MBS-7988] – Adding a pregap track to an existing tracklist and it shows as a data track
  • [MBS-7992] – Data track option doesn’t behave properly with disc IDs
  • [MBS-7993] – Edit medium edits for pre-gap tracks claim to change data tracks
  • [MBS-7995] – DBDefs.pm.sample is still at schema version 20
  • [MBS-8011] – Last.fm event URLs don’t get matched
  • [MBS-8013] – Cannot add data track to release
  • [MBS-8018] – “TypeError: MB.typeInfoByID[i] is undefined” when creating an event series
  • [MBS-8022] – Sorting event collections doesn’t work
  • [MBS-8024] – FixTrackLength script is setting incorrect lengths on pregap tracks
  • [MBS-8029] – Collections overview uses “Releases” even for events
  • [MBS-8033] – Encoding issue in some “explanation balloons”
  • [MBS-8034] – Duplicate relationships with attributes cause internal server error in release relationship editor

Improvement

  • [MBS-6332] – Allow sorting releases in collections by release date
  • [MBS-6333] – Allow sorting releases in collections by release country
  • [MBS-7824] – Add Barcode column to Collections releases view
  • [MBS-7997] – Show area in event lists
  • [MBS-8001] – Use larger Gravatar images on higher pixel density displays
  • [MBS-8009] – URL Cleanup should clean FB event links better
  • [MBS-8036] – Add URL matching for setlist.fm

Task

  • [MBS-6682] – Add eu.ftp.musicbrainz.org to INSTALL.md

Schema change upgrade instructions, schema 21

This upgrade shouldn’t be substantially different than past upgrades, now that we’ve fixed a few bugs with the process. To upgrade:

  1. Make sure your REPLICATION_TYPE setting is RT_SLAVE and your DB_SCHEMA_SEQUENCE is set to 20 in lib/DBDefs.pm.
  2. Ensure you’ve replicated up to the most recent replication packet available with the old schema. (if you’re not sure, run ./admin/replication/LoadReplicationChanges and see what it tells you).
  3. Take down the web server running MusicBrainz, if you’re running a web server.
  4. Turn off cron jobs if you are automatically updating the database via cron jobs.
  5. Switch to the new code with git fetch origin followed by git checkout schema-change-20-to-21
  6. Run ./upgrade.sh (or carton exec -Ilib -- ./upgrade.sh if you’re using carton, with very old setups).
  7. Set DB_SCHEMA_SEQUENCE to 21 in lib/DBDefs.pm
  8. Turn cron jobs back on, if needed.
  9. Restart the MusicBrainz web server, if applicable. It’s also recommended you restart memcached.

That’s it! The only real difference from the past is the specific tag to be used: schema-change-20-to-21, which is a couple of fix-up commits past the regular release tag.