Server Troubles

Recently the server has been hit by patches of instability – large load spikes, running out of memory, and processes getting killed here and there. When the most recent out-of-memory condition occurred (last night) the SSH server was one of the processes which got killed, which is why the server had to be rebooted a little while ago.

I’m fairly sure I know what’s been causing the problems, and I’ve made a few changes to try to reduce the chance of it happening again.

One of the worst causes of the problem is looking up a TRM with a large number of tracks. The worst TRM by far for this is the “silence” TRM, with (currently) over 900 tracks. As a result I’ve had to, for now at least, disallow lookups on this TRM – doing so will now simply return an error. Sorry 😦 Maybe it can be made to do something more helpful in future.

The other change is that a lookup on any TRM with more than 100 tracks will now return only 100 of those tracks. So far no TRM apart from “silence” has over 100 tracks, so this won’t affect anyone yet. As the data grows, though, it will.
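For the curious, here is a rough sketch of those two rules, in Python purely for illustration. It is not the server's actual code: `lookup_tracks`, `LookupRefused`, the `fetch` callback and the `SILENCE_TRM` placeholder are all made-up names, and only the refusal and the 100-track limit come from the description above.

```python
from typing import Callable, List

# The 100-track limit comes from the post; everything else is hypothetical.
MAX_TRACKS_PER_LOOKUP = 100
SILENCE_TRM = "<silence-trm-id>"      # placeholder, not the real identifier
BLOCKED_TRMS = {SILENCE_TRM}


class LookupRefused(Exception):
    """Raised when a TRM lookup is disallowed outright."""


def lookup_tracks(trm_id: str,
                  fetch: Callable[[str, int], List[str]]) -> List[str]:
    """Return at most MAX_TRACKS_PER_LOOKUP tracks for a TRM.

    `fetch(trm_id, limit)` stands in for whatever actually reads the
    tracks; passing the limit down lets the data store avoid loading
    hundreds of rows in the first place.
    """
    if trm_id in BLOCKED_TRMS:
        # Refuse before touching the data store at all.
        raise LookupRefused("lookups on TRM %s are disabled" % trm_id)

    tracks = fetch(trm_id, MAX_TRACKS_PER_LOOKUP)
    # Truncate again in case the backend ignores the limit.
    return tracks[:MAX_TRACKS_PER_LOOKUP]
```

Checking the blocked list first means the expensive query never runs for the worst offender, and passing the limit to the fetch keeps memory use bounded for everything else.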

Sorry for any inconvenience caused (hey, I’m apologising again. This is getting to be a habit). But I’m sure you’d rather have a server which doesn’t keep crashing and locking us all out. Hey ho.

3 thoughts on “Server Troubles”

  1. If you’d rather change the problem from “random procs being killed by the OoM killer” to “random procs being killed by malloc failures”, you can stop the kernel from overcommitting memory (and thus needing to invoke the OoM killer) by:
    ‘sysctl -w vm.overcommit_memory=2’
    or
    ‘echo 2 > /proc/sys/vm/overcommit_memory’.

  2. Thanks, that’s worth knowing. At the moment I think I’ve got the problem under control, but if ever that proves not to be the case then changing the overcommit policy sounds useful.

    (Presumably a setting of “2” means that the thing doing the malloc gets killed, right? So some other process just minding its own business should be safe? If so, sounds good.)
