Recently the server has been hit by patches of instability – large load spikes, running out of memory, and processes getting killed here and there. When the most recent out-of-memory condition occurred (last night) the SSH server was one of the processes which got killed, which is why the server had to be rebooted a little while ago.
I’m fairly sure I more or less know what’s been causing the problems, and have made a few changes to try to reduce the chance of it happening again.
One of the worst causes of the problem is looking up a TRM with a large number of tracks. The worst TRM by far for this is the “silence” TRM, with (currently) over 900 tracks. As a result I’ve had to, for now at least, disallow lookups on this TRM – attempting to look it up will now simply return an error. Sorry 😦 Maybe it can be made to do something more helpful in future.
The other change is that if you do a lookup on any TRM which has more than 100 tracks, only 100 of those tracks will be returned. However, so far there are no TRMs (except “silence”) with over 100 tracks, so this won’t affect anyone yet. As the data grows, though, it will.
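For the curious, the change amounts to a hard cap on the lookup query plus an early-out for the silence case. Very roughly, and only as a sketch (this is Python with made-up table and column names, not the actual server code, and the silence ID is a placeholder):

    MAX_TRACKS = 100
    SILENCE_TRM = "<the silence TRM id>"   # placeholder, not the real value

    def lookup_trm(db, trm_id):
        # db is any DB-API connection (sqlite3, psycopg, etc.)
        # The silence TRM is disallowed outright and just returns an error.
        if trm_id == SILENCE_TRM:
            raise LookupError("lookups on this TRM are currently disabled")
        # Everything else is capped at MAX_TRACKS rows.
        cur = db.execute(
            "SELECT track FROM trm_track WHERE trm = ? LIMIT ?",
            (trm_id, MAX_TRACKS))
        return cur.fetchall()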
Sorry for any inconvenience caused (hey, I’m apologising again. This is getting to be a habit). But I’m sure you’d rather have a server which doesn’t keep crashing and locking us all out. Hey ho.
If you’d rather change the problem from “random procs being killed by the OoM killer” to “random procs being killed by malloc failures”, you can stop the kernel from overcommitting memory (and thus needing to invoke the OoM killer) by:
sysctl -w vm.overcommit_memory=2
or
echo 2 > /proc/sys/vm/overcommit_memory
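With a setting of 2 the kernel refuses any allocation that would take it past its commit limit, so the failure lands in whichever process made the request, as an ordinary failed allocation rather than a kill. A quick way to see the effect, just as an illustration (the 1 TB figure is arbitrary), from Python for example:

    try:
        buf = bytearray(1 << 40)   # ask for ~1 TB, far past any sane commit limit
    except MemoryError:
        # With overcommit disabled the request is simply refused and the
        # process gets a chance to handle it, instead of the OoM killer
        # picking a victim later on.
        print("allocation refused")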
Thanks, that’s worth knowing. At the moment I think I’ve got the problem under control, but if ever that proves not to be the case then changing the overcommit policy sounds useful.
(Presumably a setting of “2” means that the thing doing the malloc gets killed, right? So some other process just minding its own business should be safe? If so, sounds good.)