We have occasionally had problems with our servers running reliably, and if you were unlucky you may have noticed this.
Well, it took a very, very long time, but we have finally figured out what’s happening.
It turns out that the garbage collector in the Java Virtual Machine that runs our server software (all on a Linux virtual machine running on Amazon Web Services) was having problems. Most of the time, garbage collection runs just fine on an incremental basis, taking up only a fraction of a second of CPU time, but periodically the JVM does a full garbage collection as well.
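For the curious, here’s a minimal sketch (not our actual monitoring code) of how you can watch this from inside the JVM using the standard java.lang.management API; the collector names you’ll see depend on which GC algorithm is in use:

    // Minimal sketch: periodically print cumulative GC counts and times.
    // Collector names vary by GC algorithm (e.g. "G1 Young Generation" /
    // "G1 Old Generation", or "PS Scavenge" / "PS MarkSweep").
    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcWatcher {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    // Counts and times are cumulative since JVM start, so a sudden
                    // jump in the old-generation collector's total time is the
                    // "full GC taking seconds" symptom described above.
                    System.out.printf("%s: %d collections, %d ms total%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(10_000);  // poll every 10 seconds
            }
        }
    }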
The problem is that sometimes this full garbage collection would fail and immediately restart.
In the worst case, the garbage collection would start, fail, and restart over and over again until the server was basically thrashing. And each time the full garbage collection ran, it took 5-7 seconds of CPU time (which is a really long time, btw).
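If you run a JVM-based server yourself and want to see this kind of pattern directly, GC logging will record every collection along with its duration. The flags below are just a sketch (server.jar is a placeholder for your own application), and the exact flags depend on the JDK version:

    # JDK 9 and later: unified GC logging to a file
    java -Xlog:gc*:file=gc.log:time -jar server.jar

    # JDK 8: the older logging flags
    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -jar server.jar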
We are trying to understand the best long-term solution for this, but in the short term we can mitigate the problem in a variety of ways, including upgrading our server virtual machines to have more RAM.
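As a concrete illustration of that short-term mitigation (a sketch, with placeholder numbers rather than our actual configuration): after a RAM upgrade, the JVM also has to be told it’s allowed to use the extra memory, since by default it typically only claims a fraction of what the machine has.

    # Give the JVM a larger heap so full collections are needed less often.
    # The 8 GB figure is purely illustrative; server.jar is a placeholder.
    java -Xms8g -Xmx8g -jar server.jar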
One reason it took so long to debug is that we were chasing a red herring: we had noticed frequent network spikes and wrongly assumed they were correlated with the spikes in the server’s CPU load, but the two turned out to be coincidental rather than causal.
Sorry about all this.