Bug, fixed: a scheduler job that was killing performance on the Kerika server

Our apologies to anyone who was affected by this bug: the email scheduler we built (about a month ago) had a bug that caused CPU utilization to periodically spike all the way up to 100%, and this in turn caused the server to temporarily freeze. We have fixed this bug, thanks in part to a couple of intrepid users, from Poland and the UK, who gave us some important clues.

By way of background: we have a scheduler program that runs periodically on the Kerika server doing various daily tasks. One such task is resending invitations that have not yet been accepted (or rejected, as the case may be); another task is providing a daily summary of each user’s outstanding tasks.

The resending of invitations takes place at a fixed time, but the creation of the daily summary is more complex, since the system sends each user his/her daily summary at 6AM based upon that user’s last known location. This means, for example, that a user based in Seattle would get his daily summary sent at 6AM Pacific Time, and a user based in India would get her summary sent at 6AM Indian Standard Time.
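To illustrate the per-user timing described above, here is a minimal sketch in Python. It is not the actual Kerika scheduler code: the function names, the user names, and the idea of storing each user's last known location as a UTC offset in minutes are all assumptions made for the example. A real system would more likely store a time zone name so that daylight saving time is handled correctly.

```python
from datetime import datetime, timedelta, timezone

SUMMARY_HOUR = 6  # daily summaries go out at 6AM local time


def is_summary_due(now_utc, utc_offset_minutes):
    """Return True if it is currently the 6AM hour at the given UTC offset.

    utc_offset_minutes is a hypothetical stand-in for the user's
    last known location (e.g. -480 for Seattle in winter, 330 for India).
    """
    user_zone = timezone(timedelta(minutes=utc_offset_minutes))
    local_time = now_utc.astimezone(user_zone)
    return local_time.hour == SUMMARY_HOUR


def users_due(now_utc, user_offsets):
    """Return the names of users whose daily summary should be sent now."""
    return [
        name
        for name, offset in user_offsets.items()
        if is_summary_due(now_utc, offset)
    ]


# Hypothetical users: one in Seattle (UTC-8 in winter), one in India (UTC+5:30).
example_users = {"seattle_user": -480, "india_user": 330}
```

Because each user's 6AM falls at a different moment in UTC, a scheduler built this way does its summary work in bursts spread across the day rather than at one fixed time.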

The bug: the scheduler was using some overly complex SQL queries that caused the server’s CPU consumption to spike all the way up to 100%. In effect, when the scheduler ran one of these complex queries, nothing else could move on the server, and the result was an erratic user experience. Not good.

Why it wasn’t found before: because the scheduler ran at different times during the day, based upon the geo-location/distribution of our users, the behavior was not observed in a consistent manner. As our user base grew, the spikes occurred at different times during the day, and we didn’t make the connection.

How it was fixed: the old, complex SQL queries were taking 3-4 seconds to execute; replacing them with a couple of simple queries cut the time down to 1-2 milliseconds. That’s roughly a 2,000X improvement!
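Numbers like these are easy to collect once each query is timed. The sketch below shows one way to do that in Python; it is an illustration only, not the Kerika code. The in-memory SQLite database, the `invitations` table, and the 500 millisecond threshold are all hypothetical stand-ins, since the actual database and queries are not shown here.

```python
import sqlite3
import time

SLOW_QUERY_MS = 500.0  # hypothetical threshold for flagging a slow query


def timed_query(conn, sql, params=()):
    """Run a query and return (rows, elapsed time in milliseconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return rows, elapsed_ms


def is_slow(elapsed_ms, threshold_ms=SLOW_QUERY_MS):
    """Return True if a query took longer than the threshold."""
    return elapsed_ms > threshold_ms


# Hypothetical table standing in for pending invitations.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invitations (id INTEGER PRIMARY KEY, accepted INTEGER)"
)
conn.executemany(
    "INSERT INTO invitations (accepted) VALUES (?)", [(0,), (1,), (0,)]
)

# Time a simple query for invitations that have not been accepted.
rows, elapsed_ms = timed_query(
    conn, "SELECT id FROM invitations WHERE accepted = ?", (0,)
)
```

With a check like `is_slow`, a query taking 3-4 seconds would be flagged immediately, while one taking 1-2 milliseconds would pass unnoticed.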

Lessons learned: avoid complacency, even for what looks like simple, routine programming. Use a profiler. And always respond to user complaints within 24 hours, as we always have.