Amazon burped a little on the weekend

We use a number of Amazon Web Services, including one called Simple Queue Service which Kerika uses to handle communications between our main project database server and a separate server that handles the Search function.

  • As with all search engines, Kerika’s Solr engine does a full indexing of the database only once: when the database is rebuilt for any reason (which happens very rarely), and after that it does incremental indexing which means that it only looks at changes made to individual boards, cards, and canvases.
  • Using a queue helps us manage the load of traffic going to the search engine server: in the unlikely event that a lot of people make a lot of updates to their Kerika boards at the same time, Solr won’t get overwhelmed with a sudden burst of new indexing.
  • There are lots of ways to implement queues in software — in fact, studying queuing theory is a standard course in all computer science programs — and at this point most apps, like Kerika, prefer not re-invent that particular wheel: instead, it is more cost-effective to use some standard queuing facility that’s available as part of the underlying platform.

AWS works very well in our opinion — it has very high reliability across most of its services — but like all software, it isn’t entirely infallible.

Over the weekend we observed a small handful of errors in our services logs where it looked like SQS had a temporary problem.

We cross-checked this time period with other activity on Kerika, and determined that about 7 Kerika boards may have been affected: not in terms of any data loss or corruption on the board itself, but in terms of some changes not being updated in the search index.

Now, 7 boards is a tiny portion of the entire Kerika project database, which numbers in the hundreds of thousands of boards, but we are glad to have spotted the potential for trouble and have re-indexed the data on these particular boards.

If we did our job well, no one will notice.