We had a problem today

Kerika was unavailable for about 15-20 minutes this morning; our apologies to everyone who was affected.

It’s back now, and we are still investigating the root cause. All we know right now is that our Amazon Web Services (AWS) Load Balancer, which acts as the immediate front end for every browser that tries to connect to Kerika, reported a problem.


It is starting to look like AWS was having an internal networking problem: our server’s error logs included entries like these:

[97207ms] ago, timed out [81830ms] ago, action [cluster:monitor/nodes/liveness], node [{#transport#-1}{thl-D8yeRmGg9N_4GyNNUQ}{elasticsearch}{}], id [133181]
06-07-2017 16:44:53.077 [ConsumerPrefetchThread-1] ERROR   com.amazon.sqs.javamessaging.AmazonSQSMessagingClientWrapper - AmazonClientException: receiveMessage.
com.amazonaws.AmazonClientException: Unable to execute HTTP request: sqs.us-east-1.amazonaws.com: System error

Restoring service was unexpectedly hard: we couldn’t reboot the AWS EC2 instances from the AWS Console, and couldn’t even log in to the machine using SSH.

Eventually we had to power down our EC2 instances and restart them from scratch.  Not good.