Tag Archives: AWS

About our use of Amazon Web Services.

We will start to do IP blocking

Regrettably, we will start doing IP blocking to stop persistent spammers from setting up Kerika accounts.

We have seen a consistent pattern of misuse that goes like this:

  • Someone signs up with a sina.com email address.  Sina is one of the largest ISPs in China, but we don’t have any users in China for the simple reason that most of Google’s services are blocked by China’s Great Firewall, and Kerika has a tight integration with Google’s G Suite.
  • The spammer isn’t actually located in China; they are in Manila (Philippines) and come from IP addresses like 203.177.13.60
  • These spammers send out hundreds, sometimes thousands, of invitations for users from the qq.com domain to join their (spurious) Kerika team.
  • These recipients are all users of Tencent’s QQ messaging system, based in China. Again, none of them would be actual or potential Kerika users, since the recipients are all located in China.

The user impact of this spamming was relatively small: almost no one with a qq.com email address would reply to these invitations, but the conduct was a very clear misuse of Kerika and harmful to our reputation, quality and brand.

(Among other things, these spurious invitations would pile up in the thousands.)

We have decided, therefore, to start blocking IP addresses using Amazon’s VPC service (since we use AWS extensively on our back end).

This is a brute force method, and not ideal, but we were starting to get really annoyed with these folks.  We hope this doesn’t impact any of our real users in the Philippines; if you are affected, please let us know!
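As a rough illustration of what an IP block does (this is not our actual AWS configuration, and the names `BLOCKED_NETWORKS` and `is_blocked` are made up for the example), a deny rule boils down to checking whether a client address falls inside a blocked range:

```python
import ipaddress

# Hypothetical blocklist: the single address mentioned above, written as a /32 range.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.177.13.60/32")]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked CIDR range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


print(is_blocked("203.177.13.60"))  # True
print(is_blocked("198.51.100.7"))   # False (a documentation-only example address)
```

Blocking a wider range (say a /24) catches more of a spammer's addresses, but it also raises the odds of catching a legitimate user, which is why we call this a brute force method.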

Wrestling with SSL certificates

We had previously been using SSL certificates for our website (kerika.com) and this blog (which is on a subdomain: blog.kerika.com) that we got from GoDaddy, but we have moved away from them.

What pushed us away was their aggressive approach to billing customers: they automatically renewed our SSL certificates just 9 months into a 12-month contract, which we found unacceptable.  Talking to their customer service people was an unhappy experience as well, so we decided not to do any more business with GoDaddy.

Now we are using an SSL certificate from Amazon for our website and app (kerika.com): Amazon actually provides free SSL certificates to sites hosted on Amazon Web Services, and it was easy and simple to set up.

However, AWS doesn’t provide wildcard SSL certificates so we couldn’t handle our blog as well, particularly as our blog isn’t hosted at AWS. Instead we got an SSL certificate for the blog from RapidSSL, which is reasonably priced.

So far, so good.

We had a problem today

Kerika was unavailable for about 15-20 minutes this morning; our apologies to everyone who was affected.

It’s back now, and we are still investigating the root cause.  All we know right now is that our Amazon Web Services (AWS) Load Balancer, which acts as the immediate front-end to every browser that tries to connect to Kerika, reported a problem.

UPDATE:

It is starting to look like Amazon Web Services was having an internal networking problem; our server’s error logs included entries like

[97207ms] ago, timed out [81830ms] ago, action [cluster:monitor/nodes/liveness], node [{#transport#-1}{thl-D8yeRmGg9N_4GyNNUQ}{elasticsearch}{172.18.0.12:9300}], id [133181]
06-07-2017 16:44:53.077 [ConsumerPrefetchThread-1] ERROR   com.amazon.sqs.javamessaging.AmazonSQSMessagingClientWrapper - AmazonClientException: receiveMessage.
com.amazonaws.AmazonClientException: Unable to execute HTTP request: sqs.us-east-1.amazonaws.com: System error

Restoring service was unexpectedly hard: we couldn’t reboot the AWS EC2 instances from the AWS Console, and couldn’t even log in to the machine using ssh.

Eventually we had to power down our EC2 instances and restart them from scratch.  Not good.

Amazon burped a little on the weekend

We use a number of Amazon Web Services, including one called Simple Queue Service (SQS), which Kerika uses to handle communications between our main project database server and a separate server that handles the Search function.

  • As with all search engines, Kerika’s Solr engine does a full indexing of the database only once: when the database is rebuilt for any reason (which happens very rarely), and after that it does incremental indexing which means that it only looks at changes made to individual boards, cards, and canvases.
  • Using a queue helps us manage the load of traffic going to the search engine server: in the unlikely event that a lot of people make a lot of updates to their Kerika boards at the same time, Solr won’t get overwhelmed with a sudden burst of new indexing.
  • There are lots of ways to implement queues in software (in fact, queuing theory is a standard topic in computer science programs), and at this point most apps, like Kerika, prefer not to re-invent that particular wheel: instead, it is more cost-effective to use some standard queuing facility that’s available as part of the underlying platform.
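To make the load-smoothing idea concrete, here is a minimal sketch that uses Python's in-memory `queue.Queue` as a stand-in for a hosted queue service. The function names, the batch size of 10, and the board IDs are all assumptions for illustration; this is not our production code.

```python
import queue

# Stand-in for the hosted queue: a simple in-memory FIFO.
index_queue = queue.Queue()


def record_change(board_id):
    """Called whenever a board changes: enqueue it instead of indexing right away."""
    index_queue.put(board_id)


def drain_batch(max_items):
    """Take at most max_items pending changes off the queue for the indexer."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(index_queue.get_nowait())
        except queue.Empty:
            break
    return batch


# A sudden burst of 25 changes arrives at once...
for i in range(25):
    record_change(f"board-{i}")

# ...but the indexer only ever sees bounded batches.
batch_sizes = []
while True:
    batch = drain_batch(max_items=10)
    if not batch:
        break
    batch_sizes.append(len(batch))

print(batch_sizes)  # [10, 10, 5]
```

The burst is absorbed by the queue, and the search server works through it at its own pace.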

AWS works very well in our opinion — it has very high reliability across most of its services — but like all software, it isn’t entirely infallible.

Over the weekend we observed a small handful of errors in our service logs where it looked like SQS had a temporary problem.

We cross-checked this time period with other activity on Kerika, and determined that about 7 Kerika boards may have been affected: not in terms of any data loss or corruption on the board itself, but in terms of some changes not being updated in the search index.

Now, 7 boards is a tiny portion of the entire Kerika project database, which numbers in the hundreds of thousands of boards, but we are glad to have spotted the potential for trouble and have re-indexed the data on these particular boards.
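The cross-check itself is conceptually simple. The sketch below assumes a hypothetical list of (board, timestamp) change records and a made-up error window; it only illustrates the idea of picking out the boards that changed while the queue was misbehaving.

```python
from datetime import datetime


def boards_needing_reindex(changes, window_start, window_end):
    """Return the boards that had at least one change inside the error window."""
    affected = {
        board_id
        for board_id, changed_at in changes
        if window_start <= changed_at <= window_end
    }
    return sorted(affected)


# Hypothetical activity records and error window, for illustration only.
changes = [
    ("board-a", datetime(2000, 1, 1, 1, 30)),
    ("board-b", datetime(2000, 1, 1, 2, 10)),
    ("board-b", datetime(2000, 1, 1, 2, 20)),
    ("board-c", datetime(2000, 1, 1, 2, 45)),
    ("board-d", datetime(2000, 1, 1, 3, 30)),
]
window_start = datetime(2000, 1, 1, 2, 0)
window_end = datetime(2000, 1, 1, 3, 0)

to_reindex = boards_needing_reindex(changes, window_start, window_end)
print(to_reindex)  # ['board-b', 'board-c']
```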

If we did our job well, no one will notice.

Lean Government & Holacracy: the Video

Here’s an hour-long video of Michael DeAngelo’s presentation on Lean Government & Holacracy in Washington State:

Highlights from his talk:

On the Office of the CIO

  • Roughly 4,000 IT professionals in Washington State.
  • About 80 agencies run their own IT teams.
  • Office of CIO sets strategy and provides oversight.
  • Transform government through technology and culture.
  • Created the small business hub: business.wa.gov
    • Run as a Scrum project, with 1-week Sprints.
    • Adopted customer-driven design.
    • Successful example of using Lean Startup methodology.
  • Driving the use of Software-as-a-Service (SaaS).
  • Practice an “open office” style.

On Lean Government

  • Started in Washington State with Governor Christine Gregoire.
  • All agencies are required to have a Lean focus.
  • Challenge: how to be an “employer of choice” for IT professionals, given stiff competition from Amazon, Microsoft, etc.
  • Several agencies have an active Agile/Scrum practice, but this is still in pockets within state government.
    • Office of the CIO
    • Department of Licensing
    • Department of Labor & Industries
  • Impediments to adopting Agile:
    • Having the right tools
    • Having the right sort of contracts
  • Agencies adopting Agile are largely implementing this in a software development context.
  • Developing the Agile QA Scorecard.
  • Developing Agile Procurement for more flexible contracts with vendors.

On Holacracy

  • Goal: empower employees to organize themselves.
  • There are no managers.
  • Washington State is the first government anywhere to practice holacracy.
  • Washington State is also the first organization anywhere with a represented workforce (i.e. with employee unions) to practice holacracy.
  • Doing an A/B test of holacracy vs. hierarchical organization, in cooperation with Harvard Business School.
  • Hypothesis of A/B test: self-organizing teams will produce better employee outcomes.
  • Measure for a year and see what the results are.
  • Looking for three categories of metrics:
    • Are employees more engaged, with better retention?
    • Are there better customer outcomes, where “customers” are other agencies?
    • To what extent is an organization practicing holacracy more able to achieve larger organizational objectives?
  • Instead of managers, there are roles that are assigned certain accountabilities.
  • Holacracy and Agile have things in common:
    • Bias towards action
    • Be iterative
    • Don’t make up all these demons that might show up; see if they actually appear
  • Holacracy and Agile are different:
    • Holacracy isn’t about getting buy-in on your ideas from the team.
  • The Scrum roles, e.g. Product Owner, Scrum Master, can be added as holacracy roles in a particular circle.

Quotes

“The reality is, a lot of the cloud providers can provide better security solutions than we can afford internally.”

“For us, cloud is actually one of the strategies for increasing security for the state.”

“The interesting question is, how do you do oversight and QA — really project management QA, not just traditional software QA — in an agile context?”

“One of the metrics for Agile QA: is the business engaged?” (Not just steering committees like before, but do we really have engaged product owners.)

“The contracts and procurement shop in state government practice what they call XP — Extreme Procurement”

“Washington is the only state to practice Agile Procurement and Agile Contracting”

“Downside of holacracy: everyone loves to tell me that I am not the boss of them”

“No government has ever practiced holacracy before.”

“Holacracy has never been practiced with a represented workforce before. (One with employee unions.)”

“I have been practicing holacracy for a few months, and I feel like I have a different set of lenses through which I look at work.”

“When I talk to people who are not practicing holacracy, I see evil spirits around them, like bureaucracy, office politics, inefficient meetings…”

“We develop these habits to compensate for the deficiencies of a hierarchical organization, instead of trying to change it, and this is after thousands of years of evolution.”

“The team has to want it: you need opt-in for holacracy to work.”

“Imagine trying to play soccer with a hierarchical organization, where the team is run by managers who are responsible for different sections of the field.”

“Because I am the manager, you need to always pass the ball to me. Ridiculous as that seems, that’s how hierarchical organizations work.”

“90% of my time is spent on crap that runs government work, and that’s because of the authority of my position.”

“As a manager I don’t have a passion for a lot of things, but other people might, so I want to give them the authority to take them on.”

“Healthy habits in a dysfunctional system become unhealthy habits in a functional system.”

“In holacracy, you quickly learn what makes for a valid objection.”

“The type of people who would not respond well to holacracy are managers that derive their self-worth on span of control.”

“There’s a category of employees who have no interest in being self-directed: they just want to be told what to do.”

Security within a Virtual Private Network

All of Kerika’s servers, which run on Amazon Web Services (AWS), operate within a Virtual Private Network (VPN), which AWS calls a Virtual Private Cloud (VPC), so they can be configured to listen only on private addresses, e.g. addresses like 10.0.0.1, etc.

This means that they cannot be accessed directly from the Internet: instead, all connections are routed through an Elastic Load Balancer (ELB), which is a special kind of AWS server that handles connections from all users.

The ELB is very secure: it implements SSL/TLS, and when vulnerabilities like Heartbleed and POODLE are discovered, it is relatively easy for us, with Amazon’s help, to quickly ensure that the ELBs are patched.  Patching the ELBs quickly gives us breathing room to patch all the other servers involved, particularly if vulnerabilities are found at the platform level itself.

But running a VPN isn’t enough: while it blocks people outside the Kerika server environment from directly accessing our database, there is still at least a theoretical possibility that an attacker can find his way inside the VPN, and then try to connect to our database server on a local port.

To avoid this scenario, we use SSL within the VPN as well, so that the connections from the load balancers to the database servers are also authenticated and encrypted.
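For readers who want to see what “authenticated and encrypted” looks like from a client’s point of view, here is a minimal sketch using Python’s standard `ssl` module. It is not our server code (our servers run on Java); the helper names and the optional `ca_file` parameter are assumptions for illustration.

```python
import ipaddress
import ssl


def make_internal_tls_context(ca_file=None):
    """Build a client-side context that both encrypts and authenticates.

    ca_file is a hypothetical path to an internal CA bundle; with None,
    the system's default trust store is used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    # create_default_context already enables both checks; they are set
    # explicitly here so the intent is visible.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    # Refuse the old protocol versions hit by attacks such as POODLE.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def is_private_address(host):
    """True for addresses in private ranges such as 10.0.0.0/8."""
    return ipaddress.ip_address(host).is_private


context = make_internal_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(is_private_address("10.0.0.1"))            # True
```

A client would then wrap its socket with `context.wrap_socket(sock, server_hostname=...)`, so a connection to a server that cannot present a trusted certificate fails even though the attacker is already “inside”.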

Amazon Fire tablet experience: surprisingly good

We were trying out Kerika using Amazon’s Silk browser on one of their Fire (color) tablets, and found that Kerika worked surprisingly well.

On standard (un-forked) Android tablets, the Chrome browser works better than the standard browser that comes with all tablets, mainly because Google has been improving Chrome with a lot more enthusiasm than they have been improving “stock Android”.

So, we weren’t sure how well the Silk browser would behave with Kerika, given that Silk is a relatively old fork of the standard Android browser.

It turns out that you can use Kerika on Amazon’s Fire tablets quite well: just open the Silk browser, go to kerika.com, and log in like you would on a laptop or desktop. Just let your finger do the dragging-and-dropping…

Using Kerika, but not using English

Right now, the Kerika user interface is entirely in English, but we have users worldwide and many of them use Kerika with other languages, e.g. Greek, Japanese, Korean, etc.

When you export data from a Task Board or Scrum Board that includes non-English characters, the foreign characters are actually preserved correctly as part of the exported data, but if you need to then import data into some other program, like Microsoft Word or Excel, you need to make sure the other program correctly interprets the text as being in UTF-8 format.

WHY UTF-8?

UTF-8 is an encoding standard that can handle all possible characters, so it works with languages like Greek, Japanese, etc. which don’t use the Roman alphabet.

For a long time now, UTF-8 has been the most widely used global standard that works across all languages, because of its inherent flexibility in handling different character sets.

When you do an export of data from a Kerika Task Board or Scrum Board, we create the CSV files in UTF-8 format, and include what’s called the Byte Order Mark (BOM) in the first three octets of the exported file (the bytes EF BB BF).

Including a BOM is the best way to let all kinds of third-party programs know that the file is encoded in UTF-8: it’s a standard way of saying to other programs, “Hey, guys! This text may contain non-English characters.”
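If you want to see what such a file looks like at the byte level, here is a small sketch in Python (not our actual export code; the sample rows are made up). Python’s `utf-8-sig` codec writes the BOM when encoding and strips it when decoding:

```python
import codecs
import csv
import io

# Hypothetical export rows with non-English text.
rows = [["Card", "Status"], ["Καλημέρα", "Done"], ["こんにちは", "In progress"]]

# Build the CSV text, then encode it as UTF-8 with a leading BOM.
text_buffer = io.StringIO()
csv.writer(text_buffer).writerows(rows)
data = text_buffer.getvalue().encode("utf-8-sig")

print(data[:3].hex())                    # efbbbf
print(data.startswith(codecs.BOM_UTF8))  # True

# A BOM-aware reader strips the mark and recovers the original rows.
decoded = data.decode("utf-8-sig")
round_trip = list(csv.reader(io.StringIO(decoded, newline="")))
print(round_trip == rows)                # True
```

A program that ignores or mishandles those first three bytes is the kind of reader that ends up showing garbled characters.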

And for the most part, including a BOM works just fine with CSV exports from Kerika: Google Spreadsheets interprets that correctly, Microsoft Excel on Windows interprets that correctly, but not…

EXCEL ON MACS

Many versions of Excel for Macs, going back to Office 2007 at least, have a bug that doesn’t correctly process the BOM character. Why this bug persisted for so long is a mystery, but there we are…

The effect of this bug is that an exported file from Kerika, containing non-English characters, will not display correctly inside Excel on Mac, although it will display correctly with other Mac programs, like the simple TextEdit.

There’s not much we can do about this bug, unfortunately.

THE TECHNICAL BACKGROUND TO ALL THIS:

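BOMs are used to signify what’s called the “endianness” of the file. (In UTF-8 itself the byte order is fixed, so there the BOM serves mainly as a signature that identifies the encoding.)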

Endianness is a really ancient concept: in fact, most software developers who learned programming in the last couple of decades have no idea what this is about.  You can learn about endianness from Wikipedia; the short summary is that when 8-bit bytes are combined to make words, e.g. for 32-bit or 64-bit microprocessors, different manufacturers had adopted one of two conventions for organizing these bytes.

For Big-Endian systems the most significant byte was stored at the lowest address; for Little-Endian systems the most significant byte was stored at the highest address.

(If you have a number like 12345, for example, the “1” is the most significant digit and the “5” is the least significant. In a Big-Endian system this would be stored as “1 2 3 4 5”; in a Little-Endian system it would be stored as “5 4 3 2 1”. So, when you get presented with any number, you really need to know which of the two systems you are using, because the interpretation of the same digits would be wildly different.)
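The same idea can be shown with real bytes. This short Python sketch (the value `0x12345678` is just an arbitrary example) lays out one 32-bit number both ways, and shows what happens if the reader guesses the wrong convention:

```python
value = 0x12345678

# The same 32-bit number laid out under the two conventions.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Reading big-endian bytes as if they were little-endian gives a different number.
misread = int.from_bytes(big, byteorder="little")
print(hex(misread))  # 0x78563412
```

Note that it is the order of whole bytes that flips, not the order of individual digits within each byte.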

(About a dozen years ago Joel Spolsky, former PM for Excel, wrote a great article on the origins and use of BOM, for those who want to learn more about the technical details.)

Why does this affect Kerika at all? Because when you do an export of cards from Kerika, the export job is run on a virtual machine running on Amazon Web Services.

We have no idea what kind of physical hardware is being used by AWS, and we are not supposed to care either: we shouldn’t have to worry about whether we are generating the CSV file using a little- or big-endian machine, and whether the user is going to open that file with a little- or big-endian machine.

That’s the whole point of using UTF-8 and a BOM: to make it possible for files to be more universally shared.

Getting rid of a pesky “Mixed Content” warning

When you first use Kerika, your browser has a reassuring sign that your connection to our servers is being encrypted:

No warning when you first use Kerika

But as soon as you open a card that contains any attachments, e.g. files stored in your Box account if you are using Kerika+Box, this reassurance would disappear, and instead you would see a warning about “Mixed Content”, which basically means that some of the data shown on your Kerika page was coming from a source that wasn’t using HTTPS.

Why there is a mixed content warning

This was because of a small bug in how we were dealing with the thumbnails we got for files stored in your Google or Box account: for faster performance we were caching these on our own Amazon S3 cloud storage (so we wouldn’t have to keep getting them from Google/Box every time you open the same card).

It turns out that we weren’t fetching the thumbnails from S3 using HTTPS, which meant that as soon as you switched to the Attachment view of a card, your browser’s address bar would show the “mixed content” warning.
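In spirit, the fix amounts to making sure every cached thumbnail URL uses the `https` scheme. Here is a minimal sketch of that idea; the bucket name, path, and function names are hypothetical, and this is not our actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Rewrite an http:// URL to https://, leaving everything else untouched."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def is_mixed_content(page_url, resource_url):
    """A secure page that loads an insecure resource triggers the warning."""
    return (
        urlsplit(page_url).scheme == "https"
        and urlsplit(resource_url).scheme == "http"
    )


# Hypothetical thumbnail location, for illustration only.
thumbnail = "http://example-bucket.s3.amazonaws.com/thumbs/card-1.png"

print(is_mixed_content("https://kerika.com/", thumbnail))               # True
print(force_https(thumbnail))
print(is_mixed_content("https://kerika.com/", force_https(thumbnail)))  # False
```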

There was no real vulnerability resulting from this, but it did interfere with the user experience for that minority of users who like to keep a sharp eye on their browser’s address bar, so we have fixed that with our latest release.

Now you should always have the warm reassurance of seeing the green secure site symbol on your browser when you open a card!

Upgrading our server infrastructure

We had problems occasionally with our servers running reliably, and if you were unlucky you may have noticed this.

Well, it took a very, very long time but we have finally figured out what’s happening.

It turns out that the garbage collection function on the Java Virtual Machine that runs our server software (all on a Linux virtual machine running on Amazon Web Service) was having problems: most of the time the garbage collection runs just fine on an incremental basis, taking up only a fraction of a second of CPU time, and periodically the JVM would do a full garbage collection as well.

The problem we were facing is that sometimes this full garbage collection would fail and immediately restart.

In the worst-case scenario, the garbage collection would start, fail and restart over and over again, until the server basically thrashed.  And each time the full garbage collection ran, it took 5-7 seconds of CPU time (which is a really long time, btw).
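To give a feel for why this is so damaging, here is a back-of-the-envelope sketch: it computes what fraction of a time window is eaten by full collections. The function names, the sample numbers, and the 50% threshold are all assumptions for illustration; this is not how we actually monitor the JVM.

```python
def gc_overhead(pause_seconds, window_seconds):
    """Fraction of a time window spent inside full garbage collections."""
    return sum(pause_seconds) / window_seconds


def is_thrashing(pause_seconds, window_seconds, threshold=0.5):
    """Treat the server as thrashing when GC eats most of the window.

    The 50% threshold is an arbitrary assumption, not a JVM setting.
    """
    return gc_overhead(pause_seconds, window_seconds) >= threshold


# Healthy: a single 6-second full collection in ten minutes.
healthy = [6.0]
# Unhealthy: ten back-to-back 6-second collections in one minute.
unhealthy = [6.0] * 10

print(is_thrashing(healthy, window_seconds=600))   # False
print(is_thrashing(unhealthy, window_seconds=60))  # True
```

With collections of 5-7 seconds each restarting back to back, nearly all of the server’s time goes to garbage collection and almost none to serving users.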

We are trying to understand the best long-term solution for this, but in the short term we can mitigate the problem in a variety of ways, including upgrading our server virtual machines to have more RAM.

One reason it took so long to debug is that we were chasing a red herring: we had noticed network spikes happening frequently, and we wrongly assumed these were correlated to the server’s CPU load spiking, but this turned out to be coincidental rather than causal.

Sorry about all this.