How to tell if a file has been updated

The first version of Kerika was written as a peer-to-peer (p2p) application, so one challenge we faced was detecting when a user had changed a file that was shared as part of a project, so that we could send the latest version to everyone else on the team.

Our first attempt at a solution was to simply examine the Last Modified time for files. However, this proved to be very unreliable for a rather odd reason: whenever you open a spreadsheet using Microsoft Excel, it automatically updates the Last Modified time to the current time, even before you have made any changes.

And when you close Excel without having made any changes, it resets the Last Modified time back to its original value. So, whenever you opened an Excel file just to view it, we would erroneously identify it as an updated file.
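
As a rough illustration of this kind of check (a sketch in Python, not Kerika's actual code), the comparison boils down to remembering the timestamp from the last scan and comparing it with the current one. The helper name `mtime_changed` is hypothetical.

```python
import os

def mtime_changed(path, last_seen_mtime_ns):
    """Compare the file's current Last Modified time with the one we recorded.

    Returns (changed, current_mtime_ns). Hypothetical helper for illustration.
    """
    current = os.stat(path).st_mtime_ns
    return current != last_seen_mtime_ns, current
```

Any program that touches the timestamp without changing the content, as described above for Excel, makes a check like this report a false positive.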

We then tried looking at the size of files, to see if it had changed since we last examined them. We knew, of course, that this would be error-prone in its own way: if you change some text within a file such that it contains the same number of characters as before, the overall size of that file would not change.

But this approach failed for another reason altogether: Microsoft Word allocates disk space in chunks, rather than in exact amounts. This means that any edit to a Word file that does not require Word to grab another chunk, or give up a chunk, would never be reflected in the reported size of the file.
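
The same-length weakness of a size check is easy to reproduce. The sketch below is in Python and uses a hypothetical helper, `size_changed`, plus a throwaway temporary file; it is only meant to show the failure mode, not how Kerika implemented the check.

```python
import os
import tempfile

def size_changed(path, last_seen_size):
    """Compare the file's current size in bytes with the one we recorded.

    Returns (changed, current_size). Hypothetical helper for illustration.
    """
    current = os.path.getsize(path)
    return current != last_seen_size, current

# Demonstration: overwrite the content with different text of the same length.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")
    with open(path, "w", encoding="ascii") as f:
        f.write("budget: 100")
    _, recorded = size_changed(path, -1)

    with open(path, "w", encoding="ascii") as f:
        f.write("budget: 900")  # different content, same number of bytes
    missed_edit, _ = size_changed(path, recorded)
    # missed_edit is False here: the edit goes undetected.
```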

Eventually, we decided to take the MD5 hash of each file, which is a more reliable way of detecting whether a file has been modified. We were concerned about how much CPU overhead this would add, but it proved not to be a problem after all.
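
For readers who want to try the hashing approach, here is a minimal sketch in Python using the standard `hashlib` module. The function names and the 64 KB chunk size are assumptions made for this example, not details of Kerika's implementation. Reading the file in chunks keeps memory use flat even for large files.

```python
import hashlib

def md5_of_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def content_changed(path, last_seen_hash):
    """Compare the file's current MD5 digest with the one we recorded.

    Returns (changed, current_hash). Hypothetical helper for illustration.
    """
    current = md5_of_file(path)
    return current != last_seen_hash, current
```

Because the digest depends only on the bytes in the file, it is unaffected by timestamp changes and it catches same-length edits. MD5 is adequate for spotting ordinary edits like these, but it is not a safeguard against deliberate tampering.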