Solved Problems in Clock Synchronization

When we started working on time (clock) synchronization in 2006/2007, the people pushing the envelope in applications were exposing requirements for levels of accuracy that were unprecedented outside of telecommunications. The high-frequency traders and automated traders on Wall Street were talking about microseconds and even nanoseconds, while the general state of the art was measured in milliseconds or seconds. The first versions of TimeKeeper blew the doors off on accuracy. But we soon came to the conclusion that high accuracy was the easy part. Fault tolerance and automatic alerting were critical and difficult. It turned out that silent and often spectacular failures in time synchronization were not at all unusual in the field.

Time synchronization is fragile, and neither the standard protocols (NTP or PTP) nor the ubiquitous free software implementations (NTPd, PTPd and many variants) even began to address reliability seriously. Part of the issue is that fault tolerance and reliability in enterprise environments are not the same problem as in the domains for which NTP and PTP were developed. I remember giving a talk on how TimeKeeper fault tolerance works at a technical conference a couple of years back and being berated by one of the audience members for not using the “best master” part of the PTPv2 protocol which, he insisted, solved the problem. But “best master” depends on the time source telling the client how good the time is at the client, something that the source has no way to guess without a lot of information not carried in the basic protocol.

TimeKeeper fault tolerance is based on a mix of completely novel approaches and is built into the architecture of the system. One of the techniques we use is to handle multiple sources by maintaining an ideal, or smart, clock for each of them independently. The smart clock is a software clock that takes inputs from a reference time source (over the network, or via GPS, IRIG, or some other clock signal) and then tries to compute actual time using one or more oscillators and a database of information about the reference source, the oscillators, and the general health of the device. We use all that information to decide when to fail over from one source to another, and we are totally agnostic about the protocols used by the source. We extract accurate time from any reference source: GPS, PTP, NTP. Over the years of supporting the product in the field, we have had time to refine the algorithms in light of experience. It works. And it is somewhat amusing to see how developers of alternative time synchronization packages are scrambling to glue fault tolerance onto their products and how standards bodies are suddenly trying to come up with mechanisms for overcoming the fragility of their protocols.
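
To make the per-source idea concrete, here is a minimal sketch in Python. It is not TimeKeeper's implementation, which is not described in enough detail here to reproduce. The class names (SmartClock, SourceSelector), the staleness and jitter thresholds, and the median estimate are all assumptions chosen for illustration. The point it tries to show is that the client judges each source from its own observations, whatever protocol the samples came from, and fails over only when its current source stops looking healthy.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import List, Optional


@dataclass
class SmartClock:
    """Hypothetical per-source model built from offsets measured at the client."""
    name: str                   # label only; the protocol (GPS, PTP, NTP) is irrelevant here
    max_age: float = 5.0        # assumed: seconds without a sample before the source is stale
    max_jitter: float = 50e-6   # assumed: tolerated spread of recent offsets, in seconds
    window: int = 8             # assumed: number of recent samples kept
    offsets: List[float] = field(default_factory=list)
    last_update: Optional[float] = None

    def update(self, local_time: float, offset: float) -> None:
        """Record one measured offset (reference time minus local time)."""
        self.offsets.append(offset)
        del self.offsets[:-self.window]   # keep only the newest `window` samples
        self.last_update = local_time

    def jitter(self) -> float:
        return max(self.offsets) - min(self.offsets)

    def healthy(self, now: float) -> bool:
        """Health is decided from client-side evidence, not from what the source claims."""
        if self.last_update is None or now - self.last_update > self.max_age:
            return False
        return len(self.offsets) >= 3 and self.jitter() <= self.max_jitter

    def estimate(self) -> float:
        """Robust offset estimate; the median ignores a single wild sample."""
        return median(self.offsets)


class SourceSelector:
    """Stays on the active source while it is healthy, otherwise fails over."""

    def __init__(self, clocks: List[SmartClock]) -> None:
        self.clocks = clocks                     # priority order, most preferred first
        self.active: Optional[SmartClock] = None

    def select(self, now: float) -> Optional[SmartClock]:
        if self.active is not None and self.active.healthy(now):
            return self.active                   # avoid flapping between sources
        for clock in self.clocks:
            if clock.healthy(now):
                self.active = clock
                return clock
        self.active = None                       # nothing trustworthy: time to raise an alert
        return None


if __name__ == "__main__":
    gps, ptp = SmartClock("gps"), SmartClock("ptp")
    selector = SourceSelector([gps, ptp])
    for t in range(4):
        gps.update(float(t), 1e-6 * t)
        ptp.update(float(t), 2e-6)
    print(selector.select(3.0).name)
```

A real system would track far more than this (oscillator behavior, path delay, the history of each source), but even the toy version shows why the decision has to be made at the client: only the client can see how stale or noisy a source looks from where it sits.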