Visualizing time data in the EC2 cloud
As applications migrate to the cloud to take advantage of scalable resources and lower costs, a great deal of attention has been focused on coordination between cloud “instances” (such as virtual machines) that are dynamically created to work on the same problem. One of the key issues is synchronizing the clocks between these instances. This issue is particularly important for databases, where timestamps are used to reduce contention for locks while still providing data integrity. With the ubiquitous free time synchronization clients, clocks on instances can diverge by multiple milliseconds or more within a short time.
In contrast, a very basic TimeKeeper configuration where one instance runs TimeKeeper Server and provides time to a collection of instances running TimeKeeper Client can easily get clocks locked to within 25 microseconds of each other – good enough for a large number of applications. Although there are ways to get better synchronization from TimeKeeper in the Cloud, this basic configuration is orders of magnitude better than results described in the extensive technical literature on Cloud Time Synchronization despite its simplicity.
Just two graphics from the TimeKeeper Data Analysis and Visualization tools make it clear why TimeKeeper works where traditional technology does not.
The first graphic shows the accuracy of ten reference time feeds from the Amazon Cloud NTP time pool. The pool consists of reference time sources that are connected to the official time coming from the National Institute of Standards and Technology (NIST). There are only four distinct pool servers in EC2, but this test opens a connection to all four, then a second connection to all four, and then a third connection to two of them, for a total of ten streams. You can see at once that the streams are offset from each other by milliseconds and that they exhibit a lot of spiking behavior (the vertical scale is 5 milliseconds per division).
The immediate insight is that two virtual machines getting time from the same pool source can easily be more than 5 milliseconds apart, and a lot more if they do not correctly filter out spikes (which requires far more sophisticated filtering than can be found in free time client software). The graphic also shows one reason why cloud timing is so hard to understand: if you run the same experiment twice, you can get wildly different behavior because of wildly different results from the time source. Even a standard tool like tcpdump will produce different results if the times are so significantly off. On the good side, the time sources are quite steady. Deviations from correct time do not last long, so where there is smart filtering technology, precise frequency can be determined.
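To make the idea of spike filtering concrete, here is a minimal sketch of a rolling-median filter over a stream of measured clock offsets. It is not TimeKeeper's filter, which is far more sophisticated; the function name `filter_spikes`, the window size, and the 1 millisecond threshold are all assumptions chosen for illustration.

```python
from statistics import median


def filter_spikes(offsets_ms, window=5, max_dev_ms=1.0):
    """Replace short-lived spikes with the local median.

    offsets_ms: measured clock offsets in milliseconds, oldest first.
    window: number of most recent samples used for the median (assumed).
    max_dev_ms: deviation from the median that counts as a spike (assumed).
    """
    cleaned = []
    for i, sample in enumerate(offsets_ms):
        # Median of the most recent raw samples, including this one.
        start = max(0, i - window + 1)
        local = median(offsets_ms[start:i + 1])
        # A sample far from the local median is treated as a spike.
        if abs(sample - local) > max_dev_ms:
            cleaned.append(local)
        else:
            cleaned.append(sample)
    return cleaned


# Hypothetical offsets with a single 6 ms spike.
samples = [0.2, 0.3, 0.25, 6.0, 0.3, 0.2]
cleaned = filter_spikes(samples)
```

Because deviations from correct time are short-lived, a median over a handful of recent samples is barely affected by one bad reading, which is why even this simple approach removes the spike in the example.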
But why are the times so variable in the first place? Part of it is the nature of the cloud environment, where network congestion is so variable, and part of it is the nature of virtual machines, which have to share physical hardware. TimeKeeper’s smart filtering and slewing (catching up smoothly when a virtual machine resumes operation) make a big difference for both of these, but there is a third cause which would be invisible without a second graphic.
A TimeMap is obtained by interrogating all visible connected time sources and then tracing each of those back to its own sources. The instance is in the center of this graph, with ten adjacent nodes corresponding to the ten time sources. The first surprise in this graph is that the pool source nodes are actually pretty far away from the ultimate sources (blue edges). Those ultimate sources vary a great deal in quality, ranging from high quality GPS time (from the satellites) to legacy modem based sources (ACTS in the bottom right). Intermediate nodes that have multiple sources themselves can switch back and forth in ways that are, at best, unclear. So the pool sources are nowhere near as simple and reliable as one might assume. That’s why the simple configuration of a TimeKeeper Server instance serving time to a collection of clients produces such dramatic improvement.
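The tracing step behind such a map can be sketched as a breadth-first walk over "gets time from" relationships. This is only an illustration under assumed data: the `upstream` dictionary, the node names, and the function `trace_sources` are hypothetical and do not reflect how TimeKeeper actually collects or stores this information.

```python
from collections import deque


def trace_sources(upstream, start):
    """Return the ultimate sources reachable from start and their hop counts.

    upstream: maps each node to the list of nodes it gets time from.
    A node with no upstream entries is treated as an ultimate source.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for src in upstream.get(node, []):
            if src not in hops:  # first visit is the shortest path in BFS
                hops[src] = hops[node] + 1
                queue.append(src)
    # Keep only the nodes that have no further upstream source.
    return {n: d for n, d in hops.items() if not upstream.get(n)}


# Hypothetical topology: two pool servers fed through intermediate relays.
upstream = {
    "instance": ["pool-a", "pool-b"],
    "pool-a": ["relay-1"],
    "pool-b": ["relay-1", "relay-2"],
    "relay-1": ["gps"],
    "relay-2": ["acts"],
}
ultimate = trace_sources(upstream, "instance")
```

In this made-up topology both ultimate sources sit three hops from the instance, and `pool-b` has two upstream paths of very different quality, which is the kind of structure the TimeMap makes visible.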
For more information please email email@example.com
A PDF version of data visualization in the cloud.