Clock synchronization and Technology

The developers of Google’s Spanner database built an interesting system, but they chose to build their own clock synchronization technology instead of using a more precise off-the-shelf solution. They also chose to make extensive use of the notoriously complex Paxos distributed consensus algorithm, even though they were incorporating clock sync into their system. Whatever the rationale at Google, building a high-speed distributed transaction system on top of clock sync is much less complex if you don’t make those choices.

Off-the-shelf technology that can synchronize clocks to under a microsecond on high-end systems, and can produce tight clock sync even in the decidedly suboptimal environment of cloud virtual machines, is readily available (www.fsmlabs.com). And with clock sync, it is possible to simplify distributed consensus enormously. In fact, it’s hard to understand why Paxos is so widely used. Paxos is a highly complex (ingeniously complex), painful-to-implement attempt to solve a problem that doesn’t exist.

The humorously titled “Paxos Made Simple” paper starts off discussing the problem of distributed consensus in completely asynchronous networks, where timeouts won’t work and clocks are not synchronized at all. In practice, computer networks are not totally asynchronous: they do permit precise clock synchronization (and that technology has gotten much better since Paxos was introduced in 1998). Moreover, Paxos doesn’t even work in a totally asynchronous network. Page 7 of “Paxos Made Simple” points out that the algorithm can fail to converge on a solution; that is, it is not “live”. Lamport, the author, then suggests electing a single coordinator or primary agent, using timeouts! Once you admit that timeouts work, you can replace Paxos with far simpler algorithms, some well known in database systems since the 1970s. Precise clock sync permits much more radical simplifications.
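
To make the point about timeouts and synchronized clocks concrete, here is a minimal sketch of one well-known technique: a time-based lease for a single primary. It is an illustration only, not the algorithm used by Spanner or by any particular product. The names (Lease, holder_may_act, may_take_over) are hypothetical, and the sketch assumes every node's clock is within max_clock_error seconds of true time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    holder: str        # node currently allowed to act as primary
    expires_at: float  # expiry time in seconds on the shared timeline


def holder_may_act(lease: Lease, node: str, local_now: float,
                   max_clock_error: float) -> bool:
    """The holder stops early: its clock may be running slow by up to
    max_clock_error, so it treats the lease as expired that much sooner."""
    return lease.holder == node and local_now + max_clock_error < lease.expires_at


def may_take_over(lease: Lease, local_now: float,
                  max_clock_error: float) -> bool:
    """A challenger waits late: its clock may be running fast by up to
    max_clock_error, so it waits that much longer past the expiry."""
    return local_now - max_clock_error > lease.expires_at
```

Under the stated assumption, the holder only acts when true time is before the expiry and a challenger only takes over when true time is after it, so two nodes never act as primary at once. The dead time around each expiry is about twice max_clock_error, which is why tighter clock sync translates directly into faster failover.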

[ A spanner is British English for what we Americans call a “wrench”. ]