CAP and us


The CAP theorem has (for better or worse) ruled the landscape of distributed-consistency discussions for decades. Like any model, it provides a simplified view of the real world, in this case of network latency and outages and their effects on data consistency in distributed computer systems. Its basic premise is simple: there are nodes which process inputs into outputs, and those results have to be communicated to other nodes over a (lossy) computer network. This seems like a fair approximation of any computer network, so the theorem's conclusions have been widely accepted as practically applicable whenever various consistency models in distributed systems are discussed.

The original paper is a lot more detailed and precise, but for our purposes this (much simplified) view of the model will suffice.

CAP and its applicability in practice

The arguments people had against using CAP this way mostly concerned its strictness and real-life applicability. For example, Martin Kleppmann wrote a detailed blog post years ago asking people to stop using CAP as a shorthand, to little effect. His arguments mostly revolve around the definitions (such as consistency meaning linearizability) and how, in his view, the model is a poor fit for the real-world problems present in distributed computer systems in general.

For what we're setting out to prove here, the most useful quote is one from the original author of the whole concept, Eric Brewer. In his article CAP Twelve Years Later: How the "Rules" Have Changed, he lays out the argument that the simplified view of the theory used in this way is unhelpful. He also goes on to talk about how latency shouldn't be ignored when discussing CAP (a point Kleppmann also makes):

In its classic interpretation, the CAP theorem ignores latency, although in practice, latency and partitions are deeply related. Operationally, the essence of CAP takes place during a timeout, a period when the program must make a fundamental decision, the partition decision:

- cancel the operation and thus decrease availability, or

- proceed with the operation and thus risk inconsistency.

Retrying communication to achieve consistency, for example, via Paxos or a two-phase commit, just delays the decision. At some point the program must make the decision; retrying communication indefinitely is in essence choosing C over A.

Thus, pragmatically, a partition is a time bound on communication. Failing to achieve consistency within the time bound implies a partition and thus a choice between C and A for this operation. These concepts capture the core design issue with regard to latency: are two sides moving forward without communication?

We use Brewer's reasoning to talk about the applicability of CAP via latency, a basic fact of computer networks that everyone can identify with.

We treat network partitions as belonging to the same category as any other timeout, just with a very large timeout value. The case we're discussing covers this eventuality too, so even though a distinction is often made between the two, we don't feel it is necessary here.

We follow Brewer's lead when he says that if you address latency-based timeouts, the CAP model no longer applies to you, since (as he implies) the timeout is the CAP decision. That, however, seems impossible as long as we need to move information between different computers over a latency-prone network.
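To make that partition decision concrete, here is a minimal sketch of it in Python. The transport stub and the prefer_consistency flag are purely illustrative assumptions, not part of any real system discussed above: once the timeout fires, the write either gives up (choosing C over A) or proceeds locally (choosing A over C).

```python
# A minimal sketch of the "partition decision": the timeout is the moment
# a node must pick between consistency (C) and availability (A).
# send_and_wait_for_ack is a hypothetical stub, not a real library call.
import random
import time

def send_and_wait_for_ack(op, timeout_s):
    """Stub transport: simulate an ack that sometimes arrives too late."""
    simulated_latency = random.expovariate(1.0 / 0.05)  # mean 50 ms, long tail
    if simulated_latency > timeout_s:
        raise TimeoutError(f"no ack within {timeout_s}s")
    time.sleep(simulated_latency)
    return "ack"

def write(op, timeout_s=0.2, prefer_consistency=True):
    try:
        send_and_wait_for_ack(op, timeout_s)
        return "committed everywhere"  # communication succeeded within the time bound
    except TimeoutError:
        if prefer_consistency:
            return "rejected"  # cancel the operation: keep C, give up A for this request
        return "committed locally, reconcile later"  # proceed: keep A, risk inconsistency

if __name__ == "__main__":
    print(write({"key": "x", "value": 1}))
```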

Nodes and networks

At its simplest, a network is two computers, A and B, sending packets to each other.

A ---stuff---> B

These computers exist in some connected environment; for this argument, say that A and B live in different datacenters. They communicate via some sequence of technical equipment between them, abstracted here to mean a connection that can be described by a probability distribution over how long a piece of information takes to arrive.

Examples of latency distributions on a wired network. Starting at the upper left, the graphs describe the latency distribution on a local Ethernet link on servers with no, low, medium, and high load respectively.
Steinert, Rebecca & Gillblad, Daniel. (2021). An initial approach to distributed adaptive fault-handling in networked systems.

In the above graphs, one can observe that most information can be expected to arrive within a reasonable, foreseeable time window, but also that tail latency makes planning for every possibility impossible in practice. We simply can't expect to receive all information at the other end in good time.
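The shape of such graphs is easy to reproduce with a toy simulation. The distribution and its parameters below are illustrative assumptions only, not taken from the cited paper, but they show the same pattern: the median is tame while the tail is not.

```python
# Toy tail-latency illustration: most messages arrive quickly, but the worst
# cases dominate any "wait for everything" strategy. Parameters are illustrative.
import random

random.seed(42)
samples_ms = sorted(random.lognormvariate(0.0, 0.8) for _ in range(100_000))

def percentile(sorted_values, p):
    return sorted_values[int(p / 100 * (len(sorted_values) - 1))]

print(f"p50   latency: {percentile(samples_ms, 50):7.2f} ms")
print(f"p99   latency: {percentile(samples_ms, 99):7.2f} ms")
print(f"p99.9 latency: {percentile(samples_ms, 99.9):7.2f} ms")
print(f"max   latency: {samples_ms[-1]:7.2f} ms")
```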

Brewer goes so far as to say in a later paper that Google Spanner is "CAP in practice" (paraphrasing), since Google can provide strong enough guarantees that latency spikes won't occur over their network, which they both built and maintain.

This isn't helpful if you're either planning on running a SQL-compatible database (especially your own) or expecting better availability than the "5 9s" cited in the paper. In Spanner, new information is voted on by the quorum of replicas, which is another way of saying that every piece of information (originating from a single computer) has to be received by at least a quorum of the participating computers. We're not here to analyse the merits or shortcomings of this model, but it's clear that the initial model of moving information between computers over a noisy network applies.
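For context, here is the quorum rule in its most generic form; this is a simplified sketch of majority voting, not Spanner's actual protocol:

```python
# Generic quorum-vote sketch (not Spanner's protocol): a write is considered
# committed once a majority of replicas have acknowledged it.
def is_committed(acks_received: int, total_replicas: int) -> bool:
    return acks_received > total_replicas // 2

print(is_committed(3, 5))  # True: 3 of 5 replicas acknowledged
print(is_committed(2, 5))  # False: still waiting (or timing out) on a majority
```

However the votes are counted, every write still has to cross the network to other machines, which is exactly the situation our starting model describes.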

What lies beyond

Another way to look at the model we started with is that each piece of data originates at a single, identifiable node (computer). Each node emits information to other nodes, and a path must be chosen on the network for this information to arrive at its destination.

But what if the nodes on the network are in multiple places at the same time? What if each node can have multiple replicas living on the same network, emitting the same information (when looked at in sequence) and sending it to multiple nodes on the network? For this argument, we define a lost message as (statistically speaking) the worst offender in a group of 100 messages, i.e. a 1-in-100 event.

Say we have 3 replicas of a node which (for now, magically) have the same set of inputs. They each send the same message to the same 3 nodes (also replicated), but over different arrays of hardware. The chance of a message being lost entirely in this case would be 1 in 100⁹, i.e. 1 : 1 000 000 000 000 000 000. If the original 1:100 message-loss event occurred once each second, we can now expect a complete loss roughly once every ~320 million years. Adding more replicas increases this number exponentially.
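The arithmetic behind those numbers is easy to check; the sketch below assumes the nine sender-receiver paths fail independently, each losing a given message with the 1-in-100 probability defined above:

```python
# Back-of-the-envelope check for the replicated-path argument.
# Assumes 9 independent paths (3 sender replicas x 3 receiver replicas),
# each losing a given message with probability 1/100.
p_loss_single_path = 1 / 100
paths = 3 * 3

p_total_loss = p_loss_single_path ** paths  # all 9 copies of a message lost
print(f"odds of losing a message entirely: 1 in {1 / p_total_loss:.0e}")

# If the 1-in-100 loss event fires once per second (~100 messages per second),
# a complete loss is expected once every 1e18 / 100 = 1e16 seconds.
seconds_between_total_losses = (1 / p_total_loss) / 100
years = seconds_between_total_losses / (365.25 * 24 * 3600)
print(f"expected interval between complete losses: ~{years / 1e6:.0f} million years")
```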

The omission here is how these nodes can share the same ordered set of inputs. We simply say that the 3 targeted nodes are actually members of a consensus group, which provides the inputs for their worker nodes. That way each input can be received by separate computers, and all information can be sent by multiple computers at the same time (as long as messages are idempotent, so duplicates can be filtered out).
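Here is a sketch of the receiving side, under the assumption that every message carries a stable, unique ID shared by all sender replicas, so duplicates can be filtered out:

```python
# Idempotent-receiver sketch: each replicated sender transmits the same
# (message_id, payload) stream, so the receiver keeps only the first copy.
# The stable message_id is an assumption of this sketch.
class IdempotentReceiver:
    def __init__(self):
        self.seen_ids = set()
        self.log = []  # ordered inputs handed on to the worker node

    def receive(self, message_id, payload):
        if message_id in self.seen_ids:
            return False  # duplicate from another replica, drop it
        self.seen_ids.add(message_id)
        self.log.append(payload)
        return True

rx = IdempotentReceiver()
for sender in range(3):  # three replicas sending the same sequence
    for i, payload in enumerate(["a", "b", "c"]):
        rx.receive(message_id=i, payload=payload)
print(rx.log)  # ['a', 'b', 'c'] despite nine deliveries
```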

This approach allows a much more robust, scalable resilience model than Spanner's "5 9s" and can be run over any arbitrary computer network.

