Sunday, September 07, 2008

A CAP Solution (Proving Brewer Wrong)

One of the latest challenges in computer science seems to be the CAP theorem. It addresses a perceived impossibility in building large-scale, clustered (web) service architectures. The fact that it has (supposedly) been proven true makes what I am going to write here all the more unlikely. Still, read on, because I will show that I am right and CAP is not an impossibility after all... While the impossibility proof of CAP is mathematically correct, it is based on assumptions that are too strict. By relaxing these assumptions, I found the solution presented here.

What is CAP?

The CAP theorem (short for consistency, availability, partition tolerance) essentially states that you cannot have a clustered system that supports all of the following three qualities:


Consistency

Consistency is a quality meaning (informally speaking) that reads and writes happen correctly. In other words, the overall effect of executing thousands or millions of transactions concurrently is the same as if they had been executed one at a time. Usually, this is ensured with the help of a transaction manager of some sort.

Availability

Availability essentially means that every operation (that makes it to a non-failing node) eventually returns a result.

Partition tolerance

This quality refers to the ability to tolerate partitions in the network. Note that we assume a cluster architecture (which is where the network comes in).



CAP is a conjecture originally formulated by Eric Brewer (of Inktomi) and it has influenced many of today's larger-scale websites. In other words, the impact of CAP is very large. To make it worse, the perceived impossibility of a CAP system (one that has all three desirable properties) has led people to advocate something called BASE (Basically Available, Soft-state and Eventually Consistent) - see this talk by Werner Vogels (CTO at Amazon).

As far as I know (but I could be wrong), a theoretical foundation for BASE does not exist yet (it seems more of an informal approach, which to me raises serious questions concerning correctness). In this post I will present:


  • a CAP solution

  • how this conforms to what BASE wants to achieve

  • a "design pattern" for building correct systems that (in a way) offer both CAP and BASE qualities



Because CAP is perceived as impossible and because BASE lacks formal treatment, I consider this a significant contribution to the state of today's engineering;-)

What about the proof of Brewer's theorem?

A proof of Brewer's conjecture has been published by Seth Gilbert and Nancy Lynch, and discussed by me (see my earlier post and also this one).

While the theoretical proof of the impossibility of CAP is valid, it has a big limitation: it assumes that all three CAP properties have to be supplied at the same moment in time. If you drop this assumption, a whole new spectrum of possibilities suddenly opens up. This is what I will do here.

A CAP solution

Enough talk, let's get to the core of the matter. Here is my solution to CAP. To make it concrete, I will use the concept of a web shop like Amazon. Here are the rules that are sufficient to ensure CAP (a code sketch of the complete protocol follows the list):


  1. Process reads from the database if possible, or use a cached value if needed for availability (if the DB is unreachable).

  2. All reads use versioning or another mechanism that allows optimistic locking.

  3. Updates supplied by clients (orders, in the case of Amazon) are queued for execution, and include the versioning information of the reads that led to the update.

  4. Queued updates are processed when the number of partitions is low enough to do so. The easiest way to do this is with a cluster-wide distributed transaction across all replicas (more on scalability later), but other, more refined ways are possible (such as quorum-based replication or any other smart way of replicating). The version information in the update is used to validate it: if the data in the database has been modified since the original read(s) that led to the update, the update is rejected and a cancellation is reported back to the client. Otherwise the order is processed and a confirmation is reported back to the client.

  5. The results (confirmation or cancellation) are sent asynchronously to the clients. This can be either email, message queuing, or any other asynchronous delivery method.
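
To make the five rules concrete, here is a minimal Python sketch of the whole protocol. All names in it (VersionedStore, Update, apply_pending and so on) are my own illustrative inventions, not an existing API; a real implementation would use a durable queue, a real database and a proper asynchronous notification channel.

```python
import queue
from dataclasses import dataclass

@dataclass
class Update:
    key: str
    new_value: object
    read_version: int   # version observed by the read that led to this update
    client_id: str

class VersionedStore:
    """One replica: every key carries a monotonically increasing version."""
    def __init__(self):
        self.data = {}    # key -> (value, version)
        self.cache = {}   # last values seen, served when the DB is unreachable

    def read(self, key, db_reachable=True):
        # Rules 1+2: read from the DB if possible, else fall back to the
        # cache; either way return the version for optimistic validation.
        if db_reachable and key in self.data:
            value, version = self.data[key]
            self.cache[key] = (value, version)
            return value, version
        return self.cache.get(key, (None, 0))

pending = queue.Queue()   # Rule 3: client updates are queued, not applied inline

def submit(update):
    pending.put(update)   # returns immediately: the write is "available"

def apply_pending(store, notify_client):
    # Rule 4: run when the partition has healed. Each update is validated
    # against the current version and rejected if the data changed since
    # the read that produced it. Rule 5: the outcome goes out asynchronously.
    while not pending.empty():
        u = pending.get()
        _, current_version = store.data.get(u.key, (None, 0))
        if u.read_version == current_version:
            store.data[u.key] = (u.new_value, current_version + 1)
            notify_client(u.client_id, "confirmed")
        else:
            notify_client(u.client_id, "cancelled")

# Example: a read, a queued order, and its asynchronous confirmation.
store = VersionedStore()
store.data["item-42"] = ("in stock", 1)
value, version = store.read("item-42")
submit(Update("item-42", "sold out", read_version=version, client_id="alice"))
apply_pending(store, lambda cid, msg: print(cid, msg))   # prints: alice confirmed
```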



That's it. Adhere to these guidelines, and you have a CAP architecture. I will not provide a formal proof here (I intend to do that elsewhere, in a research paper), but intuitively the proof is as follows:


  • This system is consistent because reads are based on snapshots and incorrect updates are rejected before they are applied. In other words: there are no incorrect executions.

  • This system is available since reads always return a value, and so do writes (even though they are queued and it may take a while).

  • This system is partition-tolerant because it allows network and node failures.



Granted, this system does not provide all three at the same moment in time (which is how we get around the impossibility result), but nevertheless the result is quite strong IMHO.


The limitations

There are some limitations to this solution - all of which seem reasonable:

  1. Read-only requests may be presented with stale information (due to updates that have yet to be applied). In that sense, their results could be "inconsistent": for instance, the availability of an Amazon item can change between two page views. I do not see this as a major restriction, since no website that I know of offers read consistency for the duration of a user session. It all depends on what you consider to be within the scope of one transaction;-) Note that this almost corresponds to the snapshot isolation found in Oracle.

  2. Partitions should not last forever: for this to work, partitions should be resolved within a reasonable time (reasonable meaning: within the expected confirmation time for updates). The duration of any partition also affects the time window in which reads can produce stale data.

  3. The updates have to be applied in the same relative order at all cluster nodes. This puts some restrictions on the algorithm used to do this (a minimal sequencing sketch follows this list).
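
To illustrate the third limitation, here is a toy sketch of one way to obtain the same relative order at every node: a central sequencer stamps each update with a global sequence number, and every replica applies updates strictly in stamp order, buffering early arrivals. The single sequencer is an assumption made purely for illustration; a real cluster would use group communication or a consensus protocol instead.

```python
import itertools

class Sequencer:
    """Single ordering authority (an assumption of this sketch)."""
    def __init__(self):
        self._counter = itertools.count(1)

    def stamp(self, update):
        return next(self._counter), update

class Replica:
    """Applies updates strictly in sequence-number order."""
    def __init__(self):
        self.next_seq = 1
        self.buffer = {}    # seq -> update, held until its turn comes
        self.applied = []

    def deliver(self, seq, update):
        self.buffer[seq] = update
        # Drain everything that is now in order; gaps keep waiting.
        while self.next_seq in self.buffer:
            self.applied.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

# Even if the network reorders messages, every replica ends up with
# the same relative order of updates.
seq = Sequencer()
first, second = seq.stamp("order A"), seq.stamp("order B")
replica = Replica()
replica.deliver(*second)   # arrives early: buffered
replica.deliver(*first)    # unblocks both, in order
print(replica.applied)     # ['order A', 'order B']
```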



Note that updates are always based on correct reads thanks to the versioning check before they are applied. So update transactions are always consistent.


How does this relate to BASE?

You could see this as a design pattern for BASE if you like. The solution adheres to BASE in the sense that it uses cached reads (if needed) and that the updates are delayed (so you could say they are "eventually" applied and the system becomes "consistent").

Reflections on scalability

So far the focus has been on possibility, and I think my solution shows that CAP is possible. Now how about scaling up?

The naive solution (a huge distributed transaction to update all cluster nodes in sync) is unlikely to scale: as you add more nodes, each update has to touch more of them. Now I am a big fan of transactions, but not of using them in an arbitrary manner. So how to propagate these updates through the cluster?

While smarter solutions for this exist (such as the work by Bettina Kemme), a trivial first try would be to push updates (lazily) to all nodes in the cluster. This can be done with a smart queuing mechanism. The disadvantage is that updates are not applied everywhere at once (rather, the all-or-nothing quality just "ripples" through the system), so you get into the "eventually" style again; see the sketch below.
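
As a deliberately naive sketch of this lazy push: the coordinator acknowledges an update as soon as it is queued, and background workers deliver it to each node independently. The in-memory queues stand in for the "smart queuing mechanism", and all names are hypothetical; a real deployment would need durable queues and retries.

```python
import queue
import threading
import time

class Node:
    """Stand-in for one cluster node; a real node would persist updates."""
    def __init__(self, name):
        self.name = name
        self.applied = []

    def apply(self, update):
        self.applied.append(update)

class LazyReplicator:
    """Acknowledges once queued; worker threads ripple the update out."""
    def __init__(self, nodes):
        self.queues = [(node, queue.Queue()) for node in nodes]
        for node, q in self.queues:
            threading.Thread(target=self._drain, args=(node, q),
                             daemon=True).start()

    def publish(self, update):
        # Returns immediately: each node applies at its own pace, so the
        # all-or-nothing quality "ripples" through the cluster over time.
        for _, q in self.queues:
            q.put(update)

    def _drain(self, node, q):
        while True:
            node.apply(q.get())   # a real system would retry on failure

# Example: publish() returns before all nodes have applied the update.
nodes = [Node("a"), Node("b"), Node("c")]
replicator = LazyReplicator(nodes)
replicator.publish({"item-42": "sold out"})
time.sleep(0.1)   # give the toy workers a moment to catch up
print([len(n.applied) for n in nodes])   # eventually [1, 1, 1]
```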

Note that this latter suggestion makes the system behave much like the READ COMMITTED isolation level (which, by the way, is the default in Oracle). So this approach sacrifices consistency/isolation a bit in favor of scalability.


Future work

Additional research could/should be done in the following areas:


  • Improving read consistency through session affinity

  • The best way to push the updates through the cluster

  • Performance evaluation in real life implementations



Final note and disclaimer

I did not see Brewer's original presentation of the CAP theorem - so it could be that what he meant by consistency also involved all reads (see the limitations of the solution I presented here). In that case I did not find a solution for CAP, but at least it is a framework and proof outline for BASE ;-)

11 Comments:

Anonymous Anonymous said...

Hello Guy,

Thanks for your very interesting papers on transactions. The point of the CAP theorem is to say that consistency, availability and partition tolerance cannot be reached perfectly AT THE SAME TIME. The more significant the partitioning, the more accurate the CAP theorem becomes. I have been working for many years on distributed cache issues, and when our customers want to share a data cache between London, New York and Tokyo, the CAP theorem becomes fundamental. It is easy to say we can get C.A.P. at the same time when all your devices are in the same place and the partitioning is weak.

Best regards

Paul Perez

3:10 PM  
Blogger Guy said...

Hi Paul,

Thanks. You are right, the important notion seems to be "at the same time". It's just that that notion seems flexible if you have the option to be asynchronous (which I assume is not easily the case for distributed caches).

However, I do think it is possible to present a suite of algorithms that offer different characteristics and qualities of service. That is what I intend to do in the near future - if time allows.

6:10 PM  
Anonymous Anonymous said...

Hi Guy,
I don't like to be anonymous ("anonymous said..."). My name is Paul Perez; my email is paul.perez@pymma.com.

As you said, the key point of the CAP theorem is the "at the same time". It looks like the Heisenberg uncertainty principle: the closer you approach quantum scale, the more accurate the Heisenberg principle becomes. With the CAP theorem, the more partitioned your system is, the less you can get A&P at the same time.

Best regards

PPe

I liked your pattern on TCC. I am trying to implement it with BPEL orchestration. Are there technical samples on that topic? Thanks (to my email please)

1:24 PM  
Blogger Unknown said...

Hi Guy,

I was curious, how realistic would this solution be? As I understand it, when you process an update from the queue, the version of the database changes, and hence all the other updates in the queue (which would have a different version number) are rejected. So you would have a lot of rejected updates, which would mean a lot of emails to send to customers telling them to place their order again. Or am I misunderstanding something?

Regards

2:08 PM  
Blogger Guy said...

Hi,

Actually, what I propose is similar to optimistic locking (used in Hibernate, and in most web applications anyway): updates in the queue are indeed based on some "snapshot" of the data and are applied afterwards (when that data might have changed already). This is also how many web applications work - only there the HTTP session plays the role of the 'queue'.

This technique works pretty well except if you have 'hot spot' data - meaning data that changes extremely frequently due to high concurrency. There has been extensive research on this (just google for "optimistic locking" - it has been a while since my research days;-)

Guy

4:14 PM  
Anonymous Anonymous said...

Congratulations, you've re-invented Dynamo, badly. (Hint: read up on Vector Clocks, you can resolve a lot more conflicts automatically).

Your system serves stale data some of the time. Therefore, it DOES NOT HAVE the Consistency property. Period. Claiming that it does is silly.

Your system is no different from Amazon Dynamo: Sometimes, the data you serve is consistent. Sometimes, the data Dynamo serves is consistent. Both systems are 100% C+A+P compliant if you wait long enough between updates. Yawn.


You can't write a banking application on a "mostly"-consistent database. (It will lie about balances, and still sometimes go down when you need cash. But at least it will text me when it comes back up again!)

3:47 AM  
Anonymous Enrico Weigelt, metux IT service said...

Hi folks,

IMHO the old wisdom of divide et impera is a major key here.

In 99% of cases you don't need perfect consistency of every data object against every other.

Take the shop example:

A subsystem for users' product rating comments does not need to care about hotspot data like product availability. It just has to serve the existing ratings for a given product (by a globally unique identifier) and accept new ratings.
So let it be a completely separate service, called by the frontend. Whoops, that took a lot of load away from the primary database.

And so it goes on, case by case.

Of course, not a universal solution, but at least a solution, a path to go.


--
Enrico Weigelt,
metux IT service
http://www.metux.de/

1:42 AM  
Blogger Unknown said...

The interesting point in your article is the way you resolve partitioning with stale time. Partitioning is basically a moment in time when two systems cannot communicate. This always happens at some level, for example between two clock cycles. It's when the interval gets greater than the transaction interval that problems happen and that we can talk about partitioning. If you allow transactions to happen only at intervals longer than the partitioning period, partitioning disappears and the CAP theorem doesn't apply anymore.

Briefly, if I say that I'm making one transaction a day and that all nodes have disconnection periods of 6 hours or less, is that sufficient to prove that my system is CAP-perfect? I think not.

12:14 AM  
Blogger Unknown said...

Lynch's proof strikes me as irrelevant to actual system design, where timeouts and probabilities are fundamental. Since we are executing on a system with a non-zero probability of undetected ECC errors, we cannot do better than that. But since availability is never instantaneous, even on good days, it's reasonable to trade off probabilities of delay against probabilities of failure.

4:37 AM  
Anonymous Simon said...

While this may be a good solution from a practical standpoint it does not address the CAP theorem from a formal perspective. The theorem was presented in this context and can really only be challenged in this context. To point out a few of the problems with this solution - you assume that it's okay if partitions form provided they are temporary and likewise it is assumed to be fine for availability to cease temporarily. However, according to the formal model of Gilbert and Lynch this is not permissible. Any delay, however long, violates the availability constraint.

From a practical perspective your solution does seem fairly tolerant of real-world failures, but it doesn't address the CAP theorem according to the formal, theoretical model it was formulated in.

7:45 AM  
Blogger carlos aya said...

Hi Guy,

Your approach reminds me a little about STM (Software Transactional Memory) in Haskell and other languages. Have you read about this? Any thoughts?

cheers,
Carlos

5:52 AM  
