“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.” -- Lewis Carroll's Through the Looking Glass
I've recently been trying to understand more about these "NoSQL" systems, and how they work.
One interesting question is what they mean by "consistency". There is lots of talk about consistency, and eventual consistency, and the CAP theorem, and things like that.
And it's all very vague.
If you search online posts related to HBase and Cassandra comparisons, you will regularly find the HBase community explaining that they have chosen CP, while Cassandra has chosen AP – no doubt mindful of the fact that most developers need consistency (the C) at some level.
Indeed, HBase's own documentation says:
Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.
So I guess that the HBase development team is choosing to define "strongly consistent" as "not 'eventually consistent'". Which isn't very much of a definition, in my opinion.
If you search still more, you'll find more detailed information, such as this HBase page on ACID semantics, which admits that:
HBase is not an ACID compliant database.
and then proceeds to completely re-define the famous ACID properties that Jim Gray set forth nearly 35 years ago.
It's very instructive to compare the original relational database definitions of the ACID properties versus the HBase definitions.
First, here are the classic relational DBMS definitions, from the above Wikipedia article:
Atomicity requires that each transaction is "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes.
The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including but not limited to constraints, cascades, triggers, and any combination thereof.
Isolation refers to the requirement that no transaction should be able to interfere with another transaction. One way of achieving this is to ensure that no transactions that affect the same rows can run concurrently, since their sequence, and hence the outcome, might be unpredictable. This property of ACID is often partly relaxed due to the huge speed decrease this type of concurrency management entails.
Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently. If the database crashes immediately thereafter, it should be possible to restore the database to the state after the last transaction committed.
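To make the "all or nothing" part of those definitions concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the CHECK rule, and the make_db and transfer helpers are my own illustrative assumptions, not anything taken from the Wikipedia article or from HBase. The transfer deliberately credits the destination first, so that when the debit breaks the rule there is already a partial change that has to be undone.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with two example accounts."""
    conn = sqlite3.connect(":memory:")
    # The CHECK constraint is the "defined rule" that every valid state must satisfy.
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)",
            [("alice", 100), ("bob", 50)],
        )
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single all-or-nothing transaction."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK rule, so the earlier credit was rolled back too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    db = make_db()
    print(transfer(db, "alice", "bob", 500), balances(db))
    print(transfer(db, "alice", "bob", 30), balances(db))
```

The failed transfer leaves both rows exactly as they were, even though one of its two updates had already been applied. That multi-statement guarantee is what the relational definition of atomicity is talking about.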
Now, here are the HBase definitions, from the HBase ACID semantics page:
For the sake of common vocabulary, we define the following terms:
Atomicity: an operation is atomic if it either completes entirely or not at all
Consistency: all actions cause the table to transition from one valid state directly to another (eg a row will not disappear during an update, etc)
Isolation: an operation is isolated if it appears to complete independently of any other concurrent transaction
Durability: any update that reports "successful" to the client will not be lost
Visibility: an update is considered visible if any subsequent read will see the update as having been committed
These aren't even remotely close to the same definitions!
It's not at all clear what the NoSQL community is trying to do by re-defining all these words, and it's doubly unclear why the entire computing industry appears to be going along with it.
Why not define new terminology? Why change the meanings of words that have had precise definitions for about as long as general purpose computers have been in use?