What if we wanted to run HBase in multiple datacenters? Running a single HBase cluster stretched across datacenters is typically not recommended. What we want to do instead is set up an independent HBase cluster in each datacenter. In that case, how do we ensure that the same data is available from both clusters? We have a couple of options.
We can have the application do a dual ingest. In other words, the application is aware that multiple HBase clusters exist; it explicitly connects to each cluster and stores the data in both. In this setup, if one or more clusters are unavailable, it is the application's responsibility to keep track of which data has been written to which cluster and to ensure that every write eventually lands on every cluster.
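To make the dual-ingest option concrete, here is a minimal sketch using the standard HBase Java client. The ZooKeeper quorum addresses, table name, and column names are hypothetical placeholders, and real code would need retry and bookkeeping logic around the two writes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DualIngest {
        public static void main(String[] args) throws Exception {
            // One configuration per cluster; the quorum addresses are placeholders.
            Configuration confA = HBaseConfiguration.create();
            confA.set("hbase.zookeeper.quorum", "zk-a1,zk-a2,zk-a3");
            Configuration confB = HBaseConfiguration.create();
            confB.set("hbase.zookeeper.quorum", "zk-b1,zk-b2,zk-b3");

            try (Connection connA = ConnectionFactory.createConnection(confA);
                 Connection connB = ConnectionFactory.createConnection(confB);
                 Table tableA = connA.getTable(TableName.valueOf("events"));
                 Table tableB = connB.getTable(TableName.valueOf("events"))) {

                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("v1"));

                // The application writes to each cluster explicitly. If the first
                // put succeeds and the second fails, the application must record
                // the miss and retry it against cluster B later.
                tableA.put(put);
                tableB.put(put);
            }
        }
    }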
The other option is to leverage HBase's native replication to synchronize the clusters. HBase replication is asynchronous, which means that once a write has been persisted in the source cluster, the client immediately receives the write acknowledgement. At some point in the future, the source cluster attempts to forward the write to the target cluster; the target cluster receives the write, applies it locally, and acknowledges it back to the source. The source cluster then knows that the target cluster is caught up to that point in the edit stream. This model is weaker than strong consistency: it is timeline consistent, in the sense that updates are applied in the same order on both the source and target clusters, but they might be arbitrarily delayed at the target cluster. Hence, there's a potential for stale reads at the target cluster.
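As a sketch of how native replication is wired up, the Admin API (assuming an HBase 2.x client) can register the remote cluster as a replication peer and enable replication for a table; the peer id, cluster key, and table name below are placeholders, and the same steps can be performed from the HBase shell.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

    public class AddReplicationPeer {
        public static void main(String[] args) throws Exception {
            // Connect to the source cluster (configuration taken from the classpath).
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {

                // Register the target cluster as a replication peer; the peer id
                // and the target's ZooKeeper cluster key are placeholders.
                ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
                    .setClusterKey("zk-b1,zk-b2,zk-b3:2181:/hbase")
                    .build();
                admin.addReplicationPeer("dc-b", peer);

                // Mark every column family of the table for replication so that
                // its edits are shipped asynchronously to the peer.
                admin.enableTableReplication(TableName.valueOf("events"));
            }
        }
    }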
Let's next examine how HBase replication behaves under different types of workloads and the kind of inconsistency that can arise during failover. First, let's define what failover means. Assume we are running clusters in an active-standby configuration: readers and writers are both connected to a single HBase cluster at any point in time, and the edits on that cluster are asynchronously replicated to a peer cluster. When the active cluster becomes unavailable, both readers and writers switch over to the standby.
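A very simplified version of this client-side switch might look like the following sketch; the failure detection is naive (any IOException triggers the switch), and the configuration objects and table name are placeholders.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;

    // All reads and writes go to the active cluster; on failure the client
    // reconnects to the standby. Real failover logic (health checks, retries,
    // coordination across clients) is omitted for brevity.
    public class FailoverWriter {
        private final Configuration standbyConf;
        private Connection conn;

        FailoverWriter(Configuration activeConf, Configuration standbyConf) throws IOException {
            this.standbyConf = standbyConf;
            this.conn = ConnectionFactory.createConnection(activeConf);
        }

        void put(Put put) throws IOException {
            try (Table table = conn.getTable(TableName.valueOf("events"))) {
                table.put(put);
            } catch (IOException e) {
                // The active cluster is unavailable: switch over to the standby
                // cluster and retry the write there.
                conn.close();
                conn = ConnectionFactory.createConnection(standbyConf);
                try (Table table = conn.getTable(TableName.valueOf("events"))) {
                    table.put(put);
                }
            }
        }
    }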
HBase's asynchronous replication isn't limited to traditional master-slave replication; HBase also supports multimaster replication, in which both clusters asynchronously replicate to each other. Wouldn't this cause race conditions? Wouldn't updates clobber each other? As we've discussed previously, cells in HBase are versioned with a timestamp, which by default is the time of the insert in the source cluster. When the update is replicated over to the target cluster, it is inserted with the timestamp from the source cluster. Hence, the update is recorded with the same timestamp in the source and target clusters.
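The following sketch illustrates the versioning (table and column names are placeholders). Whether the timestamp is assigned implicitly at insert time or set explicitly as shown here, it travels with the cell when the edit is replicated, so both clusters end up storing an identical version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimestampedPut {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {

                // Without an explicit timestamp, the cell is stamped with the
                // insert time on the source cluster; replication ships that same
                // timestamp to the peer. Reads return the version with the
                // highest timestamp, which is how concurrent updates from two
                // masters resolve the same way on both clusters.
                long ts = System.currentTimeMillis();
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), ts, Bytes.toBytes("v2"));
                table.put(put);
            }
        }
    }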
This means that if the two clusters are partitioned off from each other while some writers write to cluster A and others write to cluster B, updates will flow again between A and B once the partition is resolved. Both sets of updates will be recorded identically (same cells, same timestamps) in the two clusters and, eventually, the clusters will be in sync.
However, if failover is implemented from the perspective of writers switching over to a peer cluster when the local cluster is unavailable, there is a potential for inconsistency, depending on the type of write workload.
If the writers are issuing blind writes (that is, writes that aren't preceded by a read), there is no potential for inconsistency. Say the writers were sending their updates to cluster A, cluster A went down, and the writers switched to cluster B. When cluster A eventually comes back up, cluster B sends over all of the delta updates, and cluster A is eventually caught up.
What if the workload is a read-modify-write? Say the client reads the value 2 for a given cell, increments it by 2, and writes the value 4 back into the database. At this point, the cluster becomes unavailable before the new value 4 can be replicated to the peer cluster. When the client fails over to the peer cluster and needs to increment the cell again, it applies the increment to the stale value 2, not the new value 4, resulting in inconsistent data within the database. Such inconsistency is unavoidable in any database that supports only asynchronous replication between peer clusters.
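The lost update can be sketched with the standard HBase Java client as follows, assuming the counter is stored as an 8-byte long; the table and column names are hypothetical.

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReadModifyWrite {
        // A read-modify-write against whichever cluster the client is currently
        // connected to. If the write lands on cluster A but A fails before the
        // edit is replicated, running this again against cluster B reads the
        // stale value (2, not 4), and the earlier increment is silently lost.
        static void incrementByTwo(Connection conn) throws Exception {
            try (Table table = conn.getTable(TableName.valueOf("counters"))) {
                byte[] row = Bytes.toBytes("row-1");
                byte[] cf  = Bytes.toBytes("d");
                byte[] col = Bytes.toBytes("count");

                // Read the current value (say, 2 on cluster A before it fails).
                Result result = table.get(new Get(row));
                long current = Bytes.toLong(result.getValue(cf, col));

                // Modify and write back (4 on cluster A; a failover to the
                // not-yet-caught-up cluster B would start again from 2).
                Put put = new Put(row);
                put.addColumn(cf, col, Bytes.toBytes(current + 2));
                table.put(put);
            }
        }
    }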