HDFS is an append-only file system, so how could a database that supports random record updates be built on top of it?
HBase is what's called a log-structured merge tree, or an LSM, database. In an LSM database, data is stored within a multilevel storage hierarchy, with movement of data between levels happening in batches. Cassandra is another example of an LSM database.
When a write for a key is issued from the HBase client, the client looks up Zookeeper to get the location of the RegionServer that hosts the META region. It then queries the META region to find out a table's regions, their key ranges, and the RegionServers they are hosted on.
The client then makes an RPC call to the RegionServer that contains the key in the write request. The RegionServer receives the data for the key, immediately persists this in an in-memory structure called the Memstore, and then returns success back to the client. The Memstore can fill up once a large enough amount of data accumulates within it. At that point, the contents of the Memstore get emptied out onto the disk in a format called HFile.
What happens if the RegionServer crashes before the Memstore contents can be flushed to disk? There would have been irrevocable data loss. To avoid this data loss, each update to the Memstore is first persisted in a write-ahead log. The write-ahead log is maintained on HDFS to take advantage of the replication that is offered by HDFS, thereby ensuring high availability. If the RegionServer goes down, the contents of the WAL are replayed from the same or a replicant DataNode in order to recover the edits that were in the Memstore at the time of the RegionServer crash.
The Memstore is divided into memory segments that are maintained in each region. The flush of the memory segment associated with a region can be done independently of other regions. When the contents of the memory segment are ready to be flushed, data is sorted by key before it is written out to disk. These sorted, disk-based structures allow for performant access to individual records.
On the other hand, the WAL contains edits in the time order in which they were processed by the RegionServer. While data files in HBase are partitioned by region, each RegionServer today maintains a single active WAL for all of the regions hosted on the RegionServer:
