When the HBase client issues a read for a key, it must, as with writes, first query the META table to identify the RegionServer that hosts the data for that key.
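For concreteness, a single-key read through the HBase Java client might look like the following minimal sketch (the table name, row key, column family, and qualifier are hypothetical). The client library performs the META lookup and routes the request to the right RegionServer transparently:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        // The client library consults the META table behind the scenes
        // to locate the RegionServer hosting this key's region.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table
            Get get = new Get(Bytes.toBytes("user#42"));                 // hypothetical row key
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"),       // hypothetical family
                                           Bytes.toBytes("email"));     // hypothetical qualifier
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}
```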
Once the RegionServer receives the read request, it looks the key up in the Memstore. This alone is insufficient: the Memstore's contents may have been flushed to disk before the read arrived, so the RegionServer also has to look for the key in the HFiles that earlier Memstore flushes produced. Nor is it enough to check only the most recent HFile for the region, because writes for that key could have arrived at any point in the past; the RegionServer has to look for the key in every HFile that exists for that region.
In other words, fragments of the record may exist in the Memstore and in various HFiles. On a read request, these fragments have to be retrieved and merged to reconstruct the latest snapshot of the record before it can be served.
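The following is a deliberately simplified sketch, not HBase's actual implementation, of what that reconstruction amounts to: each source (the Memstore and each HFile) contributes timestamped fragments per column, and the newest fragment per column wins. All names here are invented for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeReadSketch {
    // A value fragment tagged with the timestamp of the write that produced it.
    record Versioned(long timestamp, String value) {}

    /**
     * Reconstructs the latest snapshot of a row from fragments scattered
     * across the Memstore and HFiles. Each source maps column -> versioned
     * value; for each column, the fragment with the highest timestamp wins.
     */
    static Map<String, String> mergeFragments(List<Map<String, Versioned>> sources) {
        Map<String, Versioned> newest = new HashMap<>();
        for (Map<String, Versioned> source : sources) {
            source.forEach((column, candidate) ->
                newest.merge(column, candidate,
                    (existing, incoming) ->
                        existing.timestamp() >= incoming.timestamp() ? existing : incoming));
        }
        Map<String, String> snapshot = new HashMap<>();
        newest.forEach((column, v) -> snapshot.put(column, v.value()));
        return snapshot;
    }

    public static void main(String[] args) {
        // The Memstore holds the freshest write for "email"; older HFiles hold the rest.
        Map<String, Versioned> memstore = Map.of("email", new Versioned(300, "new@example.com"));
        Map<String, Versioned> hfile1 = Map.of("email", new Versioned(100, "old@example.com"),
                                               "name", new Versioned(100, "Ada"));
        Map<String, Versioned> hfile2 = Map.of("city", new Versioned(200, "London"));
        // Snapshot contains the newest email plus the columns only present in HFiles.
        System.out.println(mergeFragments(List.of(memstore, hfile1, hfile2)));
    }
}
```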
This illustrates one of the fundamental trade-offs in HBase: by converting random writes into append-only operations, disk seeks are minimized on the write path, but at the cost of multiple seeks on the read path.
Several additional structures exist to optimize reading data from HFiles:
- Ordering: When the Memstore segment for a given region is flushed, the data is written to disk sorted by key. Within a given row, data is sorted by column. Because the data is sorted, retrieving a given row and column can use something akin to a binary search to quickly triangulate to a given value.
- Indexes: Since the data is sorted, it is possible to build an index. The index is made up of multiple index blocks, each recording the start key for a given set of data blocks. A quick scan of the index blocks makes it possible to triangulate to a given value even faster (a sketch of such a lookup follows this list). The index blocks can themselves be indexed, creating a multilevel index that further reduces the number of seeks required to reach a value within an HFile.
- Bloom filters: HBase uses a probabilistic data structure called a bloom filter to avoid reading an HFile at all when it contains no data for a given key. A bloom filter can be thought of as a collection of bit vectors and hash functions. When a key is added, it is hashed with each of the hash functions, and the bit at each resulting hash index is set to one. The same logic is used to check whether a key is present: only if the bits at the corresponding hash indices in all of the bit vectors are one do we conclude that the key may be present in the bloom filter. A bloom filter can produce false positives, since multiple keys might hash to the same indices, but never false negatives: it might claim to have a key that it doesn't have, but it never denies a key that it does have.
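To make the ordering and index points concrete, here is a minimal single-level sketch with invented names (real HFile indexes are multilevel and operate on blocks of bytes): a binary search over the sorted start keys of the index picks out the one data block that could contain the key:

```java
import java.util.List;

public class IndexLookupSketch {
    /**
     * Given the sorted start keys of an HFile's data blocks, return the index
     * of the single block that could contain {@code key}: the last block whose
     * start key is <= the search key. Returns -1 if the key sorts before
     * every block.
     */
    static int findBlock(List<String> blockStartKeys, String key) {
        int lo = 0, hi = blockStartKeys.size() - 1, answer = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;   // unsigned shift avoids int overflow
            if (blockStartKeys.get(mid).compareTo(key) <= 0) {
                answer = mid;            // this block could hold the key...
                lo = mid + 1;            // ...but a later block might fit better
            } else {
                hi = mid - 1;
            }
        }
        return answer;
    }

    public static void main(String[] args) {
        // Start keys of four hypothetical data blocks, sorted as in an HFile.
        List<String> index = List.of("apple", "grape", "mango", "peach");
        System.out.println(findBlock(index, "kiwi"));     // -> 1 (block starting at "grape")
        System.out.println(findBlock(index, "peach"));    // -> 3
        System.out.println(findBlock(index, "aardvark")); // -> -1 (sorts before every block)
    }
}
```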
Returning to bloom filters: each HFile comes with one covering all the keys present in that HFile. If the filter reports the key as absent, the HFile is skipped entirely; if it reports the key as possibly present, the HFile is queried to retrieve data for the key. A minimal sketch of such a filter follows.
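This is a toy bloom filter along the lines described above; the two hash functions are arbitrary choices for illustration, and production implementations, including HBase's, use more carefully engineered hashing and sizing:

```java
import java.util.BitSet;

public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;

    BloomFilterSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple, arbitrary hash functions chosen only for illustration.
    private int hash1(String key) {
        return Math.floorMod(key.hashCode(), size);
    }

    private int hash2(String key) {
        int h = 0x811C9DC5;                                   // FNV-1a offset basis
        for (char c : key.toCharArray()) h = (h ^ c) * 16777619;
        return Math.floorMod(h, size);
    }

    /** Adding a key sets (never clears) the bit at each hash index. */
    void add(String key) {
        bits.set(hash1(key));
        bits.set(hash2(key));
    }

    /**
     * "Maybe present" requires every bit to be set; a single clear bit
     * proves absence. False positives are possible, false negatives are not.
     */
    boolean mightContain(String key) {
        return bits.get(hash1(key)) && bits.get(hash2(key));
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024);
        filter.add("row-123");
        System.out.println(filter.mightContain("row-123")); // true: never a false negative
        System.out.println(filter.mightContain("row-999")); // almost certainly false: skip this HFile
    }
}
```

On the read path, the RegionServer would consult something like mightContain against each HFile's filter and skip any file for which it returns false.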