Table of Contents for
Seven Databases in Seven Weeks, 2nd Edition


Seven Databases in Seven Weeks, 2nd Edition, by Luc Perkins with Eric Redmond and Jim R. Wilson. Published by Pragmatic Bookshelf, 2018.
  Title Page
  Acknowledgments
  Preface
      Why a NoSQL Book
      Why Seven Databases
      What’s in This Book
      What This Book Is Not
      Code Examples and Conventions
      Credits
      Online Resources
  1. Introduction
      It Starts with a Question
      The Genres
      Onward and Upward
  2. PostgreSQL
      That’s Post-greS-Q-L
      Day 1: Relations, CRUD, and Joins
      Day 2: Advanced Queries, Code, and Rules
      Day 3: Full Text and Multidimensions
      Wrap-Up
  3. HBase
      Introducing HBase
      Day 1: CRUD and Table Administration
      Day 2: Working with Big Data
      Day 3: Taking It to the Cloud
      Wrap-Up
  4. MongoDB
      Hu(mongo)us
      Day 1: CRUD and Nesting
      Day 2: Indexing, Aggregating, Mapreduce
      Day 3: Replica Sets, Sharding, GeoSpatial, and GridFS
      Wrap-Up
  5. CouchDB
      Relaxing on the Couch
      Day 1: CRUD, Fauxton, and cURL Redux
      Day 2: Creating and Querying Views
      Day 3: Advanced Views, Changes API, and Replicating Data
      Wrap-Up
  6. Neo4J
      Neo4j Is Whiteboard Friendly
      Day 1: Graphs, Cypher, and CRUD
      Day 2: REST, Indexes, and Algorithms
      Day 3: Distributed High Availability
      Wrap-Up
  7. DynamoDB
      DynamoDB: The “Big Easy” of NoSQL
      Day 1: Let’s Go Shopping!
      Day 2: Building a Streaming Data Pipeline
      Day 3: Building an “Internet of Things” System Around DynamoDB
      Wrap-Up
  8. Redis
      Data Structure Server Store
      Day 1: CRUD and Datatypes
      Day 2: Advanced Usage, Distribution
      Day 3: Playing with Other Databases
      Wrap-Up
  9. Wrapping Up
      Genres Redux
      Making a Choice
      Where Do We Go from Here?
  A1. Database Overview Tables
  A2. The CAP Theorem
      Eventual Consistency
      CAP in the Wild
      The Latency Trade-Off
  Bibliography

Wrap-Up

HBase is a juxtaposition of simplicity and complexity. Its data storage model is pretty straightforward, with a lot of flexibility and just a few built-in schema constraints. A major barrier to understanding HBase, though, stems from the fact that many terms are overloaded with baggage from the relational world (such as table and column). Schema design in HBase typically boils down to deciding on the performance characteristics that you want to apply to your tables and columns, which is pretty far afield from the relational world, where things usually hinge upon table design.
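To make that concrete, here is a rough sketch of what schema design tends to look like in the HBase shell: most of the decisions are options applied per column family when you create the table. The table and family names below are hypothetical, not taken from this chapter's examples.

    # Hypothetical table: tuning happens per column family,
    # not through column definitions as in a relational schema.
    create 'articles',
      { NAME => 'text', VERSIONS => 5, COMPRESSION => 'GZ' },  # keep five versions, compress store files
      { NAME => 'meta', VERSIONS => 1 }                        # metadata only needs the latest value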

HBase’s Strengths

Noteworthy features of HBase include a robust scale-out architecture and built-in versioning and compression capabilities. HBase’s built-in versioning capability can be a compelling feature for certain use cases. Keeping the version history of wiki pages is a crucial feature for policing and maintenance, for instance. By choosing HBase, you don’t have to take any special steps to implement page history—you get it for free. No other database in this book offers that out of the box.
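As a quick, hedged illustration (the cell coordinates here are hypothetical, loosely in the spirit of the wiki example), reading older versions is just a parameter on an ordinary get in the HBase shell:

    # Each cell keeps timestamped versions, up to the column family's VERSIONS setting.
    get 'wiki', 'Home', { COLUMN => 'text:', VERSIONS => 3 }
    # Returns up to three timestamped revisions of the page text, newest first.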

On the performance front, HBase is meant to scale out. If you have huge amounts of data, measured in many terabytes or more, HBase may be for you. HBase is rack aware, replicating data within and between datacenter racks so that node failures can be handled gracefully and quickly.

The HBase community is pretty awesome. There’s almost always somebody on the #hbase IRC channel,[23] on HBase’s dedicated Slack channel,[24] or on the mailing list[25] ready to help with questions and get you pointed in the right direction.

HBase’s Weaknesses

Although HBase is designed to scale out, it doesn’t scale down. The HBase community seems to agree that five nodes is the minimum number you’ll want to use. Because it’s designed to be big, it can also be harder to administer (though platforms like EMR, which you saw in Day 3, do provide some good managed options). Solving small problems isn’t what HBase is about, and nonexpert documentation is tough to come by, which steepens the learning curve.

Additionally, HBase is almost never deployed alone. Instead, it is usually used in conjunction with other scale-ready infrastructure pieces. These include Hadoop (an implementation of Google’s MapReduce), the Hadoop Distributed File System (HDFS), ZooKeeper (a headless service that aids internode coordination), and Apache Spark (a popular cluster computing platform). This ecosystem is both a strength and a weakness; it provides an ever-expanding set of powerful tools, but making them all work in conjunction with one another across many machines—sometimes in the thousands—can be quite cumbersome.

One noteworthy characteristic of HBase is that it doesn’t offer any sorting or indexing capabilities aside from the row keys. Rows are kept in sorted order by their row keys, but no such sorting is done on any other field, such as column names and values. So, if you want to find rows by something other than their key, you need to either scan the table or maintain your own index (perhaps in a separate HBase table or in an external system).
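In practice (a sketch with hypothetical table, column, and value names), that leaves two options in the shell: hand the work to a filtered full scan, or write and maintain index rows of your own.

    # Option 1: scan the whole table with a server-side value filter (no index is used).
    scan 'wiki', { COLUMNS => ['revision:author'],
                   FILTER  => "ValueFilter(=, 'binary:some_author')" }

    # Option 2: maintain a separate index table keyed by the value you query on.
    put 'wiki_by_author', 'some_author|Home', 'index:page', 'Home'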

Another missing concept is datatypes. All field values in HBase are treated as uninterpreted arrays of bytes. There is no distinction between, say, an integer value, a string, and a date. They’re all bytes to HBase, so it’s up to your application to interpret the bytes, which can be tricky, especially if you’re used to relational access patterns like object-relational mappers (ORMs).
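A short shell session makes the point (hypothetical cells; the output lines are abbreviated and approximate). HBase stores whatever bytes it is handed, and even a counter written with incr comes back as raw bytes unless the client decodes it.

    # The string '42' is stored as its bytes; nothing records that it was meant as a number.
    put 'wiki', 'Home', 'revision:views_label', '42'

    # incr stores an eight-byte big-endian long; a plain get shows bytes, not a number.
    incr 'wiki', 'Home', 'revision:views', 1
    get 'wiki', 'Home', 'revision:views'
    # => value=\x00\x00\x00\x00\x00\x00\x00\x01
    get_counter 'wiki', 'Home', 'revision:views'   # shell helper that decodes the long
    # => COUNTER VALUE = 1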

HBase on CAP

With respect to CAP, HBase is decidedly CP (for more information on the CAP theorem, see Appendix 2, The CAP Theorem). HBase makes strong consistency guarantees. If a client succeeds in writing a value, other clients will receive the updated value on the next request. Some databases allow you to tweak the CAP equation on a per-operation basis. Not so with HBase. In the face of reasonable amounts of partitioning—for example, a node failing—HBase will remain available, shunting the responsibility off to other nodes in the cluster. However, in the pathological case, where only one node is left alive, HBase has no choice but to refuse requests.

The CAP discussion gets a little more complex when you introduce cluster-to-cluster replication, an advanced feature we didn’t cover in this chapter. A typical multicluster setup could have clusters separated geographically by some distance. In this case, for a given column family, one cluster is the system of record, while the other clusters merely provide access to the replicated data. This system is eventually consistent because the replication clusters will serve up the most recent values they’re aware of, which may not be the most recent values in the master cluster.

Parting Thoughts

HBase can be quite a challenge at first. The terminology is often deceptively reassuring, and the installation and configuration are not for the faint of heart. On the plus side, some of the features HBase offers, such as versioning and compression, are quite unique. These aspects can make HBase quite appealing for solving certain problems. And of course, it scales out to many nodes of commodity hardware quite well. All in all, HBase—like a nail gun—is a pretty big tool, so watch your thumbs.