Chapter 3
HBase

Apache HBase is made for big jobs, like a nail gun. You would never use HBase to catalog your corporate sales list or build a to-do list app for fun, just like you’d never use a nail gun to build a doll house. If your dataset isn’t many, many gigabytes in size at the very least, then you should probably use a less heavy-duty tool.

At first glance, HBase looks a lot like a relational database, so much so that if you didn’t know any better, you might think that it is one. In fact, the most challenging part of learning HBase isn’t the technology; it’s that many of the words used in HBase are deceptively familiar. For example, HBase stores data in buckets it calls tables, which contain cells that appear at the intersection of rows and columns. Sounds like a relational database, right?

Wrong! In HBase, tables don’t behave like relations, rows don’t act like records, and columns are completely variable and not enforced by any predefined schema. Schema design is still important, of course, because it informs the performance characteristics of the system, but it won’t keep your house in order; that task falls to you and to how your applications use HBase. In general, trying to shoehorn HBase into an RDBMS-style system is a certain path to frustration and failure. HBase is the evil twin, the bizarro doppelgänger, if you will, of the RDBMS.
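
To make the "columns are completely variable" point concrete, here is a minimal sketch in Python. It is a toy in-memory model, not HBase's API: the names table, put, and columns_of, and the sample row keys and column names, are all invented for illustration. It only shows that two rows in the same table can carry entirely different sets of columns, with nothing checking them against a schema.

```python
from collections import defaultdict

# Toy model only (not the HBase API): a table maps each row key
# to its own dictionary of column name -> value.
table = defaultdict(dict)

def put(row_key, column, value):
    """Store a value in one cell; no schema decides which columns are allowed."""
    table[row_key][column] = value

def columns_of(row_key):
    """Return the column names that this particular row happens to have."""
    return sorted(table[row_key])

# Two rows in the same table with different sets of columns.
put("page-1", "body", "Hello")
put("page-1", "author", "alice")
put("page-2", "body", "World")
put("page-2", "lang", "en")

print(columns_of("page-1"))
print(columns_of("page-2"))
```

In a relational table, adding the lang column would mean altering the table for every row. In this sketch, as in the model described above, it is simply one more cell on one row, which is why keeping the data tidy is the application's job.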

On top of that, unlike relational databases, which sometimes have trouble scaling out, HBase doesn’t scale down. If your production HBase cluster has fewer than five nodes, then, quite frankly, you’re doing it wrong. HBase is not the right database for some problems, particularly those where the amount of data is measured in megabytes, or even in the low gigabytes.

So why would you use HBase? Aside from scalability, there are a few reasons. To begin with, HBase has some built-in features that other databases lack, such as versioning, compression, garbage collection (for expired data), and in-memory tables. Having these features available right out of the box means less code that you have to write when your requirements demand them. HBase also makes strong consistency guarantees, making it easier to transition from relational databases for some use cases. Finally, HBase guarantees atomicity at the row level: a change to a single row either takes effect completely or not at all, so those consistency guarantees hold at a crucial level of HBase’s data model.
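
Versioning and row-level atomicity can be pictured with another small Python sketch. Again, this is a toy model under stated assumptions, not how HBase is implemented and not its API: the ToyRow class, its put and get methods, the max_versions parameter, and the integer timestamps are all hypothetical. Each cell keeps several timestamped versions, and a multi-column write to one row is applied under a single lock so that readers never see half of it.

```python
import threading

class ToyRow:
    """Toy model of one row: versioned cells plus all-or-nothing writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cells = {}  # column name -> list of (timestamp, value), newest first

    def put(self, updates, timestamp):
        """Apply every column update in `updates`, or none of them."""
        # Validate before touching any cell so a bad update writes nothing.
        for column in updates:
            if not isinstance(column, str):
                raise TypeError("column names must be strings")
        # One lock per row: readers see the row before or after the put,
        # never in between.
        with self._lock:
            for column, value in updates.items():
                versions = self._cells.setdefault(column, [])
                versions.append((timestamp, value))
                versions.sort(key=lambda pair: pair[0], reverse=True)

    def get(self, column, max_versions=1):
        """Return up to `max_versions` values for a column, newest first."""
        with self._lock:
            return [value for _, value in self._cells.get(column, [])[:max_versions]]

row = ToyRow()
row.put({"status": "draft", "owner": "alice"}, timestamp=1)
row.put({"status": "published"}, timestamp=2)

print(row.get("status"))                  # newest version only
print(row.get("status", max_versions=2))  # newest first, then the older one
```

The older value of status is not overwritten; it stays available as an earlier version, which is the kind of behavior you would otherwise have to build yourself on top of a database without versioning.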

For all of these reasons, HBase really shines as the cornerstone of a large-scale online analytical processing (OLAP) system. While individual operations may sometimes be slower than equivalent operations in other databases, scanning through enormous datasets is an area where HBase truly excels. For genuinely big queries, HBase often outpaces other databases, which helps to explain why HBase is often used at big companies to back heavy-duty logging and search systems.
