Table of Contents for
Seven Databases in Seven Weeks, 2nd Edition


Seven Databases in Seven Weeks, 2nd Edition by Luc Perkins, Eric Redmond, and Jim Wilson. Published by Pragmatic Bookshelf, 2018.
  1. Title Page
  2. Seven Databases in Seven Weeks, Second Edition
  3. Acknowledgments
  4. Preface
  5. Why a NoSQL Book
  6. Why Seven Databases
  7. What’s in This Book
  8. What This Book Is Not
  9. Code Examples and Conventions
  10. Credits
  11. Online Resources
  12. 1. Introduction
  13. It Starts with a Question
  14. The Genres
  15. Onward and Upward
  16. 2. PostgreSQL
  17. That’s Post-greS-Q-L
  18. Day 1: Relations, CRUD, and Joins
  19. Day 2: Advanced Queries, Code, and Rules
  20. Day 3: Full Text and Multidimensions
  21. Wrap-Up
  22. 3. HBase
  23. Introducing HBase
  24. Day 1: CRUD and Table Administration
  25. Day 2: Working with Big Data
  26. Day 3: Taking It to the Cloud
  27. Wrap-Up
  28. 4. MongoDB
  29. Hu(mongo)us
  30. Day 1: CRUD and Nesting
  31. Day 2: Indexing, Aggregating, MapReduce
  32. Day 3: Replica Sets, Sharding, GeoSpatial, and GridFS
  33. Wrap-Up
  34. 5. CouchDB
  35. Relaxing on the Couch
  36. Day 1: CRUD, Fauxton, and cURL Redux
  37. Day 2: Creating and Querying Views
  38. Day 3: Advanced Views, Changes API, and Replicating Data
  39. Wrap-Up
  40. 6. Neo4j
  41. Neo4j Is Whiteboard Friendly
  42. Day 1: Graphs, Cypher, and CRUD
  43. Day 2: REST, Indexes, and Algorithms
  44. Day 3: Distributed High Availability
  45. Wrap-Up
  46. 7. DynamoDB
  47. DynamoDB: The “Big Easy” of NoSQL
  48. Day 1: Let’s Go Shopping!
  49. Day 2: Building a Streaming Data Pipeline
  50. Day 3: Building an “Internet of Things” System Around DynamoDB
  51. Wrap-Up
  52. 8. Redis
  53. Data Structure Server Store
  54. Day 1: CRUD and Datatypes
  55. Day 2: Advanced Usage, Distribution
  56. Day 3: Playing with Other Databases
  57. Wrap-Up
  58. 9. Wrapping Up
  59. Genres Redux
  60. Making a Choice
  61. Where Do We Go from Here?
  62. A1. Database Overview Tables
  63. A2. The CAP Theorem
  64. Eventual Consistency
  65. CAP in the Wild
  66. The Latency Trade-Off
  67. Bibliography

Introducing HBase

HBase is a column-oriented database that prides itself on its ability to provide both consistency and scalability. It is based on Bigtable, a high-performance, proprietary database developed by Google and described in the 2006 white paper “Bigtable: A Distributed Storage System for Structured Data.”[12] Initially created for natural language processing, HBase started life as a contrib package for Apache Hadoop. Since then, it has become a top-level Apache project.

Luc says:
Hosted HBase with Google Cloud Bigtable

As you’ll see later in this chapter, HBase can be tricky to administer. Fortunately, there’s now a compelling option for those who want the power of HBase with very little operational burden: Google’s Cloud Bigtable, part of the Google Cloud Platform suite of products. Cloud Bigtable isn’t 100% compatible with HBase, but as of early 2018 it’s close enough that you may be able to migrate many existing HBase applications over.

If you find the basic value proposition of, for example, Amazon’s cloud-based DynamoDB compelling and you think HBase is a good fit for a project, then Cloud Bigtable might be worth checking out. You can at least be assured that it’s run by the same company that crafted the concepts behind HBase (and the folks at Google do seem to know a thing or two about scale).

On the architecture front, HBase is designed to be fault tolerant. Hardware failures may be uncommon in individual machines, but in large clusters node failure is the norm (as are network issues). HBase can gracefully recover from individual server failures thanks to two mechanisms: write-ahead logging, which appends every change to a durable log before applying it, so that a recovering node can replay the log to rebuild its state rather than lose acknowledged writes; and distributed configuration, which means that nodes can rely on each other for configuration rather than on a centralized source.
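
The recovery guarantee that write-ahead logging provides can be sketched in a few lines. This is a minimal, hypothetical model in Python, not HBase’s actual implementation (which logs to HDFS and serves data from regionserver memstores); the `WalStore` class and log path are made up for illustration. The invariant is simply that every mutation reaches the durable log before it is applied in memory, so a fresh process pointed at the same log can rebuild the state of a crashed one.

```python
import json
import os

class WalStore:
    """Minimal write-ahead-logged key-value store (illustrative only)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}      # in-memory state: fast, but lost on a crash
        self._replay()      # recover any prior state from the durable log

    def put(self, key, value):
        # 1. Append the mutation to the durable log first...
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"k": key, "v": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())   # force it to stable storage
        # 2. ...and only then apply it in memory.
        self.data[key] = value

    def _replay(self):
        # Replaying the log reconstructs every acknowledged write,
        # because nothing was ever applied without being logged first.
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    entry = json.loads(line)
                    self.data[entry["k"]] = entry["v"]

# Simulate a crash: write, abandon the in-memory store, then recover.
log_path = "/tmp/wal-demo.log"
if os.path.exists(log_path):
    os.remove(log_path)                  # start the demo from a clean log

store = WalStore(log_path)
store.put("row1", "hello")

recovered = WalStore(log_path)           # a "restarted" node, same log
print(recovered.data["row1"])            # prints: hello
```

The ordering is the whole trick: if the process dies between the `fsync` and the in-memory update, the write still survives in the log and reappears on replay.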

Additionally, HBase lives in the Hadoop ecosystem, where it benefits from its proximity to other related tools. Hadoop is a sturdy, scalable computing platform that provides a distributed file system and MapReduce capabilities. Wherever you find HBase, you’ll find Hadoop and other infrastructural components that you can use in your own applications, such as Apache Hive, a data warehousing tool, and Apache Pig, a parallel processing tool (and many others).

Finally, HBase is actively used and developed by a number of high-profile companies for their “Big Data” problems. Notably, Facebook uses HBase for a variety of purposes, including Messages, search indexing, and stream analysis. Twitter uses it to power its people-search capability, for monitoring and performance data, and more. Airbnb uses it as part of its real-time stream processing stack. Apple uses it for...something, though they won’t publicly say what. The parade of companies using HBase also includes the likes of eBay, Meetup, Ning, Yahoo!, and many others.

With all of this activity, new versions of HBase are coming out at a fairly rapid clip. At the time of this writing, the current stable version is 1.2.1, so that’s what you’ll be using. Go ahead and download HBase, and we’ll get started.