Table of Contents for
Seven Databases in Seven Weeks, 2nd Edition


Seven Databases in Seven Weeks, 2nd Edition, by Luc Perkins with Eric Redmond and Jim R. Wilson. Published by Pragmatic Bookshelf, 2018.
  Title Page
  Acknowledgments
  Preface
      Why a NoSQL Book
      Why Seven Databases
      What’s in This Book
      What This Book Is Not
      Code Examples and Conventions
      Credits
      Online Resources
  1. Introduction
      It Starts with a Question
      The Genres
      Onward and Upward
  2. PostgreSQL
      That’s Post-greS-Q-L
      Day 1: Relations, CRUD, and Joins
      Day 2: Advanced Queries, Code, and Rules
      Day 3: Full Text and Multidimensions
      Wrap-Up
  3. HBase
      Introducing HBase
      Day 1: CRUD and Table Administration
      Day 2: Working with Big Data
      Day 3: Taking It to the Cloud
      Wrap-Up
  4. MongoDB
      Hu(mongo)us
      Day 1: CRUD and Nesting
      Day 2: Indexing, Aggregating, Mapreduce
      Day 3: Replica Sets, Sharding, GeoSpatial, and GridFS
      Wrap-Up
  5. CouchDB
      Relaxing on the Couch
      Day 1: CRUD, Fauxton, and cURL Redux
      Day 2: Creating and Querying Views
      Day 3: Advanced Views, Changes API, and Replicating Data
      Wrap-Up
  6. Neo4J
      Neo4j Is Whiteboard Friendly
      Day 1: Graphs, Cypher, and CRUD
      Day 2: REST, Indexes, and Algorithms
      Day 3: Distributed High Availability
      Wrap-Up
  7. DynamoDB
      DynamoDB: The “Big Easy” of NoSQL
      Day 1: Let’s Go Shopping!
      Day 2: Building a Streaming Data Pipeline
      Day 3: Building an “Internet of Things” System Around DynamoDB
      Wrap-Up
  8. Redis
      Data Structure Server Store
      Day 1: CRUD and Datatypes
      Day 2: Advanced Usage, Distribution
      Day 3: Playing with Other Databases
      Wrap-Up
  9. Wrapping Up
      Genres Redux
      Making a Choice
      Where Do We Go from Here?
  A1. Database Overview Tables
  A2. The CAP Theorem
      Eventual Consistency
      CAP in the Wild
      The Latency Trade-Off
  Bibliography

Wrap-Up

HBase is a juxtaposition of simplicity and complexity. Its data storage model is pretty straightforward, with a lot of flexibility and just a few built-in schema constraints. A major barrier to understanding HBase, though, stems from the fact that many terms are overloaded with baggage from the relational world (such as table and column). Schema design in HBase typically boils down to deciding on the performance characteristics that you want to apply to your tables and columns, which is pretty far afield from the relational world, where things usually hinge upon table design.
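To make that concrete, here is a rough sketch of what schema design tends to look like in the HBase shell: most of the decisions are options applied per column family when you create the table. The table and family names below are hypothetical, not taken from this chapter's examples.

    # Hypothetical table: tuning happens per column family,
    # not through column definitions as in a relational schema.
    create 'articles',
      { NAME => 'text', VERSIONS => 5, COMPRESSION => 'GZ' },  # keep five versions, compress store files
      { NAME => 'meta', VERSIONS => 1 }                        # metadata only needs the latest value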

HBase’s Strengths

Noteworthy features of HBase include a robust scale-out architecture and built-in versioning and compression capabilities. HBase’s built-in versioning capability can be a compelling feature for certain use cases. Keeping the version history of wiki pages is a crucial feature for policing and maintenance, for instance. By choosing HBase, you don’t have to take any special steps to implement page history—you get it for free. No other database in this book offers that out of the box.
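As a quick, hedged illustration (the cell coordinates here are hypothetical, loosely in the spirit of the wiki example), reading older versions is just a parameter on an ordinary get in the HBase shell:

    # Each cell keeps timestamped versions, up to the column family's VERSIONS setting.
    get 'wiki', 'Home', { COLUMN => 'text:', VERSIONS => 3 }
    # Returns up to three timestamped revisions of the page text, newest first.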

On the performance front, HBase is meant to scale out. If you have huge amounts of data, measured in many terabytes or more, HBase may be for you. HBase is rack aware, replicating data within and between datacenter racks so that node failures can be handled gracefully and quickly.

The HBase community is pretty awesome. There’s almost always somebody on the #hbase IRC channel,[23] on HBase’s dedicated Slack channel,[24] or on the mailing list[25] ready to help with questions and get you pointed in the right direction.

HBase’s Weaknesses

Although HBase is designed to scale out, it doesn’t scale down. The HBase community seems to agree that five nodes is the minimum number you’ll want to use. Because it’s designed to be big, it can also be harder to administer (though platforms like EMR, which you saw in Day 3, do provide some good managed options). Solving small problems isn’t what HBase is about, and nonexpert documentation is tough to come by, which steepens the learning curve.

Additionally, HBase is almost never deployed alone. Instead, it is usually used in conjunction with other scale-ready infrastructure pieces. These include Hadoop (an implementation of Google’s MapReduce), the Hadoop Distributed File System (HDFS), ZooKeeper (a headless service that aids internode coordination), and Apache Spark (a popular cluster computing platform). This ecosystem is both a strength and a weakness; it provides an ever-expanding set of powerful tools, but making them all work in conjunction with one another across many machines—sometimes in the thousands—can be quite cumbersome.

One noteworthy characteristic of HBase is that it doesn’t offer any sorting or indexing capabilities aside from the row keys. Rows are kept in sorted order by their row keys, but no such sorting is done on any other field, such as column names and values. So, if you want to find rows by something other than their key, you need to either scan the table or maintain your own index (perhaps in a separate HBase table or in an external system).
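In practice (a sketch with hypothetical table, column, and value names), that leaves two options in the shell: hand the work to a filtered full scan, or write and maintain index rows of your own.

    # Option 1: scan the whole table with a server-side value filter (no index is used).
    scan 'wiki', { COLUMNS => ['revision:author'],
                   FILTER  => "ValueFilter(=, 'binary:some_author')" }

    # Option 2: maintain a separate index table keyed by the value you query on.
    put 'wiki_by_author', 'some_author|Home', 'index:page', 'Home'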

Another missing concept is datatypes. All field values in HBase are treated as uninterpreted arrays of bytes. There is no distinction between, say, an integer value, a string, and a date. They’re all bytes to HBase, so it’s up to your application to interpret the bytes, which can be tricky, especially if you’re used to relational access patterns like object-relational mappers (ORMs).
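A short shell session makes the point (hypothetical cells; the output lines are abbreviated and approximate). HBase stores whatever bytes it is handed, and even a counter written with incr comes back as raw bytes unless the client decodes it.

    # The string '42' is stored as its bytes; nothing records that it was meant as a number.
    put 'wiki', 'Home', 'revision:views_label', '42'

    # incr stores an eight-byte big-endian long; a plain get shows bytes, not a number.
    incr 'wiki', 'Home', 'revision:views', 1
    get 'wiki', 'Home', 'revision:views'
    # => value=\x00\x00\x00\x00\x00\x00\x00\x01
    get_counter 'wiki', 'Home', 'revision:views'   # shell helper that decodes the long
    # => COUNTER VALUE = 1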

HBase on CAP

With respect to CAP, HBase is decidedly CP (for more information on the CAP theorem, see Appendix 2, The CAP Theorem). HBase makes strong consistency guarantees. If a client succeeds in writing a value, other clients will receive the updated value on the next request. Some databases allow you to tweak the CAP equation on a per-operation basis. Not so with HBase. In the face of reasonable amounts of partitioning—for example, a node failing—HBase will remain available, shunting the responsibility off to other nodes in the cluster. However, in the pathological case, where only one node is left alive, HBase has no choice but to refuse requests.

The CAP discussion gets a little more complex when you introduce cluster-to-cluster replication, an advanced feature we didn’t cover in this chapter. A typical multicluster setup could have clusters separated geographically by some distance. In this case, for a given column family, one cluster is the system of record, while the other clusters merely provide access to the replicated data. This system is eventually consistent because the replication clusters will serve up the most recent values they’re aware of, which may not be the most recent values in the master cluster.

Parting Thoughts

HBase can be quite a challenge at first. The terminology is often deceptively reassuring, and the installation and configuration are not for the faint of heart. On the plus side, some of the features HBase offers, such as versioning and compression, are quite unique. These aspects can make HBase quite appealing for solving certain problems. And of course, it scales out to many nodes of commodity hardware quite well. All in all, HBase—like a nail gun—is a pretty big tool, so watch your thumbs.