Table of Contents for
Seven Databases in Seven Weeks, 2nd Edition


Seven Databases in Seven Weeks, 2nd Edition, by Luc Perkins with Eric Redmond and Jim R. Wilson. Published by Pragmatic Bookshelf, 2018.
  Title Page
  Acknowledgments
  Preface
    Why a NoSQL Book
    Why Seven Databases
    What’s in This Book
    What This Book Is Not
    Code Examples and Conventions
    Credits
    Online Resources
  1. Introduction
    It Starts with a Question
    The Genres
    Onward and Upward
  2. PostgreSQL
    That’s Post-greS-Q-L
    Day 1: Relations, CRUD, and Joins
    Day 2: Advanced Queries, Code, and Rules
    Day 3: Full Text and Multidimensions
    Wrap-Up
  3. HBase
    Introducing HBase
    Day 1: CRUD and Table Administration
    Day 2: Working with Big Data
    Day 3: Taking It to the Cloud
    Wrap-Up
  4. MongoDB
    Hu(mongo)us
    Day 1: CRUD and Nesting
    Day 2: Indexing, Aggregating, Mapreduce
    Day 3: Replica Sets, Sharding, GeoSpatial, and GridFS
    Wrap-Up
  5. CouchDB
    Relaxing on the Couch
    Day 1: CRUD, Fauxton, and cURL Redux
    Day 2: Creating and Querying Views
    Day 3: Advanced Views, Changes API, and Replicating Data
    Wrap-Up
  6. Neo4J
    Neo4j Is Whiteboard Friendly
    Day 1: Graphs, Cypher, and CRUD
    Day 2: REST, Indexes, and Algorithms
    Day 3: Distributed High Availability
    Wrap-Up
  7. DynamoDB
    DynamoDB: The “Big Easy” of NoSQL
    Day 1: Let’s Go Shopping!
    Day 2: Building a Streaming Data Pipeline
    Day 3: Building an “Internet of Things” System Around DynamoDB
    Wrap-Up
  8. Redis
    Data Structure Server Store
    Day 1: CRUD and Datatypes
    Day 2: Advanced Usage, Distribution
    Day 3: Playing with Other Databases
    Wrap-Up
  9. Wrapping Up
    Genres Redux
    Making a Choice
    Where Do We Go from Here?
  A1. Database Overview Tables
  A2. The CAP Theorem
    Eventual Consistency
    CAP in the Wild
    The Latency Trade-Off
  Bibliography

Day 3: Taking It to the Cloud

On Days 1 and 2, you got quite a lot of hands-on experience using HBase in standalone mode. Our experimentation so far has focused on accessing a single local server. But in reality, if you choose to use HBase, you’ll want to have a good-sized cluster in order to realize the performance benefits of its distributed architecture. And nowadays, there’s also an increasingly high chance that you’ll want to run it in the cloud.

Here on Day 3, let’s turn our attention toward operating and interacting with a remote HBase cluster. First, you’ll deploy an HBase cluster on Amazon Web Services’ Elastic MapReduce platform (more commonly known as AWS and EMR, respectively) using AWS’s command-line tool, appropriately named aws. Then, you’ll connect directly to your remote HBase cluster using Secure Shell (SSH) and perform some basic operations.

Initial AWS and EMR Setup

EMR is a managed Hadoop platform on AWS. It enables you to run a wide variety of systems in the Hadoop ecosystem—Hive, Pig, HBase, and many others—on EC2 without having to deal with most of the nitty-gritty details usually associated with managing those systems.

Before you can get started spinning up an HBase cluster, you’ll need to sign up for an AWS account.[19] Once you’ve created an account, log into the IAM service in the AWS console[20] and create a new user by clicking Add User.

During the user creation process, select “Programmatic access” and then click “Attach existing policies directly.” Select the following policies: IAMFullAccess, AmazonEC2FullAccess, and AmazonElasticMapReduceFullAccess. Then, fetch your AWS access key and secret key from a different section of the console.[21] With that information in hand, install the aws tool using pip and then run aws --version to ensure that the tool installed properly. To configure the client, just run:

 $ aws configure

This will prompt you to enter your access key and secret key, plus two other pieces of information: a default region name (basically, which AWS datacenter you’d like to use) and a default output format. Enter us-east-1 and json, respectively (though feel free to select a different region if you’d like; the authors happen to be partial to us-west-2 in Oregon). To make sure that your setup is now in place, run aws emr list-clusters, which lists the clusters you’ve created in EMR. That should return an empty list:

 {
   "Clusters": []
 }
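
If any of that setup felt rushed, here’s a consolidated sketch of the install-and-configure steps. It assumes Python’s pip is available and that the CLI package is named awscli; aws configure set is the non-interactive cousin of aws configure (the access and secret keys themselves are easiest to enter via the interactive prompts):

 $ pip install awscli                   # install the AWS CLI (PyPI package: awscli)
 $ aws --version                        # confirm the install worked
 $ aws configure set region us-east-1   # set the default region without prompts
 $ aws configure set output json        # set the default output format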

In AWS, your ability to perform actions is based on which roles you possess as a user. We won’t delve into service access control here. For our purposes, you just need to create a set of roles that enable you to access EMR and to spin up, manage, and finally access clusters. You can create the necessary roles with one convenient built-in command:

 $ aws emr create-default-roles

In a little while, once your HBase cluster is up and running, you’ll need to be able to access it remotely from your own machine. In AWS, direct access to remote processes is typically handled over SSH. You’ll need to create a new SSH key pair, upload it to AWS, and then specify that key pair by name when you create your cluster. Use these commands to create a key pair in your ~/.ssh directory and assign it restrictive permissions:

 $ aws ec2 create-key-pair \
   --key-name HBaseShell \
   --query 'KeyMaterial' \
   --output text > ~/.ssh/hbase-shell-key.pem
 $ chmod 400 ~/.ssh/hbase-shell-key.pem

Now you have a key pair stored in the hbase-shell-key.pem file that you can use later to SSH into your cluster. To verify that it was created successfully, run:

 $ aws ec2 describe-key-pairs
 {
   "KeyPairs": [
     {
       "KeyName": "HBaseShell",
       "KeyFingerprint": "1a:2b:3c:4d:1a:..."
     }
   ]
 }

Creating the Cluster

Now that the initial configuration detour is out of the way, you can get your hands dirty and create your HBase cluster.

 $ aws emr create-cluster \
   --name "Seven DBs example cluster" \
   --release-label emr-5.3.1 \
   --ec2-attributes KeyName=HBaseShell \
   --use-default-roles \
   --instance-type m1.large \
   --instance-count 3 \
   --applications Name=HBase

That’s a pretty intricate shell command! Let’s break down some of the non-obvious parts.

  • --release-label specifies which release of EMR you’re working with.

  • --ec2-attributes specifies which key pair you want to use to create the cluster (which will enable you to have SSH access later).

  • --instance-type specifies which type of machine you want your cluster to run on.

  • --instance-count is the number of machines you want in the cluster (by default, 3 instances means one master node and two slave nodes; for explicit control over that split, see the --instance-groups sketch after this list).

  • --use-default-roles means that you’re using the default roles you created a minute ago.

  • --applications determines which Hadoop application you’ll install (just HBase for us).
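
If you’d rather spell out the master/core split explicitly instead of relying on --instance-count, the CLI also accepts an --instance-groups form. Here’s a hedged sketch of an equivalent cluster definition under the same assumptions as the command above:

 $ aws emr create-cluster \
   --name "Seven DBs example cluster" \
   --release-label emr-5.3.1 \
   --ec2-attributes KeyName=HBaseShell \
   --use-default-roles \
   --applications Name=HBase \
   --instance-groups \
     InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.large \
     InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.large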

If create-cluster is successful, you should get a JSON object back that displays the ID of the cluster. Here’s an example ID:

 {
   "ClusterId": "j-1MFV1QTNSBTD8"
 }

For convenience, store the cluster ID in your environment so it’s easier to use in later shell commands. This is always a good practice when working with AWS on the command line, as almost everything has a randomly generated identifier.

 $ export CLUSTER_ID=j-1MFV1QTNSBTD8

You can verify that the cluster has been created by listing all of the clusters associated with your user account.

 $ aws emr list-clusters

That command should now return a JSON object like this:

 {
   "Clusters": [
     {
       "Status": {
         "Timeline": {
           "CreationDateTime": 1487455208.825
         },
         "State": "STARTING",
         "StateChangeReason": {}
       },
       "NormalizedInstanceHours": 0,
       "Id": "j-1MFV1QTNSBTD8",
       "Name": "Seven DBs example cluster"
     }
   ]
 }

At this point, your cluster has been created, but it will take a while to actually start, usually several minutes. Run this command, which checks the current status of the cluster every five seconds (you should see "STARTING" at first):

 $ while true; do
     aws emr describe-cluster \
       --cluster-id ${CLUSTER_ID} \
       --query Cluster.Status.State
     sleep 5
   done
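
If you’d rather not hand-roll a polling loop (which you’ll have to Ctrl-C out of yourself), the aws CLI also ships with waiters. This sketch, assuming your CLI version includes the EMR waiters, blocks until the cluster reaches a running state:

 $ aws emr wait cluster-running \
   --cluster-id ${CLUSTER_ID}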

Again, this could take a while, so take a coffee break, read some EMR documentation, or do whatever you feel like. Once the state of the cluster turns to "WAITING", it should be ready to go. You can now inspect all three machines running in the cluster (one master and two slave nodes):

 $ aws emr list-instances \
   --cluster-id ${CLUSTER_ID}

Each instance has a configuration object associated with it that tells you its current status (RUNNING, TERMINATED, and so on), DNS name, ID, private IP address, and more.
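
If you only care about a few of those fields, a --query expression can trim the output. The field names below follow the list-instances output we’ve seen, but treat the exact paths as an assumption to verify against your CLI version:

 $ aws emr list-instances \
   --cluster-id ${CLUSTER_ID} \
   --query 'Instances[].{Id:Ec2InstanceId,DNS:PublicDnsName,State:Status.State}'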

Enabling Access to the Cluster

You have just one last step before you can access your HBase cluster via SSH. You need to authorize TCP ingress into the master node of the cluster. To do that, you need to get an identifier for the security group that it belongs to:

 $ aws emr describe-cluster \
   --cluster-id ${CLUSTER_ID} \
   --query Cluster.Ec2InstanceAttributes.EmrManagedMasterSecurityGroup

That should return something like sg-bd63e1ab. Set the SECURITY_GROUP_ID environment variable to that value. Now, run a command instructing EC2 (which controls the machines running the cluster) to allow TCP ingress on port 22 (used for SSH) from your current machine’s IP address, which you can also set as an environment variable:

 $ export MY_CIDR=$(dig +short myip.opendns.com @resolver1.opendns.com.)/32
 $ aws ec2 authorize-security-group-ingress \
   --group-id ${SECURITY_GROUP_ID} \
   --protocol tcp \
   --port 22 \
   --cidr $MY_CIDR

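To double-check that the rule took effect, you can inspect the security group’s inbound permissions; you should see a TCP entry for port 22 with your CIDR in it. This is a verification sketch, not a required step:

 $ aws ec2 describe-security-groups \
   --group-ids ${SECURITY_GROUP_ID} \
   --query 'SecurityGroups[0].IpPermissions'
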
Finally, you can SSH into the cluster with the handy emr ssh command, pointing it at your local SSH key and the correct cluster:

 $ aws emr ssh \
   --cluster-id ${CLUSTER_ID} \
   --key-pair-file ~/.ssh/hbase-shell-key.pem

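If emr ssh ever gives you trouble, plain ssh works too. This sketch assumes the default EMR login user is hadoop and pulls the master node’s public DNS name from describe-cluster:

 $ export MASTER_DNS=$(aws emr describe-cluster \
     --cluster-id ${CLUSTER_ID} \
     --query Cluster.MasterPublicDnsName \
     --output text)
 $ ssh -i ~/.ssh/hbase-shell-key.pem hadoop@${MASTER_DNS}
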
Once the SSH connection is established, you should see a huge ASCII banner whiz by before you’re dropped into a remote shell. Now you can open the HBase shell:

 $ hbase shell

If you then see a shell prompt like hbase(main):001:0> pop up in your CLI, you’ve made it! You’re now using your own machine as a portal into an HBase cluster running in a datacenter far away (or maybe close by; pretty cool either way). Run a couple other HBase commands from previous exercises for fun:

 hbase(main):001:0> version
 hbase(main):002:0> status
 hbase(main):003:0> create 'messages', 'text'
 hbase(main):004:0> put 'messages', 'arrival', 'text:', 'HBase: now on AWS!'
 hbase(main):005:0> get 'messages', 'arrival'

As we mentioned before, always bear in mind that AWS costs money. The exercise that you went through today most likely cost less than a latte at the corner coffee shop. You’re free to leave the cluster running, especially if you want to do the Day 3 homework in the next section. You can shut your cluster down at any time using the terminate-clusters command:

 $ aws emr terminate-clusters \
   --cluster-ids ${CLUSTER_ID}

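Termination isn’t instantaneous. If you want to confirm the cluster is really gone, the same status query from earlier should eventually report "TERMINATED":

 $ aws emr describe-cluster \
   --cluster-id ${CLUSTER_ID} \
   --query Cluster.Status.State
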
Day 3 Wrap-Up

Today you stepped outside of your own machine and installed an HBase cluster in an AWS datacenter, connected your local machine to the remote cluster, played with some of the HBase shell commands that you learned on Day 1, and learned a bit about interacting with AWS services via the command line. This will come in handy when you work with Amazon’s DynamoDB and a variety of other AWS services.

Day 3 Homework

For the Find section of today’s homework, you’ll be digging through the AWS documentation. For the Do section, leave your HBase cluster running on EMR with the HBase shell open. Just remember to terminate the cluster when you’re done!

Find

  1. Use the help interface of the aws CLI tool to see which commands are available under the emr subcommand. Read through the help material for some of these commands to get a sense of the capabilities offered by EMR that we didn’t cover in today’s cluster-building exercise. Pay special attention to scaling-related commands.

  2. Go to the EMR documentation at https://aws.amazon.com/documentation/emr and read up on how to use Simple Storage Service (S3) as a data store for HBase clusters.

Do

  1. In your HBase shell that you’re accessing via SSH, run some of the cluster metadata commands we explored on Day 2, such as scan 'hbase:meta'. Make note of anything that’s fundamentally different from what you saw when running HBase locally in standalone mode.

  2. Navigate around the EMR section of your AWS browser console[22] and find the console specific to your running HBase cluster. Resize your cluster down to just two machines by removing one of the slave nodes (known as core nodes). Then increase the cluster size back to three (with two slave/core nodes).

  3. Resizing a cluster in the AWS console is nice, but that’s not an automatable approach. The aws CLI tool enables you to resize a cluster programmatically. Consult the docs for the emr modify-instance-groups command by running aws emr modify-instance-groups help to find out how this works. Remove a machine from your cluster using that command.