Table of Contents for
Seven Databases in Seven Weeks, 2nd Edition


Seven Databases in Seven Weeks, 2nd Edition, by Luc Perkins with Eric Redmond and Jim R. Wilson. Published by Pragmatic Bookshelf, 2018.
  Title Page
  Acknowledgments
  Preface
    Why a NoSQL Book
    Why Seven Databases
    What’s in This Book
    What This Book Is Not
    Code Examples and Conventions
    Credits
    Online Resources
  1. Introduction
    It Starts with a Question
    The Genres
    Onward and Upward
  2. PostgreSQL
    That’s Post-greS-Q-L
    Day 1: Relations, CRUD, and Joins
    Day 2: Advanced Queries, Code, and Rules
    Day 3: Full Text and Multidimensions
    Wrap-Up
  3. HBase
    Introducing HBase
    Day 1: CRUD and Table Administration
    Day 2: Working with Big Data
    Day 3: Taking It to the Cloud
    Wrap-Up
  4. MongoDB
    Hu(mongo)us
    Day 1: CRUD and Nesting
    Day 2: Indexing, Aggregating, Mapreduce
    Day 3: Replica Sets, Sharding, GeoSpatial, and GridFS
    Wrap-Up
  5. CouchDB
    Relaxing on the Couch
    Day 1: CRUD, Fauxton, and cURL Redux
    Day 2: Creating and Querying Views
    Day 3: Advanced Views, Changes API, and Replicating Data
    Wrap-Up
  6. Neo4J
    Neo4j Is Whiteboard Friendly
    Day 1: Graphs, Cypher, and CRUD
    Day 2: REST, Indexes, and Algorithms
    Day 3: Distributed High Availability
    Wrap-Up
  7. DynamoDB
    DynamoDB: The “Big Easy” of NoSQL
    Day 1: Let’s Go Shopping!
    Day 2: Building a Streaming Data Pipeline
    Day 3: Building an “Internet of Things” System Around DynamoDB
    Wrap-Up
  8. Redis
    Data Structure Server Store
    Day 1: CRUD and Datatypes
    Day 2: Advanced Usage, Distribution
    Day 3: Playing with Other Databases
    Wrap-Up
  9. Wrapping Up
    Genres Redux
    Making a Choice
    Where Do We Go from Here?
  A1. Database Overview Tables
  A2. The CAP Theorem
    Eventual Consistency
    CAP in the Wild
    The Latency Trade-Off
  Bibliography

Day 3: Taking It to the Cloud

On Days 1 and 2, you got quite a lot of hands-on experience using HBase in standalone mode. Our experimentation so far has focused on accessing a single local server. But in reality, if you choose to use HBase, you’ll want to have a good-sized cluster in order to realize the performance benefits of its distributed architecture. And nowadays, there’s also an increasingly high chance that you’ll want to run it in the cloud.

Here on Day 3, let’s turn our attention toward operating and interacting with a remote HBase cluster. First, you’ll deploy an HBase cluster on Amazon Web Services’ Elastic MapReduce platform (more commonly known as AWS and EMR, respectively) using AWS’s command-line tool, appropriately named aws. Then, you’ll connect directly to your remote HBase cluster using Secure Shell (SSH) and perform some basic operations.

Initial AWS and EMR Setup

EMR is a managed Hadoop platform on AWS. It enables you to run a wide variety of systems in the Hadoop ecosystem—Hive, Pig, HBase, and many others—on EC2 without having to deal with most of the nitty-gritty details usually associated with managing those systems.

Before you can get started spinning up an HBase cluster, you’ll need to sign up for an AWS account.[19] Once you’ve created an account, log into the IAM service in the AWS console[20] and create a new user by clicking Add User.

During the user creation process, select “Programmatic access” and then click “Attach existing policies directly.” Select the following policies: IAMFullAccess, AmazonEC2FullAccess, and AmazonElasticMapReduceFullAccess. Then, fetch your AWS access key and secret key from a different section of the console.[21] With that information in hand, install the aws tool using pip and then run aws --version to ensure that the tool installed properly. To configure the client, just run:

 $ aws configure

This will prompt you to enter your access key and secret key, plus two other pieces of information: a default region name (basically, which AWS datacenter you’d like to use) and a default output format. Enter us-east-1 and json, respectively (though feel free to select a different region if you’d like; the authors happen to be partial to us-west-2 in Oregon). To make sure that your setup is now in place, run aws emr list-clusters, which lists the clusters you’ve created in EMR. That should return an empty list:

 {
   "Clusters": []
 }
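
If any of that setup felt rushed, here’s a consolidated sketch of the install-and-configure steps. It assumes Python’s pip is available and that the CLI package is named awscli; aws configure set is the non-interactive cousin of aws configure (the access and secret keys themselves are easiest to enter via the interactive prompts):

 $ pip install awscli                   # install the AWS CLI (PyPI package: awscli)
 $ aws --version                        # confirm the install worked
 $ aws configure set region us-east-1   # set the default region without prompts
 $ aws configure set output json        # set the default output format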

In AWS, your ability to perform actions is based on which roles you possess as a user. We won’t delve into service access control here. For our purposes, you just need to create a set of roles that enable you to access EMR and to spin up, manage, and finally access clusters. You can create the necessary roles with one convenient built-in command:

 $ aws emr create-default-roles

In a little while, once your HBase cluster is up and running, you’ll need to be able to access it remotely from your own machine. In AWS, direct access to remote processes is typically handled over SSH. You’ll need to create a new SSH key pair, upload it to AWS, and then specify that key pair by name when you create your cluster. Use these commands to create a key pair in your ~/.ssh directory and assign it restrictive permissions:

 $ aws ec2 create-key-pair \
   --key-name HBaseShell \
   --query 'KeyMaterial' \
   --output text > ~/.ssh/hbase-shell-key.pem
 $ chmod 400 ~/.ssh/hbase-shell-key.pem

Now you have a key pair stored in the hbase-shell-key.pem file that you can use later to SSH into your cluster. To verify that it was created successfully, run:

 $ aws ec2 describe-key-pairs
 {
   "KeyPairs": [
     {
       "KeyName": "HBaseShell",
       "KeyFingerprint": "1a:2b:3c:4d:1a:..."
     }
   ]
 }

Creating the Cluster

Now that the initial configuration detour is out of the way, you can get your hands dirty and create your HBase cluster.

 $ aws emr create-cluster \
   --name "Seven DBs example cluster" \
   --release-label emr-5.3.1 \
   --ec2-attributes KeyName=HBaseShell \
   --use-default-roles \
   --instance-type m1.large \
   --instance-count 3 \
   --applications Name=HBase

That’s a pretty intricate shell command! Let’s break down some of the non-obvious parts.

  • --release-label specifies which release of EMR you’re working with.

  • --ec2-attributes specifies which key pair you want to use to create the cluster (which will enable you to have SSH access later).

  • --instance-type specifies which type of machine you want your cluster to run on.

  • --instance-count is the number of machines you want in the cluster (by default, 3 instances means one master node and two slave nodes; for explicit control over that split, see the --instance-groups sketch after this list).

  • --use-default-roles means that you’re using the default roles you created a minute ago.

  • --applications determines which Hadoop application you’ll install (just HBase for us).
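
If you’d rather spell out the master/core split explicitly instead of relying on --instance-count, the CLI also accepts an --instance-groups form. Here’s a hedged sketch of an equivalent cluster definition under the same assumptions as the command above:

 $ aws emr create-cluster \
   --name "Seven DBs example cluster" \
   --release-label emr-5.3.1 \
   --ec2-attributes KeyName=HBaseShell \
   --use-default-roles \
   --applications Name=HBase \
   --instance-groups \
     InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.large \
     InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.large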

If create-cluster is successful, you should get a JSON object back that displays the ID of the cluster. Here’s an example ID:

 {
   "ClusterId": "j-1MFV1QTNSBTD8"
 }

For convenience, store the cluster ID in your environment so it’s easier to use in later shell commands. This is always a good practice when working with AWS on the command line, as almost everything has a randomly generated identifier.

 $ export CLUSTER_ID=j-1MFV1QTNSBTD8

You can verify that the cluster has been created by listing all of the clusters associated with your user account.

 $ aws emr list-clusters

That command should now return a JSON object like this:

 {
   "Clusters": [
     {
       "Status": {
         "Timeline": {
           "CreationDateTime": 1487455208.825
         },
         "State": "STARTING",
         "StateChangeReason": {}
       },
       "NormalizedInstanceHours": 0,
       "Id": "j-1MFV1QTNSBTD8",
       "Name": "Seven DBs example cluster"
     }
   ]
 }

At this point, your cluster has been created, but it will take a while to actually start, usually several minutes. Run this command, which checks the current status of the cluster every five seconds (you should see "STARTING" at first):

 $ while true; do
     aws emr describe-cluster \
       --cluster-id ${CLUSTER_ID} \
       --query Cluster.Status.State
     sleep 5
   done
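
If you’d rather not hand-roll a polling loop (which you’ll have to Ctrl-C out of yourself), the aws CLI also ships with waiters. This sketch, assuming your CLI version includes the EMR waiters, blocks until the cluster reaches a running state:

 $ aws emr wait cluster-running \
   --cluster-id ${CLUSTER_ID}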

Again, this could take a while, so take a coffee break, read some EMR documentation, or do whatever you feel like. Once the state of the cluster turns to "WAITING", it should be ready to go. You can now inspect all three machines running in the cluster (one master and two slave nodes):

 $ aws emr list-instances \
   --cluster-id ${CLUSTER_ID}

Each instance has a configuration object associated with it that tells you its current status (RUNNING, TERMINATED, and so on), DNS name, ID, private IP address, and more.
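
If you only care about a few of those fields, a --query expression can trim the output. The field names below follow the list-instances output we’ve seen, but treat the exact paths as an assumption to verify against your CLI version:

 $ aws emr list-instances \
   --cluster-id ${CLUSTER_ID} \
   --query 'Instances[].{Id:Ec2InstanceId,DNS:PublicDnsName,State:Status.State}'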

Enabling Access to the Cluster

You have just one last step before you can access your HBase cluster via SSH. You need to authorize TCP ingress into the master node of the cluster. To do that, you need to get an identifier for the security group that it belongs to:

 $ aws emr describe-cluster \
   --cluster-id ${CLUSTER_ID} \
   --query Cluster.Ec2InstanceAttributes.EmrManagedMasterSecurityGroup

That should return something like sg-bd63e1ab. Set the SECURITY_GROUP_ID environment variable to that value. Now, run a command instructing EC2 (which controls the machines running the cluster) to allow TCP ingress on port 22 (used for SSH) from your current machine’s IP address, which you can also set as an environment variable:

 $ export MY_CIDR=$(dig +short myip.opendns.com @resolver1.opendns.com.)/32
 $ aws ec2 authorize-security-group-ingress \
   --group-id ${SECURITY_GROUP_ID} \
   --protocol tcp \
   --port 22 \
   --cidr $MY_CIDR

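To double-check that the rule took effect, you can inspect the security group’s inbound permissions; you should see a TCP entry for port 22 with your CIDR in it. This is a verification sketch, not a required step:

 $ aws ec2 describe-security-groups \
   --group-ids ${SECURITY_GROUP_ID} \
   --query 'SecurityGroups[0].IpPermissions'
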
Finally, you can SSH into the cluster with the handy emr ssh command, pointing it at your local SSH key and the correct cluster:

 $ aws emr ssh \
   --cluster-id ${CLUSTER_ID} \
   --key-pair-file ~/.ssh/hbase-shell-key.pem

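If emr ssh ever gives you trouble, plain ssh works too. This sketch assumes the default EMR login user is hadoop and pulls the master node’s public DNS name from describe-cluster:

 $ export MASTER_DNS=$(aws emr describe-cluster \
     --cluster-id ${CLUSTER_ID} \
     --query Cluster.MasterPublicDnsName \
     --output text)
 $ ssh -i ~/.ssh/hbase-shell-key.pem hadoop@${MASTER_DNS}
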
Once the SSH connection is established, you should see a huge ASCII banner whiz by before you’re dropped into a remote shell. Now you can open the HBase shell:

 $ hbase shell

If you then see a shell prompt like hbase(main):001:0> pop up in your CLI, you’ve made it! You’re now using your own machine as a portal into an HBase cluster running in a datacenter far away (or maybe close by; pretty cool either way). Run a couple other HBase commands from previous exercises for fun:

 hbase(main):001:0> version
 hbase(main):002:0> status
 hbase(main):003:0> create 'messages', 'text'
 hbase(main):004:0> put 'messages', 'arrival', 'text:', 'HBase: now on AWS!'
 hbase(main):005:0> get 'messages', 'arrival'

As we mentioned before, always bear in mind that AWS costs money. The exercise that you went through today most likely cost less than a latte at the corner coffee shop. You’re free to leave the cluster running, especially if you want to do the Day 3 homework in the next section. You can shut your cluster down at any time using the terminate-clusters command:

 $ aws emr terminate-clusters \
   --cluster-ids ${CLUSTER_ID}

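Termination isn’t instantaneous. If you want to confirm the cluster is really gone, the same status query from earlier should eventually report "TERMINATED":

 $ aws emr describe-cluster \
   --cluster-id ${CLUSTER_ID} \
   --query Cluster.Status.State
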
Day 3 Wrap-Up

Today you stepped outside of your own machine and installed an HBase cluster in an AWS datacenter, connected your local machine to the remote cluster, played with some of the HBase shell commands that you learned on Day 1, and learned a bit about interacting with AWS services via the command line. This will come in handy when you work with Amazon’s DynamoDB and a variety of other AWS services.

Day 3 Homework

For the Find section of today’s homework, you’ll be digging through the AWS documentation. For the Do section, leave your HBase cluster running on EMR with the HBase shell open. Just remember to terminate the cluster when you’re done!

Find

  1. Use the help interface of the aws CLI tool to see which commands are available under the emr subcommand. Read through the help material for some of these commands to get a sense of the capabilities offered by EMR that we didn’t cover in today’s cluster-building exercise. Pay special attention to scaling-related commands.

  2. Go to the EMR documentation at https://aws.amazon.com/documentation/emr and read up on how to use Simple Storage Service (S3) as a data store for HBase clusters.

Do

  1. In your HBase shell that you’re accessing via SSH, run some of the cluster metadata commands we explored on Day 2, such as scan 'hbase:meta'. Make note of anything that’s fundamentally different from what you saw when running HBase locally in standalone mode.

  2. Navigate around the EMR section of your AWS browser console[22] and find the console specific to your running HBase cluster. Resize your cluster down to just two machines by removing one of the slave nodes (known as core nodes). Then increase the cluster size back to three (with two slave/core nodes).

  3. Resizing a cluster in the AWS console is nice, but that’s not an automatable approach. The aws CLI tool enables you to resize a cluster programmatically. Consult the docs for the emr modify-instance-groups command by running aws emr modify-instance-groups help to find out how this works. Remove a machine from your cluster using that command.