Now that we understand how to execute basic HBase operations via the shell, let's walk through the same operations using the Java API:
Configuration conf = HBaseConfiguration.create();
Connection conn = ConnectionFactory.createConnection(conf);
The recommended way in which the configuration should be provided to an HBase client application is to copy over the hbase-site.xml from the cluster and make it available on the classpath of the client application (typically included in src/main/resources).
The HBaseConfiguration class reads the hbase-site.xml and populates properties such as the Zookeeper quorum hosts and ports, within a Configuration object.
The ConnectionFactory class handles the lifecycle management of Connections to an HBase cluster. The Connection class encapsulates TCP connections to the RegionServers, as well as a local cache of the META region, which contains the region assignments.
Connections are heavyweight objects. Thankfully, they are also thread safe, so Connection objects only need to be created once per service lifetime, and are reused on every request, whether it's a DDL or a DML action.
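One way to enforce this is a simple holder class that creates the Connection lazily and shares the same instance across threads. The class below is an illustrative sketch, not part of the HBase API:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Illustrative holder (not an HBase class): creates the Connection once,
// then hands the same thread-safe instance to every caller.
public final class HBaseConnectionHolder {

    private static volatile Connection connection;

    private HBaseConnectionHolder() {}

    public static Connection get() throws IOException {
        if (connection == null) {
            synchronized (HBaseConnectionHolder.class) {
                if (connection == null) {
                    // Picks up hbase-site.xml from the classpath
                    connection = ConnectionFactory.createConnection(
                            HBaseConfiguration.create());
                }
            }
        }
        return connection;
    }

    // Call once at service shutdown
    public static void close() throws IOException {
        if (connection != null) {
            connection.close();
        }
    }
}
```

Request handlers then call HBaseConnectionHolder.get() instead of ConnectionFactory.createConnection(), so the expensive setup happens exactly once.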
Failing to ensure that connections are created only once at service startup, and instead creating a new Connection on every request, is a common mistake that puts quite a bit of stress on the cluster:
Admin admin = conn.getAdmin();
HTableDescriptor descriptor =
    new HTableDescriptor(TableName.valueOf("sensor_telemetry"));
descriptor.addFamily(new HColumnDescriptor("metrics"));
admin.createTable(descriptor);
Once you have the Connection object, an Admin object is what you need to execute DDL operations, such as creating a table or altering the attributes of an existing table. Admin objects are not thread safe, but are thankfully lightweight to create for each DDL operation. The HTableDescriptor is simply a holding object for all of the attributes an HBase table can be created with:
Table table = conn.getTable(TableName.valueOf("sensor_telemetry"));
String key = "/94555/20170308/18:30";
Double temperature = 65.0;
Put sensorReading = new Put(Bytes.toBytes(key));
sensorReading.addColumn(Bytes.toBytes("metrics"),
    Bytes.toBytes("temperature"), Bytes.toBytes(temperature));
table.put(sensorReading);
This code snippet gets a Table object, which, like the Admin object, is not thread safe but is lightweight to create. Just as one common mistake is creating a Connection on each request, putting pressure on the RegionServers, another is going to great lengths to reuse Table objects across requests: since Table objects are not thread safe, developers stash them in thread-local variables. This is overkill. It's quite okay to create a Table object for each read/write request and discard it after the request has been serviced:
String key = "/94555/20170308/18:30";
Result result = table.get(new Get(Bytes.toBytes(key)));
Double temperature = Bytes.toDouble(
    result.getValue(Bytes.toBytes("metrics"), Bytes.toBytes("temperature")));
This code snippet should be fairly self-explanatory. We are providing a row key and getting back the temperature value:
Scan scan = new Scan();
ResultScanner scanner = table.getScanner(scan);
for (Result sensorReading : scanner) {
Double temperature = Bytes.toDouble(
    sensorReading.getValue(Bytes.toBytes("metrics"), Bytes.toBytes("temperature")));
}
This code snippet initiates a scan on an HBase table. Once we get a scanner object, we use it to step through the results of the scan. Iterating using the scanner object gives you a Result row. The Result row object can be used to extract individual column values.
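Scans don't have to cover the whole table. Because HBase stores rows sorted lexicographically by key, a scan can be restricted to a key range; with the key layout used earlier (/zip/date/time), the sketch below reads only the 94555 readings for a single day. It assumes an open Table named table, obtained as in the previous snippets:

```java
// Scan the half-open key range [start, stop): the stop row is exclusive,
// so this covers every reading under /94555/20170308/.
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("/94555/20170308/"));
scan.setStopRow(Bytes.toBytes("/94555/20170309/"));
ResultScanner scanner = table.getScanner(scan);
try {
    for (Result sensorReading : scanner) {
        Double temperature = Bytes.toDouble(
                sensorReading.getValue(Bytes.toBytes("metrics"),
                        Bytes.toBytes("temperature")));
    }
} finally {
    scanner.close(); // always release the scanner's server-side resources
}
```

A range scan like this only touches the regions that hold keys in the range, rather than sweeping the entire table.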
Before executing the scan, it's often important to disable server-side block caching for it, which is done on the Scan object rather than on the scanner:
Scan scan = new Scan();
scan.setCacheBlocks(false);
ResultScanner scanner = table.getScanner(scan);
When executing a large scan, data blocks are read off the disk and brought into memory. By default, these blocks are cached. However, a large sequential scan is unlikely to access the same blocks again, so not only is caching them unhelpful, it may also evict other, genuinely useful blocks from the cache. Hence, we should turn off server-side caching of the blocks read by a scan operation.
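Two similarly named Scan knobs are easy to confuse here: Scan.setCacheBlocks(boolean) controls the server-side block cache just discussed, while Scan.setCaching(int) controls how many rows the client fetches per RPC round trip. A short sketch, with an illustrative row count:

```java
Scan scan = new Scan();
// Don't pollute the server-side block cache with a one-off sequential read
scan.setCacheBlocks(false);
// Fetch 500 rows per RPC: fewer round trips, at the cost of client memory
scan.setCaching(500);
```

Tuning the caching value up reduces RPC overhead for large scans, but each fetched batch is held in client memory until consumed.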
Now that we understand how scans can be executed through the API, let's try to understand how to define a filter and execute a scan with the filter:
FilterList filters = new FilterList();
SingleColumnValueFilter filter =
new SingleColumnValueFilter(
Bytes.toBytes("metrics"),
Bytes.toBytes("temperature"),
CompareOp.EQUAL,
Bytes.toBytes(65.0d));
filter.setFilterIfMissing(true);
filters.addFilter(filter);
Scan scan = new Scan();
scan.setFilter(filters);
scan.setCacheBlocks(false);
ResultScanner scanner = table.getScanner(scan);
for (Result sensorReading : scanner) {
Double temperature = Bytes.toDouble(
    sensorReading.getValue(Bytes.toBytes("metrics"), Bytes.toBytes("temperature")));
}
In this code snippet, we define a SingleColumnValueFilter to return only the rows where the metrics:temperature column has the value 65. The setFilterIfMissing(true) call additionally excludes rows that don't have the metrics:temperature column at all; without it, such rows would pass the filter. We store this filter in a FilterList object. As you might expect, we can chain multiple filters together within the same FilterList object (and control whether the filters are applied conjunctively or disjunctively). We then associate the FilterList with the scan object and execute the scan as before.
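As a sketch of that conjunctive/disjunctive control, the FilterList below is built with the MUST_PASS_ONE operator, so a row matches if either filter accepts it (MUST_PASS_ALL would require both to accept); the 65.0 and 75.0 temperature values are illustrative:

```java
// OR the two filters together; MUST_PASS_ALL would AND them instead.
FilterList anyOf = new FilterList(FilterList.Operator.MUST_PASS_ONE);
anyOf.addFilter(new SingleColumnValueFilter(
        Bytes.toBytes("metrics"), Bytes.toBytes("temperature"),
        CompareOp.EQUAL, Bytes.toBytes(65.0d)));
anyOf.addFilter(new SingleColumnValueFilter(
        Bytes.toBytes("metrics"), Bytes.toBytes("temperature"),
        CompareOp.EQUAL, Bytes.toBytes(75.0d)));

Scan scan = new Scan();
scan.setFilter(anyOf);
```

FilterList itself implements Filter, so lists can even be nested inside one another to express arbitrary AND/OR trees.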