HBase has a powerful feature called coprocessors. We introduce it briefly here so that the reader is aware of it; deeper coverage is outside the scope of this book.
One of the tenets of large-scale data processing is to ensure that the analytics are executed as close to the data layer as possible in order to avoid moving large amounts of data to where the processing is being done. HBase filters are an example of server-side processing that reduces the amount of data flowing back into the client.
HBase offers a set of constructs called coprocessors that allow for arbitrary server-side processing. Because coprocessor code runs inside the RegionServers without any sandboxing, it can also be a source of instability if proper deployment and testing procedures are not followed. Coprocessors are of two types: observers and endpoints.
Observers are similar to triggers in traditional RDBMSs. They allow some business logic to be executed before or after operations such as reads and writes are performed on a table. The following are some examples of observers and their uses:
- If we wanted to do some permission checking on whether a given user is allowed to access data for a given key, the permission checking could be implemented in a preGet().
- If we wanted to do post-processing on the value for a given key, it could be done in a postGet().
- If we wanted to update a secondary index after the base data table has been updated, it could be done in a postPut().
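The three hooks above can be modeled in plain Java to show where each one fires. This is only a sketch of the pattern and does not use the HBase API: `Observer`, `Table`, `DemoObserver`, and the in-memory maps are hypothetical stand-ins, and a real observer would be written against the coprocessor interfaces of the HBase version in use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Plain-Java model of observer hooks; none of these types are HBase classes.
class ObserverSketch {

    /** Hypothetical hook interface mirroring preGet / postGet / postPut. */
    interface Observer {
        default void preGet(String user, String key) {}
        default String postGet(String key, String value) { return value; }
        default void postPut(String key, String value) {}
    }

    /** Toy table that calls the observer before and after each operation. */
    static class Table {
        private final Map<String, String> rows = new HashMap<>();
        private final Observer observer;

        Table(Observer observer) { this.observer = observer; }

        String get(String user, String key) {
            observer.preGet(user, key);           // may reject the read
            String value = rows.get(key);
            return observer.postGet(key, value);  // may transform the result
        }

        void put(String key, String value) {
            rows.put(key, value);
            observer.postPut(key, value);         // runs after the base write
        }
    }

    /** Example observer: permission check, post-processing, secondary index. */
    static class DemoObserver implements Observer {
        private final Set<String> allowedUsers;
        final Map<String, String> valueToKeyIndex = new HashMap<>();

        DemoObserver(Set<String> allowedUsers) { this.allowedUsers = allowedUsers; }

        @Override
        public void preGet(String user, String key) {
            if (!allowedUsers.contains(user)) {
                throw new SecurityException(user + " may not read " + key);
            }
        }

        @Override
        public String postGet(String key, String value) {
            // Arbitrary post-processing chosen for illustration: upper-case the value.
            return value == null ? null : value.toUpperCase();
        }

        @Override
        public void postPut(String key, String value) {
            valueToKeyIndex.put(value, key);      // keep a value -> key index in sync
        }
    }

    public static void main(String[] args) {
        DemoObserver observer = new DemoObserver(Set.of("alice"));
        Table table = new Table(observer);
        table.put("row1", "blue");
        System.out.println(table.get("alice", "row1"));            // BLUE
        System.out.println(observer.valueToKeyIndex.get("blue"));  // row1
    }
}
```

In this model a read by a user outside the allowed set fails in `preGet` before the table is touched, which is the point of placing the permission check in a pre-operation hook.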
The other kind of coprocessor is the endpoint, which is similar to a stored procedure in an RDBMS. Endpoints are invoked via distributed RPCs and run concurrently on the RegionServers that host the table's regions. For example, if we wanted to average the values in a given column in a table, the average operation could be implemented as an endpoint coprocessor: each server computes a partial sum and count over its own data, and the client combines the partial results. In general, any distributive operation (such as map-reduce) can be expressed within an endpoint coprocessor.
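The averaging example can be sketched in plain Java to show the shape of the computation. Again, this is a model under stated assumptions rather than HBase code: each inner list stands in for the column values held by one region, `regionAverageEndpoint` plays the role of the server-side endpoint, and a parallel stream stands in for the concurrent RPCs. All names are hypothetical.

```java
import java.util.List;

// Plain-Java model of an endpoint-style aggregation; no HBase classes are used.
class EndpointSketch {

    /** Partial result returned by one region: enough to merge into a global average. */
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }
    }

    /** "Server side": runs next to one region's data and returns only a small Partial. */
    static Partial regionAverageEndpoint(List<Double> regionValues) {
        double sum = 0.0;
        for (double v : regionValues) {
            sum += v;
        }
        return new Partial(sum, regionValues.size());
    }

    /** "Client side": invoke the endpoint on every region in parallel, then merge. */
    static double average(List<List<Double>> regions) {
        Partial total = regions.parallelStream()
                .map(EndpointSketch::regionAverageEndpoint)
                .reduce(new Partial(0.0, 0L), Partial::merge);
        return total.count == 0 ? Double.NaN : total.sum / total.count;
    }

    public static void main(String[] args) {
        List<List<Double>> regions = List.of(
                List.of(1.0, 2.0, 3.0),
                List.of(10.0),
                List.of(4.0, 4.0));
        System.out.println(average(regions)); // 4.0
    }
}
```

Each region returns a sum and a count rather than its own average because averaging the per-region averages would weight a region with one row the same as a region with many. Only the small `Partial` objects cross the network in this model, which is what keeps the computation close to the data.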