Table of Contents for
Seven NoSQL Databases in a Week


Seven NoSQL Databases in a Week by Xun Wu, published by Packt Publishing, 2018
  1. Seven NoSQL Databases in a Week
  2. Title Page
  3. Copyright and Credits
  4. Seven NoSQL Databases in a Week
  5. Dedication
  6. Packt Upsell
  7. Why subscribe?
  8. PacktPub.com
  9. Contributors
  10. About the authors
  11. Packt is searching for authors like you
  12. Table of Contents
  13. Preface
  14. Who this book is for
  15. What this book covers
  16. To get the most out of this book
  17. Download the example code files
  18. Download the color images
  19. Conventions used
  20. Get in touch
  21. Reviews
  22. Introduction to NoSQL Databases
  23. Consistency versus availability
  24. ACID guarantees
  25. Hash versus range partition
  26. In-place updates versus appends
  27. Row versus column versus column-family storage models
  28. Strongly versus loosely enforced schemas
  29. Summary
  30. MongoDB
  31. Installing MongoDB
  32. MongoDB data types
  33. The MongoDB database
  34. MongoDB collections
  35. MongoDB documents
  36. The create operation
  37. The read operation
  38. Applying filters on fields
  39. Applying conditional and logical operators on the filter parameter
  40. The update operation
  41. The delete operation
  42. Data models in MongoDB
  43. The references document data model
  44. The embedded data model
  45. Introduction to MongoDB indexing
  46. The default _id index
  47. Replication
  48. Replication in MongoDB
  49. Automatic failover in replication
  50. Read operations
  51. Sharding
  52. Sharded clusters
  53. Advantages of sharding
  54. Storing large data in MongoDB
  55. Summary
  56. Neo4j
  57. What is Neo4j?
  58. How does Neo4j work?
  59. Features of Neo4j
  60. Clustering
  61. Neo4j Browser
  62. Cache sharding
  63. Help for beginners
  64. Evaluating your use case
  65. Social networks
  66. Matchmaking
  67. Network management
  68. Analytics
  69. Recommendation engines
  70. Neo4j anti-patterns
  71. Applying relational modeling techniques in Neo4j
  72. Using Neo4j for the first time on something mission-critical
  73. Storing entities and relationships within entities
  74. Improper use of relationship types
  75. Storing binary large object data
  76. Indexing everything
  77. Neo4j hardware selection, installation, and configuration
  78. Random access memory
  79. CPU
  80. Disk
  81. Operating system
  82. Network/firewall
  83. Installation
  84. Installing JVM
  85. Configuration
  86. High-availability clustering
  87. Causal clustering
  88. Using Neo4j
  89. Neo4j Browser
  90. Cypher
  91. Python
  92. Java
  93. Taking a backup with Neo4j
  94. Backup/restore with Neo4j Enterprise
  95. Backup/restore with Neo4j Community
  96. Differences between the Neo4j Community and Enterprise Editions
  97. Tips for success
  98. Summary
  99. References 
  100. Redis
  101. Introduction to Redis
  102. What are the key features of Redis?
  103. Performance
  104. Tunable data durability
  105. Publish/Subscribe
  106. Useful data types
  107. Expiring data over time
  108. Counters
  109. Server-side Lua scripting
  110. Appropriate use cases for Redis
  111. Data fits into RAM
  112. Data durability is not a concern
  113. Data at scale
  114. Simple data model
  115. Features of Redis matching part of your use case
  116. Data modeling and application design with Redis
  117. Taking advantage of Redis' data structures
  118. Queues
  119. Sets
  120. Notifications
  121. Counters
  122. Caching
  123. Redis anti-patterns
  124. Dataset cannot fit into RAM
  125. Modeling relational data
  126. Improper connection management
  127. Security
  128. Using the KEYS command
  129. Unnecessary trips over the network
  130. Not disabling THP
  131. Redis setup, installation, and configuration
  132. Virtualization versus on-the-metal
  133. RAM
  134. CPU
  135. Disk
  136. Operating system
  137. Network/firewall
  138. Installation
  139. Configuration files
  140. Using Redis
  141. redis-cli
  142. Lua
  143. Python
  144. Java
  145. Taking a backup with Redis
  146. Restoring from a backup
  147. Tips for success
  148. Summary
  149. References
  150. Cassandra
  151. Introduction to Cassandra
  152. What problems does Cassandra solve?
  153. What are the key features of Cassandra?
  154. No single point of failure
  155. Tunable consistency
  156. Data center awareness
  157. Linear scalability
  158. Built on the JVM
  159. Appropriate use cases for Cassandra
  160. Overview of the internals
  161. Data modeling in Cassandra
  162. Partition keys
  163. Clustering keys
  164. Putting it all together
  165. Optimal use cases
  166. Cassandra anti-patterns
  167. Frequently updated data
  168. Frequently deleted data
  169. Queues or queue-like data
  170. Solutions requiring query flexibility
  171. Solutions requiring full table scans
  172. Incorrect use of BATCH statements
  173. Using Byte Ordered Partitioner
  174. Using a load balancer in front of Cassandra nodes
  175. Using a framework driver
  176. Cassandra hardware selection, installation, and configuration
  177. RAM
  178. CPU
  179. Disk
  180. Operating system
  181. Network/firewall
  182. Installation using apt-get
  183. Tarball installation
  184. JVM installation
  185. Node configuration
  186. Running Cassandra
  187. Adding a new node to the cluster
  188. Using Cassandra
  189. Nodetool
  190. CQLSH
  191. Python
  192. Java
  193. Taking a backup with Cassandra
  194. Restoring from a snapshot
  195. Tips for success
  196. Run Cassandra on Linux
  197. Open ports 7199, 7000, 7001, and 9042
  198. Enable security
  199. Use solid state drives (SSDs) if possible
  200. Configure only one or two seed nodes per data center
  201. Schedule weekly repairs
  202. Do not force a major compaction
  203. Remember that every mutation is a write
  204. The data model is key
  205. Consider a support contract
  206. Cassandra is not a general purpose database
  207. Summary
  208. References
  209. HBase
  210. Architecture
  211. Components in the HBase stack
  212. Zookeeper
  213. HDFS
  214. HBase master
  215. HBase RegionServers
  216. Reads and writes
  217. The HBase write path
  218. HBase writes – design motivation
  219. The HBase read path
  220. HBase compactions
  221. System trade-offs
  222. Logical and physical data models
  223. Interacting with HBase – the HBase shell
  224. Interacting with HBase – the HBase Client API
  225. Interacting with secure HBase clusters
  226. Advanced topics
  227. HBase high availability
  228. Replicated reads
  229. HBase in multiple regions
  230. HBase coprocessors
  231. SQL over HBase
  232. Summary
  233. DynamoDB
  234. The difference between SQL and DynamoDB
  235. Setting up DynamoDB
  236. Setting up locally
  237. Setting up using AWS
  238. The difference between downloadable DynamoDB and DynamoDB web services
  239. DynamoDB data types and terminology
  240. Tables, items, and attributes
  241. Primary key
  242. Secondary indexes
  243. Streams
  244. Queries
  245. Scan
  246. Data types
  247. Data models and CRUD operations in DynamoDB
  248. Limitations of DynamoDB
  249. Best practices
  250. Summary
  251. InfluxDB
  252. Introduction to InfluxDB
  253. Key concepts and terms of InfluxDB
  254. Data model and storage engine
  255. Storage engine
  256. Installation and configuration
  257. Installing InfluxDB
  258. Configuring InfluxDB
  259. Production deployment considerations
  260. Query language and API
  261. Query language
  262. Query pagination
  263. Query performance optimizations
  264. Interaction via Rest API
  265. InfluxDB API client
  266. InfluxDB with Java client
  267. InfluxDB with a Python client
  268. InfluxDB with Go client
  269. InfluxDB ecosystem
  270. Telegraf
  271. Telegraf data management
  272. Kapacitor
  273. InfluxDB operations
  274. Backup and restore
  275. Backups
  276. Restore
  277. Clustering and HA
  278. Retention policy
  279. Monitoring
  280. Summary
  281. Other Books You May Enjoy
  282. Leave a review - let other readers know what you think

Interacting with HBase – the HBase shell

The best way to get started with understanding HBase is through the HBase shell.

Before we do that, we first need to install HBase. An easy way to get started is to use the Hortonworks sandbox, which you can download for free from https://hortonworks.com/products/sandbox/. The sandbox can be installed on Linux, macOS, and Windows. Follow the instructions there to get it set up.

On any cluster where the HBase client or server is installed, type hbase shell to get a prompt into HBase:

hbase(main):004:0> version
1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016

This tells you the version of HBase running on the cluster. In this instance, the HBase version is 1.1.2, as packaged by a particular Hadoop distribution, in this case HDP 2.3.6. The help command shows what the shell can do:

hbase(main):001:0> help
HBase Shell, version 1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.

This provides the set of operations that are possible through the HBase shell, which includes DDL, DML, and admin operations.
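
Before creating anything, it can be helpful to scope help to a single command, or to list the tables that already exist in the cluster. A quick sketch (both are standard shell commands; the prompt counters are illustrative and output is omitted here):

hbase(main):002:0> help "create"
hbase(main):003:0> list

With that orientation, let's create our first table: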

hbase(main):001:0> create 'sensor_telemetry', 'metrics'
0 row(s) in 1.7250 seconds
=> Hbase::Table - sensor_telemetry

This creates a table called sensor_telemetry, with a single column family called metrics. As we discussed before, HBase doesn't require column names to be defined in the table schema (and in fact, has no provision for you to be able to do so):

hbase(main):001:0> describe 'sensor_telemetry'
Table sensor_telemetry is ENABLED
sensor_telemetry
COLUMN FAMILIES DESCRIPTION
{NAME => 'metrics', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false',
KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING
=> 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0',
BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE =>'0'}
1 row(s) in 0.5030 seconds

This describes the structure of the sensor_telemetry table. The command output indicates that there's a single column family present called metrics, with various attributes defined on it.

BLOOMFILTER indicates the type of bloom filter defined for the column family. A ROW bloom filter probes for the presence or absence of a given row key, while a ROWCOL bloom filter probes for the presence or absence of a given row key and column qualifier combination. You can also set BLOOMFILTER to NONE.
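
If your reads mostly probe specific row and column combinations, a ROWCOL bloom filter may be a better fit; a hedged sketch of switching the family over with alter (we keep the default ROW setting for the rest of this walkthrough):

hbase(main):005:0> alter 'sensor_telemetry', {NAME => 'metrics', BLOOMFILTER => 'ROWCOL'}

Whether this pays off depends on your access pattern: ROWCOL filters are larger, and they mainly help gets that name explicit columns.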

The BLOCKSIZE configures the minimum granularity of an HBase read. By default, the block size is 64 KB; if your average cell is much smaller than that and there is little locality of reference, you can lower the block size so that reads don't incur more I/O than necessary and, more importantly, so that the block cache isn't wasted on data that isn't needed.

VERSIONS refers to the maximum number of cell versions that are to be kept around:

hbase(main):004:0> alter 'sensor_telemetry', {NAME => 'metrics', BLOCKSIZE => '16384', COMPRESSION => 'SNAPPY'}
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.9660 seconds

Here, we are altering the table and column family definition to change the BLOCKSIZE to be 16 K and the COMPRESSION codec to be SNAPPY:

hbase(main):005:0> describe 'sensor_telemetry'
Table sensor_telemetry is ENABLED
sensor_telemetry
COLUMN FAMILIES DESCRIPTION
{NAME => 'metrics', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false',
KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING
=> 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0',
BLOCKCACHE => 'true', BLOCKSIZE => '16384', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0410 seconds

This is what the table definition now looks like after our ALTER table statement. Next, let's scan the table to see what it contains:

hbase(main):007:0> scan 'sensor_telemetry'
ROW COLUMN+CELL
0 row(s) in 0.0750 seconds

No surprises, the table is empty. So, let's populate some data into the table:

hbase(main):007:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'temperature', '65'
ERROR: Unknown column family! Valid column names: metrics:*

Here, we are attempting to insert data into the sensor_telemetry table. We are attempting to store the value '65' for the column qualifier 'temperature' for a row key '/94555/20170308/18:30'. This is unsuccessful because the column 'temperature' is not associated with any column family.

In HBase, you always need the row key, the column family and the column qualifier to uniquely specify a value. So, let's try this again:

hbase(main):008:0> put 'sensor_telemetry', '/94555/20170308/18:30',
'metrics:temperature', '65'
0 row(s) in 0.0120 seconds

Ok, that seemed to be successful. Let's confirm that we now have some data in the table:

hbase(main):009:0> count 'sensor_telemetry'
1 row(s) in 0.0620 seconds
=> 1

Ok, it looks like we are on the right track. Let's scan the table to see what it contains:

hbase(main):010:0> scan 'sensor_telemetry'
ROW COLUMN+CELL
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810397402,value=65
1 row(s) in 0.0190 seconds

This tells us we've got data for a single row and a single column. The insert time, recorded as the cell timestamp, is an epoch in milliseconds: 1501810397402.

In addition to a scan operation, which scans through all of the rows in the table, HBase also provides a get operation, where you can retrieve data for one or more rows, if you know the keys:

hbase(main):011:0> get 'sensor_telemetry', '/94555/20170308/18:30'
COLUMN CELL
metrics:temperature timestamp=1501810397402, value=65

OK, that returns the row as expected. Next, let's look at the effect of cell versions. As we've discussed before, a value in HBase is uniquely addressed by the combination of row key, column family, column qualifier, and timestamp.
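
As an aside, the shell's put command accepts an optional explicit timestamp as its final argument, which lets you set that version coordinate yourself instead of taking the server's current time. This is a sketch only, not executed as part of this walkthrough (the value and timestamp are illustrative):

hbase(main):011:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'metrics:temperature', '64', 1501800000000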

To understand this, let's insert the value '66', for the same row key and column qualifier as before:

hbase(main):012:0> put 'sensor_telemetry', '/94555/20170308/18:30',
'metrics:temperature', '66'
0 row(s) in 0.0080 seconds

Now let's read the value for the row key back:

hbase(main):013:0> get 'sensor_telemetry', '/94555/20170308/18:30'
COLUMN CELL
metrics:temperature timestamp=1501810496459,
value=66
1 row(s) in 0.0130 seconds

This is in line with the standard behavior we'd expect from any database. A put in HBase is the equivalent of an upsert in an RDBMS: it inserts a value if it doesn't already exist and updates it if a prior value exists.
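
To round out the basic operations: the shell also provides delete for removing a single cell and deleteall for removing an entire row. A sketch only; we don't run these here, since they would remove the data we just inserted:

hbase(main):014:0> delete 'sensor_telemetry', '/94555/20170308/18:30', 'metrics:temperature'
hbase(main):015:0> deleteall 'sensor_telemetry', '/94555/20170308/18:30'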

Now, this is where things get interesting. The get operation in HBase allows us to retrieve data associated with a particular timestamp:

hbase(main):015:0> get 'sensor_telemetry', '/94555/20170308/18:30', {COLUMN =>
'metrics:temperature', TIMESTAMP => 1501810397402}
COLUMN CELL
metrics:temperature timestamp=1501810397402,value=65
1 row(s) in 0.0120 seconds

We are able to retrieve the old value of 65 by providing the right timestamp. So, puts in HBase don't overwrite the old value; they merely hide it, and we can retrieve older values by supplying their timestamps, for as long as those cell versions are retained.
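
If we want several versions at once rather than one specific timestamp, get also accepts a VERSIONS option. A hedged sketch (with VERSIONS => '1' on the metrics family, older cells are only retained until the next flush or major compaction, so raise the family's VERSIONS setting before relying on this):

hbase(main):016:0> get 'sensor_telemetry', '/94555/20170308/18:30', {COLUMN => 'metrics:temperature', VERSIONS => 3}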

Now, let's insert more data into the table:

hbase(main):028:0> put 'sensor_telemetry', '/94555/20170307/18:30',
'metrics:temperature', '43'
0 row(s) in 0.0080 seconds
hbase(main):029:0> put 'sensor_telemetry', '/94555/20170306/18:30',
'metrics:temperature', '33'
0 row(s) in 0.0070 seconds

Now, let's scan the table back:

hbase(main):030:0> scan 'sensor_telemetry' 
ROW COLUMN+CELL
/94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941,value=67
3 row(s) in 0.0310 seconds

Note that the latest value for /94555/20170308/18:30 is now 67, reflecting one more put against that key that isn't shown above. We can also scan the table in reverse key order:

hbase(main):031:0> scan 'sensor_telemetry', {REVERSED => true} 
ROW COLUMN+CELL
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43
/94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956,value=33
3 row(s) in 0.0520 seconds

What if we wanted all the rows, but in addition, wanted all the cell versions from each row? We can easily retrieve that:

hbase(main):032:0> scan 'sensor_telemetry', {RAW => true, VERSIONS => 10} 
ROW COLUMN+CELL
/94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810496459, value=66
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810397402, value=65

Here, we are retrieving all three values of the row key /94555/20170308/18:30 in the scan result set.
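
Scans can also be narrowed to specific columns and capped at a number of rows. A hedged sketch (with only one column in this table, COLUMNS is purely illustrative here):

hbase(main):033:0> scan 'sensor_telemetry', {COLUMNS => ['metrics:temperature'], LIMIT => 2}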

HBase scan operations don't need to go from the beginning to the end of the table; you can optionally specify the row to start scanning from and the row to stop the scan operation at:

hbase(main):034:0> scan 'sensor_telemetry', {STARTROW => '/94555/20170307'} 
ROW COLUMN+CELL
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67
2 row(s) in 0.0550 seconds
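
The stop row can be supplied in the same way. A hedged sketch that bounds the scan to keys at or after /94555/20170307 and strictly before /94555/20170308 (the stop row is exclusive):

hbase(main):035:0> scan 'sensor_telemetry', {STARTROW => '/94555/20170307', STOPROW => '/94555/20170308'}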

HBase also provides the ability to supply filters to the scan operation to restrict what rows are returned by the scan operation. It's possible to implement your own filters, but there's rarely a need to. There's a large collection of filters that are already implemented:

hbase(main):033:0> scan 'sensor_telemetry', {ROWPREFIXFILTER => '/94555/20170307'} 
ROW COLUMN+CELL
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43
1 row(s) in 0.0300 seconds

This returns all the rows whose keys have the prefix /94555/20170307. Filters can also match on cell values rather than keys; for example:

hbase(main):033:0> scan 'sensor_telemetry', { FILTER =>     
   SingleColumnValueFilter.new( 
         Bytes.toBytes('metrics'),       
         Bytes.toBytes('temperature'),  
         CompareFilter::CompareOp.valueOf('EQUAL'), 
         BinaryComparator.new(Bytes.toBytes('66')))} 

The SingleColumnValueFilter can be used to scan a table and look for all rows with a given column value; here, it returns the rows where metrics:temperature equals '66'.
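
Note that the Ruby object form above generally requires the filter, comparator, and Bytes classes to be imported into the shell session first. An alternative that avoids imports is the filter-language string form; a hedged, roughly equivalent sketch that matches rows whose metrics:temperature cell equals '66':

hbase(main):034:0> scan 'sensor_telemetry', {FILTER => "SingleColumnValueFilter('metrics', 'temperature', =, 'binary:66')"}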