Query language

InfluxDB provides an SQL-like query language, InfluxQL, for querying time-series data. It also exposes an HTTP API for writes, queries, and administrative tasks.
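For example, you can run a query over the HTTP API directly with curl; this is a minimal sketch, assuming a local instance listening on the default port 8086:

$ curl -G 'http://localhost:8086/query' --data-urlencode "q=SHOW DATABASES"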

Let's use the InfluxDB CLI tool to connect to an InfluxDB instance and run some queries.

Start and connect to the InfluxDB instance by typing the following commands:

$ sudo service influxdb start
$ influx -precision rfc3339

By default, InfluxDB displays timestamps as nanosecond-precision Unix epoch values, which are very long numbers such as 1511815800000000000. The -precision rfc3339 argument displays the time field in a human-readable format instead: YYYY-MM-DDTHH:MM:SS.nnnnnnnnnZ:

Connected to http://localhost:8086 version 1.5
InfluxDB shell version: 1.5
>
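If you are already inside the shell, you can also switch the timestamp display format with the precision command:

> precision rfc3339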

We can list the available databases with the show databases command:

> show databases;
name: databases
name
----
_internal

To switch to an existing database, type the following command:

use _internal

This command sets your current database to _internal, which means all subsequent commands you issue in the CLI will be executed against the _internal database until you switch to another one.

So far, we haven't created any database of our own; at this stage, the only database listed is _internal, which InfluxDB creates for its own use. If you want to view all of the available measurements for the current database, you can use the following command:

show measurements

Let's quickly browse the measurements under the _internal database.

> use _internal
Using database _internal
> show measurements
name: measurements
name
----
cq
database
httpd
queryExecutor
runtime
shard
subscriber
tsm1_cache
tsm1_engine
tsm1_filestore
tsm1_wal
write

We can see that the _internal database contains a lot of system-level statistics, which are useful for monitoring and analyzing the InfluxDB instance itself.

For example, if you want to check heap usage over time, you can run an SQL-like SELECT statement against the runtime measurement:

> SELECT HeapAlloc, HeapIdle, HeapInUse, NumGC, Mallocs, TotalAlloc from runtime limit 3;

It's time to create our first InfluxDB database! In our sample project, we will use intraday stock quote data to set up our lab. The sample data contains market data from November 27, 2017 to November 29, 2017 for three companies: Apple Inc, JPMorgan Chase & Co., and Alphabet Inc, sampled at 60-second intervals.

Here is the example data format:

tickers,ticker=JPM company="JPMorgan Chase & Co",close=98.68,high=98.7,low=98.68,open=98.7,volume=4358 1511798040
tickers,ticker=JPM company="JPMorgan Chase & Co",close=98.71,high=98.7,low=98.68,open=98.68,volume=1064 1511798100

In the data shown here, the measurement name and the tag set are separated by a comma, while the tag set, the field set, and the timestamp are separated by spaces: measurement,tag_set... field_set... timestamp. We define ticker as a tag and company, close, high, low, open, and volume as fields. The last value is the UTC timestamp in epoch seconds, and the measurement name is tickers.

Download the sample data to the Unix box:

/tmp/data/ticker_data.txt
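For influx -import to process this file, it is expected to follow the standard import format, with a DDL section containing database statements and a DML section containing the line-protocol points. A sketch of what ticker_data.txt might look like:

# DDL
CREATE DATABASE market

# DML
# CONTEXT-DATABASE: market
tickers,ticker=JPM company="JPMorgan Chase & Co",close=98.68,high=98.7,low=98.68,open=98.7,volume=4358 1511798040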

Then, load the data into the market database:

/tmp/data$ influx -import -path=/tmp/data/ticker_data.txt -database=market -precision=s
2017/11/30 12:13:17 Processed 1 commands
2017/11/30 12:13:17 Processed 3507 inserts
2017/11/30 12:13:17 Failed 0 inserts

In the InfluxDB CLI, switch to the market database and verify that the new tickers measurement has been created:

> use market
Using database market
> show measurements;
name: measurements
name
----
tickers
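Before writing queries, it can help to confirm which identifiers are tags and which are fields, since that affects quoting and the ::tag/::field syntax used below:

> show tag keys from tickers
> show field keys from tickers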

The SELECT statement is used to retrieve data points from one or more measurements. The InfluxDB query syntax is as follows:

SELECT <field_key>[,<field_key>,<tag_key>] FROM <measurement_name>[,<measurement_name>] [WHERE <conditional_expression> [(AND|OR) <conditional_expression> [...]]]

The following are the arguments for SELECT:

SELECT: Fetches data points from one or more measurements
*: Selects all tag keys, field keys, and the time
<field_key>: Returns a specific field
<tag_key>: Returns a specific tag
SELECT "<field_key>"::field,"<tag_key>"::tag: Returns a specific field and tag; the ::[field | tag] syntax specifies the identifier's type
FROM <measurement_name>,<measurement_name>: Returns data from one or more measurements
FROM <database_name>.<retention_policy_name>.<measurement_name>: Returns data from a fully qualified measurement
FROM <database_name>..<measurement_name>: Returns data from a measurement in the specified database with the default retention policy (RP)

Let's look at some examples for a better understanding.

The query selects all of the fields, tags, and the time from the tickers measurement:

> select * from tickers;
name: tickers
time close company high low open ticker volume
---- ----- ------- ---- --- ---- ------ ------
2017-11-27T05:00:00Z 174.05 APPLE INC 174.105 174.05 174.06 AAPL 34908
2017-11-27T05:00:00Z 98.42 JPMorgan Chase & Co 98.44 98.4 98.4 JPM 7904
2017-11-27T05:00:00Z 1065.06 Alphabet Inc 1065.06 1065.06 1065.06 GOOGL 100
[....]

The query selects specific field keys and a tag key from the tickers measurement:

> select ticker, company, close, volume from tickers;
name: tickers
time ticker company close volume
---- ------ ------- ----- ------
2017-11-27T05:00:00Z AAPL APPLE INC 174.05 34908
2017-11-27T05:00:00Z JPM JPMorgan Chase & Co 98.42 7904
2017-11-27T05:00:00Z GOOGL Alphabet Inc 1065.06 100
2017-11-27T05:01:00Z JPM JPMorgan Chase & Co 98.365 20155
[...]

The query selects field keys and a tag key, specifying each identifier's type, from the tickers measurement:

> SELECT ticker::tag,company::field,volume::field,close::field FROM tickers limit 3;
name: tickers
time ticker company volume close
---- ------ ------- ------ -----
2017-11-27T05:00:00Z AAPL APPLE INC 34908 174.05
2017-11-27T05:00:00Z JPM JPMorgan Chase & Co 7904 98.42
2017-11-27T05:00:00Z GOOGL Alphabet Inc 100 1065.06

The query selects all of the fields, based on the identifier type, from the tickers measurement:

> SELECT *::field FROM tickers limit 3;
name: tickers
time close company high low open volume
---- ----- ------- ---- --- ---- ------
2017-11-27T05:00:00Z 174.05 APPLE INC 174.105 174.05 174.06 34908
2017-11-27T05:00:00Z 98.42 JPMorgan Chase & Co 98.44 98.4 98.4 7904
2017-11-27T05:00:00Z 1065.06 Alphabet Inc 1065.06 1065.06 1065.06 100

The query selects from the fully qualified tickers measurement:

> SELECT ticker, company, volume, close FROM market.autogen.tickers limit 3;
name: tickers
time ticker company volume close
---- ------ ------- ------ -----
2017-11-27T05:00:00Z AAPL APPLE INC 34908 174.05
2017-11-27T05:00:00Z JPM JPMorgan Chase & Co 7904 98.42
2017-11-27T05:00:00Z GOOGL Alphabet Inc 100 1065.06

The following are the supported operators and arguments for WHERE:

=: equal to
<>: not equal to
!=: not equal to
>: greater than
>=: greater than or equal to
<: less than
<=: less than or equal to

Tags:

tag_key <operator> ['tag_value']

Tag values must be single-quoted, since tag values are strings.

Timestamps:

By default, the time range queried is between 1677-09-21T00:12:43.145224194Z and 2262-04-11T23:47:16.854775806Z UTC.

Note that using any of the comparison operators (=, !=, <>, >, >=, <, <=) in the SELECT clause, rather than in the WHERE clause, yields empty results for all types.
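For example, a query like the following, with the comparison moved into the SELECT clause, should return no rows:

> select high > 174.6 from tickers limit 3;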

Let's look at some examples of the WHERE clause for a better understanding.

The query returns all points where the ticker is AAPL and high is greater than 174.6:

> SELECT * FROM tickers WHERE ticker='AAPL' AND high> 174.6 limit 3;
name: tickers
time close company high low open ticker volume
---- ----- ------- ---- --- ---- ------ ------
2017-11-27T14:30:00Z 175.05 APPLE INC 175.06 174.95 175.05 AAPL 261381
2017-11-27T14:31:00Z 174.6 APPLE INC 175.05 174.6 175 AAPL 136420
2017-11-27T14:32:00Z 174.76 APPLE INC 174.77 174.56 174.6003 AAPL 117050

The query performs basic arithmetic to find the points where the ticker is JPM and the high price exceeds the low price by more than 0.2%:

> SELECT * FROM tickers WHERE ticker='JPM' AND high> low* 1.002 limit 5;
name: tickers
time close company high low open ticker volume
---- ----- ------- ---- --- ---- ------ ------
2017-11-27T14:31:00Z 98.14 JPMorgan Chase & Co 98.31 98.09 98.31 JPM 136485
2017-11-27T14:32:00Z 98.38 JPMorgan Chase & Co 98.38 98.11 98.12 JPM 27837
2017-11-27T14:33:00Z 98.59 JPMorgan Chase & Co 98.59 98.34 98.38 JPM 47042
2017-11-27T14:41:00Z 98.89 JPMorgan Chase & Co 98.9 98.68 98.7 JPM 51245
2017-11-29T05:05:00Z 103.24 JPMorgan Chase & Co 103.44 103.22 103.44 JPM 73835

InfluxDB supports GROUP BY. The GROUP BY clause is often used with aggregate functions (such as COUNT(), MEAN(), MAX(), MIN(), and SUM()) to group the result set by one or more tag keys.

The following is the syntax:

SELECT_clause FROM_clause [WHERE_clause] GROUP BY [* | <tag_key>[,<tag_key>]]

The following are the arguments for GROUP BY <tag_key>:

GROUP BY *: Groups by all tags
GROUP BY <tag_key>: Groups by a specific tag
GROUP BY <tag_key>,<tag_key>: Groups by multiple specific tags

Supported aggregation functions: COUNT(), DISTINCT(), INTEGRAL(), MEAN(), MEDIAN(), MODE(), SPREAD(), STDDEV(), SUM()

Supported selector functions: BOTTOM(), FIRST(), LAST(), MAX(), MIN(), PERCENTILE(), SAMPLE(), TOP()

Consider the following example:

The query finds each ticker's maximum, minimum, and median volume between 10:00 and 19:00 UTC on November 27, 2017:

> select MAX(volume), MIN(volume), MEDIAN(volume) from tickers where time <= '2017-11-27T19:00:00.000000000Z' AND time >= '2017-11-27T10:00:00.000000000Z' group by ticker;
name: tickers
tags: ticker=AAPL
time max min median
---- --- --- ------
2017-11-27T10:00:00Z 462796 5334 32948
name: tickers
tags: ticker=GOOGL
time max min median
---- --- --- ------
2017-11-27T10:00:00Z 22062 200 1950
name: tickers
tags: ticker=JPM
time max min median
---- --- --- ------
2017-11-27T10:00:00Z 136485 2400 14533

By changing the query to GROUP BY *, the result is grouped by all tags; since ticker is the only tag in this measurement, the output is the same:

> select MAX(volume), MIN(volume), MEDIAN(volume) from tickers where time <= '2017-11-27T19:00:00.000000000Z' AND time >= '2017-11-27T10:00:00.000000000Z' group by *;
name: tickers
tags: ticker=AAPL
time max min median
---- --- --- ------
2017-11-27T10:00:00Z 462796 5334 32948
name: tickers
tags: ticker=GOOGL
time max min median
---- --- --- ------
2017-11-27T10:00:00Z 22062 200 1950
name: tickers
tags: ticker=JPM
time max min median
---- --- --- ------
2017-11-27T10:00:00Z 136485 2400 14533

InfluxDB also supports grouping by time intervals. The following is the syntax:

SELECT <function>(<field_key>) FROM_clause WHERE <time_range> GROUP BY time(<time_interval>),[tag_key] [fill(<fill_option>)]

time_range: The range of the timestamp field, for example, time > <past_time> AND time < <future_time>
time_interval: A duration literal; query results are grouped into buckets of this duration
fill(<fill_option>): Optional; specifies the value reported for time intervals that contain no data

Let's look at some examples for a better understanding.

Find the mean trade volume between 10:00 and 19:00 UTC on November 27, 2017, grouping the results into 60-minute intervals and by the ticker tag key:

> select mean(volume) from tickers where time <= '2017-11-27T19:00:00.000000000Z' AND time >= '2017-11-27T10:00:00.000000000Z' group by time(60m), ticker;
name: tickers
tags: ticker=AAPL
time mean
---- ----
2017-11-27T10:00:00Z
2017-11-27T11:00:00Z
2017-11-27T12:00:00Z
2017-11-27T13:00:00Z
2017-11-27T14:00:00Z 139443.46666666667
2017-11-27T15:00:00Z 38968.916666666664
2017-11-27T16:00:00Z 44737.816666666666
2017-11-27T17:00:00Z
2017-11-27T18:00:00Z 16202.15
2017-11-27T19:00:00Z 12651
name: tickers
tags: ticker=GOOGL
time mean
---- ----
2017-11-27T10:00:00Z
2017-11-27T11:00:00Z
2017-11-27T12:00:00Z
2017-11-27T13:00:00Z
2017-11-27T14:00:00Z 6159.033333333334
2017-11-27T15:00:00Z 2486.75
2017-11-27T16:00:00Z 2855.5
2017-11-27T17:00:00Z
2017-11-27T18:00:00Z 1954.1166666666666
2017-11-27T19:00:00Z 300
name: tickers
tags: ticker=JPM
time mean
---- ----
2017-11-27T10:00:00Z
2017-11-27T11:00:00Z
2017-11-27T12:00:00Z
2017-11-27T13:00:00Z
2017-11-27T14:00:00Z 35901.8275862069
2017-11-27T15:00:00Z 17653.45
2017-11-27T16:00:00Z 14139.25
2017-11-27T17:00:00Z
2017-11-27T18:00:00Z 14419.416666666666
2017-11-27T19:00:00Z 4658
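The empty intervals in the preceding output contain no data points. You can report a default value for them with the optional fill() argument; for example, fill(0) makes those intervals show a mean of 0:

> select mean(volume) from tickers where time <= '2017-11-27T19:00:00.000000000Z' AND time >= '2017-11-27T10:00:00.000000000Z' group by time(60m), ticker fill(0);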

InfluxDB supports the INTO clause, which copies query results from one measurement into another, including a measurement in a different database.

The following is the syntax:

SELECT_clause INTO <measurement_name> FROM_clause [WHERE_clause] [GROUP_BY_clause]

INTO <measurement_name>: Copies data into the specified measurement
INTO <database_name>.<retention_policy_name>.<measurement_name>: Copies data into a fully qualified measurement
INTO <database_name>..<measurement_name>: Copies data into a measurement in the specified database with the default RP
INTO <database_name>.<retention_policy_name>.:MEASUREMENT FROM /<regular_expression>/: Copies data into the specified database and RP, with destination measurement names taken from the measurements matched by the regular expression in the FROM clause

Let's take a look at some examples for a better understanding.
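Note that the destination database for an INTO query must already exist; if it does not, create it first. The following example writes into a market_watch database:

> CREATE DATABASE market_watch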

Copy the 60-minute aggregate means of all of the field values into a new avgvol measurement in the market_watch database:

> use market
Using database market
> SELECT MEAN(*) INTO market_watch.autogen.avgvol FROM /.*/ WHERE time >= '2017-11-27T00:00:00.000000000Z' AND time <= '2017-11-27T23:59:59.000000000Z' GROUP BY time(60m);
name: result
time written
---- -------
1970-01-01T00:00:00Z 8
> use market_watch
Using database market_watch
> show measurements;
name: measurements
name
----
avgvol
tickers
> select * from avgvol
name: avgvol
time mean_close mean_high mean_low mean_open mean_volume
---- ---------- --------- -------- --------- -----------
2017-11-27T05:00:00Z 435.3618644067796 435.43688926553676 435.27535762711864 435.33908870056507 16300.42372881356
2017-11-27T14:00:00Z 448.45324269662916 448.67904269662927 448.16110786516856 448.42909325842686 60777.84269662921
2017-11-27T15:00:00Z 446.46472277777775 446.5794283333334 446.3297522222223 446.4371933333333 19703.03888888889
2017-11-27T16:00:00Z 445.9267622222222 446.0731233333333 445.7768644444446 445.9326772222222 20577.522222222222
2017-11-27T18:00:00Z 447.22421222222226 447.29334333333327 447.1268016666667 447.1964166666667 10858.56111111111
2017-11-27T19:00:00Z 440.0175634831461 440.1042516853933 439.9326162921348 440.02026348314615 11902.247191011236
2017-11-27T20:00:00Z 447.3597166666667 447.43757277777786 447.2645472222222 447.3431649999999 19842.32222222222
2017-11-27T21:00:00Z 448.01033333333334 448.43833333333333 448.00333333333333 448.29333333333335 764348.3333333334