The preceding figure showed that a write is stored both in memory and on disk. Periodically, the data is flushed from memory to disk:
Additionally, deletes in Cassandra are written to disk in structures known as tombstones. A tombstone is essentially a timestamped placeholder for a delete. The tombstone gets replicated out to all of the other nodes responsible for the deleted data. This way, reads for that key will return consistent results, and prevent the problems associated with ghost data.
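To make the tombstone behavior concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver). The contact point, the store keyspace, the users table, and its columns are all hypothetical examples, and the gc_grace_seconds value shown is simply the default of 10 days spelled out explicitly:

```python
# A minimal sketch using the DataStax Python driver; the keyspace "store",
# the table "users", and the contact point are hypothetical examples.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('store')

# gc_grace_seconds (864000 seconds = 10 days, the default) controls how long
# a tombstone must be retained before compaction is allowed to reclaim it.
session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id text PRIMARY KEY,
        email   text
    ) WITH gc_grace_seconds = 864000
""")

# This DELETE does not remove data in place; it writes a timestamped
# tombstone that is replicated to every node responsible for the partition.
session.execute("DELETE FROM users WHERE user_id = %s", ('jdoe',))

cluster.shutdown()
```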
Eventually, SSTable files are merged together and tombstones are reclaimed in a process called compaction. Although compaction can take a while to run, it is ultimately beneficial: it improves (mostly read) performance by reducing the number of files (and therefore the disk I/O) that need to be searched for a query. Different compaction strategies can be selected based on the use case. While compaction does impact performance as it runs, its throughput can be throttled manually so that it does not affect the node's ability to handle operations.
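Compaction strategies are configured per table. As a hedged illustration, reusing the hypothetical users table from the previous sketch, the following snippet switches the table to LeveledCompactionStrategy via CQL issued through the Python driver:

```python
# A sketch of selecting a compaction strategy per table; the keyspace and
# table names are hypothetical examples.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('store')

# LeveledCompactionStrategy tends to favor read-heavy workloads at the cost
# of extra compaction I/O; SizeTieredCompactionStrategy is the default.
session.execute("""
    ALTER TABLE users
    WITH compaction = {'class': 'LeveledCompactionStrategy'}
""")

cluster.shutdown()
```

Throttling itself happens at the node level rather than in CQL; for example, nodetool setcompactionthroughput 16 caps compaction on that node at roughly 16 MB/s.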
In a distributed database environment (especially one that spans geographic regions), it is entirely possible that write operations may occasionally fail to deliver the required number of replicas. Because of this, Cassandra comes with a tool known as repair. Cassandra anti-entropy repairs have two distinct operations:
- Merkle trees are calculated for the current node (while communicating with other nodes) to determine replicas that need to be repaired (replicas that should exist, but do not)
- Data is streamed from nodes that contain the desired replicas to fix the damaged replicas on the current node
To maintain data consistency, a repair of the primary token ranges must be run on each node within the gc_grace_seconds period (10 days by default) defined for a table. The recommended practice is to run repairs on a weekly basis.
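Repairs are normally driven by the nodetool utility rather than by application code. Purely as a sketch, the snippet below shells out to nodetool repair -pr for the hypothetical store keyspace; the -pr flag limits the repair to the node's primary token ranges, so running the same command on every node repairs each range exactly once. In practice, this would be wrapped in a scheduler such as cron.

```python
# A sketch that triggers an anti-entropy repair of this node's primary token
# ranges; the keyspace name "store" is a hypothetical example.
import subprocess

# "-pr" restricts the repair to the primary ranges owned by the local node.
subprocess.run(['nodetool', 'repair', '-pr', 'store'], check=True)
```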
Read operations in Cassandra are slightly more complex than writes. Like writes, they are served by structures that reside both on disk and in memory:

A read operation simultaneously checks structures in memory and on disk. If the requested data is found in the Memtable structures of the current node, that data is merged with the results obtained from disk.
The read path from the disk also begins in memory. First, the Bloom Filter is checked. The Bloom Filter is a probability-based structure that speeds up reads from disk by determining which SSTables are likely to contain the requested data.
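As an illustration only (not Cassandra's actual implementation), a Bloom Filter can be thought of as a bit array plus a handful of hash functions: it can answer "definitely not in this SSTable" with certainty, but only "possibly in this SSTable" otherwise.

```python
# An illustrative Bloom Filter sketch; this is not Cassandra's implementation.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive several bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

# A read can safely skip any SSTable whose filter returns False for the key.
sstable_filter = BloomFilter()
sstable_filter.add('user:jdoe')
print(sstable_filter.might_contain('user:jdoe'))     # True (possibly present)
print(sstable_filter.might_contain('user:unknown'))  # almost certainly False
```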
For each SSTable that the Bloom Filter does not rule out, the Partition Key Cache is queried next. The key cache is enabled by default, and uses a small, configurable amount of RAM.[6] If a partition key is located, the request is immediately routed to the Compression Offset.
If a partition key is not located in the Partition Key Cache, the Partition Summary is checked next. The Partition Summary contains a sampling of the partition index data, which helps narrow down the range of the Partition Index in which the desired key should be found. That range is then checked against the Partition Index, which is an on-disk structure containing all of the partition keys.
Once a seek is performed against the Partition Index, its results are passed to the Compression Offset. The Compression Offset is a map structure that stores the on-disk locations for all partitions.[6] From there, the SSTable containing the requested data is read, the data is merged with the Memtable results, and the result set is built and returned.
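From the client's perspective, all of this machinery sits behind a single query by partition key. The sketch below (again assuming the hypothetical store keyspace and users table) shows such a read issued through the Python driver:

```python
# A sketch of a read that exercises the path described above; the keyspace,
# table, and column names are hypothetical examples.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('store')

# Querying by partition key lets the coordinator route the request directly
# to the replicas that own the partition.
query = SimpleStatement(
    "SELECT user_id, email FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
row = session.execute(query, ('jdoe',)).one()

cluster.shutdown()
```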
One important takeaway from analyzing the Cassandra read path is that queries that return nothing still consume resources. Consider the possible points where data stored in Cassandra may be found and returned. Several of the structures in the read path are consulted only when the requested data is not found in the preceding structure, so a query for data that does not exist may still have to work through most of them before returning an empty result. Therefore, using Cassandra simply to check for the existence of data is not an efficient use case.