Let's say that after taking a snapshot, one of our tables ends up with corrupted data for a particular user. The application team fixes the problem on their end, and asks us to restore the logins_by_user table to the last snapshot.
First of all, let's take a look at the data in question:
cassdba@cqlsh> use packt ;
cassdba@cqlsh:packt> SELECT * FROM logins_by_user WHERE user_id='avery' LIMIT 1;

 user_id | login_datetime                  | origin_ip
---------+---------------------------------+-----------
   avery | 1970-01-01 19:48:33.945000+0000 | 10.0.15.2

(1 rows)
Obviously, this new user did not log in on January 1, 1970, so this row is the corrupted data. To ensure that we are starting from a clean slate, truncate the table:
cassdba@cqlsh:packt> truncate table packt.logins_by_user;
Assuming the data for our keyspace is in /var/lib/cassandra/data/packt, let's take a look at it:
cd /var/lib/cassandra/data/packt
ls -al
total 20
drwxrwxr-x 5 aploetz aploetz 4096 Jul 18 09:23 .
drwxr-xr-x 18 aploetz aploetz 4096 Jun 10 09:06 ..
drwxrwxr-x 3 aploetz aploetz 4096 Jul 18 14:05 astronauts-b27b5a406bc411e7b609c123c0f29bf4
drwxrwxr-x 3 aploetz aploetz 4096 Jul 18 14:05 astronauts_by_group-b2c163f06bc411e7b609c123c0f29bf4
drwxrwxr-x 4 aploetz aploetz 4096 Sep 9 14:51 logins_by_user-fdd9fa204de511e7a2e6f3d179351473
We see a directory for the logins_by_user table. Once a snapshot has been taken, each table directory should also have a "snapshots" directory, so let's cd into that and list it out:
cd logins_by_user-fdd9fa204de511e7a2e6f3d179351473/snapshots
ls -al
total 16
drwxrwxr-x 4 aploetz aploetz 4096 Sep 9 14:58 .
drwxrwxr-x 4 aploetz aploetz 4096 Sep 9 14:58 ..
drwxrwxr-x 2 aploetz aploetz 4096 Sep 9 14:55 1504986577085
drwxrwxr-x 2 aploetz aploetz 4096 Sep 9 14:58 truncated-1504987099599-logins_by_user
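Recalling the output from our earlier nodetool snapshot command, 1504986577085 was the name of the snapshot taken. The truncated-1504987099599-logins_by_user directory is a snapshot that Cassandra took automatically when we truncated the table (the default behavior when auto_snapshot is enabled). Enter the 1504986577085 directory, and list it out:
cd 1504986577085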
ls -al
total 52
drwxrwxr-x 2 aploetz aploetz 4096 Sep 9 14:49 .
drwxrwxr-x 3 aploetz aploetz 4096 Sep 9 14:49 ..
-rw-rw-r-- 1 aploetz aploetz 31 Sep 9 14:49 manifest.json
-rw-rw-r-- 2 aploetz aploetz 43 Jun 10 10:53 mc-1-big-CompressionInfo.db
-rw-rw-r-- 2 aploetz aploetz 264 Jun 10 10:53 mc-1-big-Data.db
-rw-rw-r-- 2 aploetz aploetz 9 Jun 10 10:53 mc-1-big-Digest.crc32
-rw-rw-r-- 2 aploetz aploetz 16 Jun 10 10:53 mc-1-big-Filter.db
-rw-rw-r-- 2 aploetz aploetz 11 Jun 10 10:53 mc-1-big-Index.db
-rw-rw-r-- 2 aploetz aploetz 4722 Jun 10 10:53 mc-1-big-Statistics.db
-rw-rw-r-- 2 aploetz aploetz 65 Jun 10 10:53 mc-1-big-Summary.db
-rw-rw-r-- 2 aploetz aploetz 92 Jun 10 10:53 mc-1-big-TOC.txt
-rw-rw-r-- 1 aploetz aploetz 947 Sep 9 14:49 schema.cql
All of these files need to be copied into the logins_by_user-fdd9fa204de511e7a2e6f3d179351473 directory. As we have navigated our way down to the directory containing the snapshot files, we can do this with a simple command:
cp * ../../
This copies all files from the current directory into the directory two levels up, which is /var/lib/cassandra/data/packt/logins_by_user-fdd9fa204de511e7a2e6f3d179351473. Now, we will bounce (stop/restart) our node. Go back into cqlsh, and rerun the prior query:
cassdba@cqlsh> use packt ;
cassdba@cqlsh:packt> SELECT * FROM logins_by_user WHERE user_id='avery' LIMIT 1;

 user_id | login_datetime                  | origin_ip
---------+---------------------------------+-----------
   avery | 2017-09-09 19:48:33.945000+0000 | 10.0.15.2

(1 rows)
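The manual copy step in this walkthrough could be wrapped in a small shell function so that it does not depend on the current working directory. The following is only a sketch: the function name restore_snapshot and its three arguments are hypothetical, and it assumes the directory layout shown earlier (a table directory named after the table plus an ID, containing a snapshots subdirectory).

```shell
# Hypothetical helper (not part of Cassandra): copy one snapshot's files
# back into its table directory.
#   $1  keyspace data directory, e.g. /var/lib/cassandra/data/packt
#   $2  table name, e.g. logins_by_user
#   $3  snapshot name, e.g. 1504986577085
restore_snapshot() {
  local keyspace_dir
  local table
  local snapshot
  local candidate
  local table_dir
  keyspace_dir="$1"
  table="$2"
  snapshot="$3"
  table_dir=""

  # Table directories are named <table>-<id>. Pick the one that actually
  # contains the requested snapshot.
  for candidate in "$keyspace_dir/$table"-*; do
    if [ -d "$candidate/snapshots/$snapshot" ]; then
      table_dir="$candidate"
      break
    fi
  done

  if [ -z "$table_dir" ]; then
    echo "snapshot '$snapshot' not found for table '$table'" >&2
    return 1
  fi

  # Same effect as running `cp * ../../` from inside the snapshot directory;
  # -p preserves file modes and timestamps where possible.
  cp -p "$table_dir/snapshots/$snapshot"/* "$table_dir"/

  # Print the directory that received the files.
  echo "$table_dir"
}

# Example call (commented out because it needs a real data directory):
# restore_snapshot /var/lib/cassandra/data/packt logins_by_user 1504986577085
```

After the files are in place, the node still has to pick them up. The walkthrough restarts the node; running nodetool refresh packt logins_by_user is another way to load newly placed SSTables without a restart.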
It is important to note that snapshots and incremental backups are essentially hard links to SSTable files on disk. These hard links keep the underlying SSTable data on disk even after compaction would otherwise have removed the files, so the space they occupy is never reclaimed. Therefore, it is recommended to build a process to remove old snapshots and backups that are no longer needed.
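As a starting point for such a cleanup process, the sketch below only lists snapshot directories older than a given number of days, so they can be reviewed before anything is deleted. The function name list_old_snapshots is hypothetical, it assumes the data/keyspace/table/snapshots/name layout shown in this section, and it uses the directory modification time as a rough stand-in for the snapshot's age.

```shell
# Hypothetical helper (not part of Cassandra): print snapshot directories
# whose modification time is more than N days in the past.
#   $1  Cassandra data directory, e.g. /var/lib/cassandra/data
#   $2  age threshold in days
list_old_snapshots() {
  local data_dir
  local days
  data_dir="$1"
  days="$2"

  # Assumed layout: <data_dir>/<keyspace>/<table>-<id>/snapshots/<name>,
  # which places each snapshot directory exactly four levels down.
  find "$data_dir" -mindepth 4 -maxdepth 4 -type d \
       -path '*/snapshots/*' -mtime +"$days"
}

# Example call (commented out because it needs a real data directory):
# list_old_snapshots /var/lib/cassandra/data 30
```

Once a snapshot has been confirmed as no longer needed, it can be removed by name with nodetool clearsnapshot -t followed by the snapshot name, rather than by deleting the files by hand.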