Now that we know what partition and clustering keys are, we can put them to use. Assume that I have to design a table to track user logins to an application. I could solve that with a table like this:
CREATE TABLE packt.logins_by_user (
    user_id text,
    login_datetime timestamp,
    origin_ip text,
    PRIMARY KEY ((user_id), login_datetime))
WITH CLUSTERING ORDER BY (login_datetime DESC);
Next, I'll write some data to test with:
INSERT INTO logins_by_user (user_id, login_datetime, origin_ip)
VALUES ('aploetz','2017-06-01 12:36:01','192.168.0.101');
INSERT INTO logins_by_user (user_id, login_datetime, origin_ip)
VALUES ('aploetz','2017-06-01 12:53:28','192.168.0.101');
INSERT INTO logins_by_user (user_id, login_datetime, origin_ip)
VALUES ('aploetz','2017-06-02 13:23:11','192.168.0.105');
INSERT INTO logins_by_user (user_id, login_datetime, origin_ip)
VALUES ('aploetz','2017-06-03 09:04:55','192.168.0.101');
Take notice of the primary key definition in the preceding code. I am specifying a partition key of user_id, which means that all data for a given user_id will be stored together. By specifying a clustering key of login_datetime, the data within each partition will be sorted by login_datetime, in descending order. You can see this when I query for a particular user_id:
aploetz@cqlsh:packt> SELECT * FROM logins_by_user WHERE user_id='aploetz';
user_id | login_datetime | origin_ip
---------+---------------------------------+---------------
aploetz | 2017-06-03 14:04:55.000000+0000 | 192.168.0.101
aploetz | 2017-06-02 18:23:11.000000+0000 | 192.168.0.105
aploetz | 2017-06-01 17:53:28.000000+0000 | 192.168.0.101
aploetz | 2017-06-01 17:36:01.000000+0000 | 192.168.0.101
(4 rows)
As you can see, there are four entries for the aploetz user_id, sorted newest-first. Notice also that the timestamps are displayed in UTC (+0000); my INSERTs did not specify a time zone, so the values were interpreted in my local time zone (five hours behind UTC) when they were written. When I query this table, I can also use the token() function to see the hashed token for each partition key:
aploetz@cqlsh:packt> SELECT token(user_id), user_id, login_datetime
FROM logins_by_user
WHERE user_id='aploetz';
system.token(user_id) | user_id | login_datetime
-----------------------+---------+---------------------------------
-1472930629430174260 | aploetz | 2017-06-03 14:04:55.000000+0000
-1472930629430174260 | aploetz | 2017-06-02 18:23:11.000000+0000
-1472930629430174260 | aploetz | 2017-06-01 17:53:28.000000+0000
-1472930629430174260 | aploetz | 2017-06-01 17:36:01.000000+0000
(4 rows)
Looking at the first column of the result set, you can see that every row for this user_id maps to the same token. This means that they are stored in the same partition, and thus together on any node responsible for the token range that encompasses -1472930629430174260. Within this partition, the results are ordered by login_datetime, descending.
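Because the partition key is distributed by hash, you cannot do a meaningful range query on user_id itself, but the token() function can also appear in a WHERE clause to scan across partitions by token range. As a sketch against the table above (the boundary token here is simply the value from the previous result):

SELECT token(user_id), user_id, login_datetime
FROM logins_by_user
WHERE token(user_id) > -1472930629430174260
LIMIT 10;

This pattern is mainly useful for paging over all partitions in token order, not for everyday application queries.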
WHERE clauses in a query can only contain components of the primary key (unless a secondary index is in play or ALLOW FILTERING is specified). Furthermore, they must respect the order of the keys: you can omit trailing clustering keys, but you cannot skip over one. For instance, I can omit login_datetime because I am specifying the key that precedes it. I cannot omit user_id and query only by login_datetime, because Cassandra needs to know which partition to read, and it cannot determine that from a clustering key. If another clustering key were defined between user_id and login_datetime, I would have to specify that key too before I could use login_datetime in my WHERE clause.
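To illustrate, both of these queries are valid for the logins_by_user table, because each one restricts the partition key:

SELECT * FROM logins_by_user WHERE user_id='aploetz';
SELECT * FROM logins_by_user
WHERE user_id='aploetz' AND login_datetime >= '2017-06-02';

This query, on the other hand, fails, because it tries to filter on the clustering key alone:

SELECT * FROM logins_by_user WHERE login_datetime >= '2017-06-02';

Cassandra rejects it with an error indicating that the partition key must be restricted (or that ALLOW FILTERING would be required to force the scan).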