Collations are used to sort text values. They can be
used with ORDER BY or GROUP BY clauses, or for defining indexes.
You can also assign a collation to a table column, so that any index or
ordering operation applied to that column will automatically use a specific
collation. Above everything else, SQLite will always sort by datatype. NULLs
will always come first, followed by a mix of integer and floating-point
numeric values in their natural sort order. After the numbers come text
values, followed by BLOBs.
Most types have a clearly defined sort order. NULL types have no values, so they cannot be sorted. Numeric types use their natural numeric ordering, and BLOBs are always sorted using binary comparisons. Where things get interesting is when it comes to text values.
The default collation is known as the
BINARY collation. The BINARY collation sorts individual bytes
using a simple numeric comparison of the underlying character encoding. The
BINARY collation is also used for
BLOBs.
In addition to the default BINARY collation, SQLite includes a built-in
NOCASE and RTRIM collation that can be used with text values. The
NOCASE collation ignores character
case for the purposes of sorting 7-bit ASCII, and would consider the
expression 'A' == 'a' to be true. It does
not, however, consider 'Ä' == 'ä' to be
true, nor does it consider 'Ä' == 'A' to
be true, as the representations of these characters are outside of the ASCII
standard. The RTRIM collation
(right-trim) is similar to the default BINARY collation, only it ignores trailing whitespace (that
is, whitespace on the right side of the value).
While these built-in collations offer some basic options, there are times when complex sort ordering is required. This is especially true when you get into Unicode representations of languages that cannot be represented with a simple 7-bit ASCII encoding. You may also need a specialized sorting function that sorts by whole words or groups of characters if you’re storing something other than natural language text. For example, if you were storing gene sequences as text data, you might require a custom sorting function for that data.
User-defined collation functions allow the developer to define a new collation by registering a comparison function. Once registered, this function is used to compare strings as part of any sorting process. By defining the basic comparison operator, you essentially define the behavior of the whole collation.
To define a custom collation, an application needs to register a comparison function under a collation name. Anytime the database engine needs to sort something under that collation, it uses the comparison function to define the required ordering. You will need to reregister the collation with each database connection that requires it.
There are three API calls that can be used to register a collation comparison function:
int sqlite3_create_collation( sqlite3 *db, const char *name,
int text_rep, void *udp, comp_func )
int sqlite3_create_collation16( sqlite3 *db, const void *name,
int text_rep, void *udp, comp_func )
Registers a collation comparison function with a database
connection. The first parameter is the database
connection. The second parameter is the name of the
custom collation encoded as a UTF-8 or UTF-16
string. The third parameter is the string encoding
the comparison function expects, and can be one of
SQLITE_UTF8,
SQLITE_UTF16,
SQLITE_UTF16BE,
SQLITE_UTF16LE,
or SQLITE_UTF16_ALIGNED (native UTF-16
that is 16-bit memory aligned). The fourth parameter
is a generic user-data pointer that is passed to
your comparison function. The last parameter is a
function pointer to your comparison function (the
prototype of this function is given below).
You can unregister a collation by passing a NULL function pointer in under the same name and text encoding as it was originally registered.
int sqlite3_create_collation_v2( sqlite3 *db, const char *name,
int text_rep, void *udp, comp_func,
dest_func )
This function is the same as sqlite3_create_collation(), with one
additional parameter. The additional sixth parameter
is an optional function pointer referencing a
clean-up function that is called when the collation
is destroyed (the prototype of this function is
given below). This allows the collation to release
any resources associated with the collation (such as
the user-data pointer). A NULL function pointer can
be passed in if no destroy function is
required.
A collation is destroyed when the database connection is closed, when a replacement collation is registered, or when the collation name is cleared by binding a NULL comparison function pointer.
The collation name is case-insensitive. SQLite allows multiple C sorting functions to be registered under the same name, so long as they take different text representations. If more than one comparison function is available under the same name, SQLite will pick the one that requires the least amount of conversion. If you do register more than one function under the same name, their logical sorting behavior should be the same.
The format of the user-defined function pointers is given below.
int user_defined_collation_compare( void* udp,
int lenA, const void *strA,
int lenB, const void *strB )
This is the function type of a
user-defined collation comparison function. The
first parameter is the user-data pointer passed into
sqlite3_create_collation_xxx() as the
fourth parameter. The parameters that follow pass in
the length and buffer pointers for two strings. The
strings will be in whatever encoding was defined by
the register function. You cannot assume the strings
are null-terminated.
The return value should be negative if string A is less than string B (that is, A sorts before B), 0 if the strings are considered equal, and positive if string A is greater than B (A sorts after B). In essence, the return value is the ordering of A minus B.
void user_defined_collation_destroy( void *udp )
This is the function type of the
user-defined collation destroy function. The only
parameter is the user-data pointer passed in as the
fourth parameter to sqlite3_create_collation_v2().
Although collation functions have access to a
user-data pointer, they don’t have an sqlite3_context pointer. That means there is no way
to communicate an error back to the SQLite engine. As such, if you
have a complex collation function, you should try to eliminate as many
error sources as you can. Specifically, that means it is a good idea
to pre-allocate any working buffers you might need, as there is no way
to abort a comparison if your memory allocations fail. Since the
collation function is really just a simple comparison, it is expected
to work and provide an answer every time.
Collations can also be dynamically registered on demand. See sqlite3_collation_needed() in Appendix G for more details.
Here is a simple example of a user-defined collation. In this example,
we’re defining a STRINGNUM
collation that can be used to sort string representations of numeric
values.
Unless they’re the same length, string
representations of numbers often sort in odd ways. For example, using
standard text sorting rules, the string '485' will sort before the string '73' because the character '4' sorts before the character
'7', just as the character
'D' sorts before the
character 'G'. To be clear, these
are text strings made up of characters that represent numeric digits,
not actual numbers.
The collation attempts to convert these strings
into a numeric representation and then use that numeric value for
sorting. Using this collation, the string '485' will sort after '73'. To keep things simple, we’re only going to deal
with integer values:
int col_str_num( void *udp,
int lenA, const void *strA,
int lenB, const void *strB )
{
int valA = col_str_num_atoi_n( (const char*)strA, lenA );
int valB = col_str_num_atoi_n( (const char*)strB, lenB );
return valA - valB;
}
static int col_str_num_atoi_n( const char *str, int len )
{
int total = 0, i;
for ( i = 0; i < len; i++ ) {
if ( ! isdigit( str[i] ) ) {
break;
}
total *= 10;
total += digittoint( str[i] );
}
return total;
}The collation attempts to convert each string into
an integer value using our custom
col_str_num_atoi_n() function, and then
compares the numeric results. The col_
str_
num_atoi_n() function is
very similar to the C standard atoi() function, with the prime difference that it
takes a maximum length parameter. That is required in this case, since
the strings passed into our collations may not be
null-terminated.
We would register this collation with SQLite like this:
sqlite3_create_collation( db, "STRINGNUM", SQLITE_UTF8, NULL, col_str_num );
Because the standard C function
isdigit() is not Unicode aware, our
collation sort function will only work with strings that are limited
to 7-bit ASCII.
We can then have SQL that looks like this:
sqlite>CREATE TABLE t ( s TEXT );sqlite>INSERT INTO t VALUES ( '485' );sqlite>INSERT INTO t VALUES ( '73' );sqlite>SELECT s FROM t ORDER BY s;485 73 sqlite>SELECT s FROM t ORDER BY s COLLATE STRINGNUM;73 485
It would also be possible to permanently associate our collation with a specific table column by including the collation in the table definition. See CREATE TABLE in Appendix C for more details.