Using SQLite

Collation Functions

Collations are used to sort text values. They can be used with ORDER BY or GROUP BY clauses, or for defining indexes. You can also assign a collation to a table column, so that any index or ordering operation applied to that column will automatically use a specific collation. Above everything else, SQLite will always sort by datatype. NULLs will always come first, followed by a mix of integer and floating-point numeric values in their natural sort order. After the numbers come text values, followed by BLOBs.

Most types have a clearly defined sort order. NULL types have no values, so they cannot be sorted. Numeric types use their natural numeric ordering, and BLOBs are always sorted using binary comparisons. Where things get interesting is when it comes to text values.

The default collation is known as the BINARY collation. The BINARY collation sorts individual bytes using a simple numeric comparison of the underlying character encoding. The BINARY collation is also used for BLOBs.

In addition to the default BINARY collation, SQLite includes a built-in NOCASE and RTRIM collation that can be used with text values. The NOCASE collation ignores character case for the purposes of sorting 7-bit ASCII, and would consider the expression 'A' == 'a' to be true. It does not, however, consider 'Ä' == 'ä' to be true, nor does it consider 'Ä' == 'A' to be true, as the representations of these characters are outside of the ASCII standard. The RTRIM collation (right-trim) is similar to the default BINARY collation, only it ignores trailing whitespace (that is, whitespace on the right side of the value).

While these built-in collations offer some basic options, there are times when complex sort ordering is required. This is especially true when you get into Unicode representations of languages that cannot be represented with a simple 7-bit ASCII encoding. You may also need a specialized sorting function that sorts by whole words or groups of characters if you’re storing something other than natural language text. For example, if you were storing gene sequences as text data, you might require a custom sorting function for that data.

User-defined collation functions allow the developer to define a new collation by registering a comparison function. Once registered, this function is used to compare strings as part of any sorting process. By defining the basic comparison operator, you essentially define the behavior of the whole collation.

Registering a Collation

To define a custom collation, an application needs to register a comparison function under a collation name. Anytime the database engine needs to sort something under that collation, it uses the comparison function to define the required ordering. You will need to reregister the collation with each database connection that requires it.

There are three API calls that can be used to register a collation comparison function:

int sqlite3_create_collation( sqlite3 *db, const char *name, int text_rep, void *udp, comp_func ) int sqlite3_create_collation16( sqlite3 *db, const void *name, int text_rep, void *udp, comp_func )

Registers a collation comparison function with a database connection. The first parameter is the database connection. The second parameter is the name of the custom collation encoded as a UTF-8 or UTF-16 string. The third parameter is the string encoding the comparison function expects, and can be one of SQLITE_UTF8, SQLITE_UTF16, SQLITE_UTF16BE, SQLITE_UTF16LE, or SQLITE_UTF16_ALIGNED (native UTF-16 that is 16-bit memory aligned). The fourth parameter is a generic user-data pointer that is passed to your comparison function. The last parameter is a function pointer to your comparison function (the prototype of this function is given below).

You can unregister a collation by passing a NULL function pointer in under the same name and text encoding as it was originally registered.

int sqlite3_create_collation_v2( sqlite3 *db, const char *name, int text_rep, void *udp, comp_func, dest_func )

This function is the same as sqlite3_create_collation(), with one additional parameter. The additional sixth parameter is an optional function pointer referencing a clean-up function that is called when the collation is destroyed (the prototype of this function is given below). This allows the collation to release any resources associated with the collation (such as the user-data pointer). A NULL function pointer can be passed in if no destroy function is required.

A collation is destroyed when the database connection is closed, when a replacement collation is registered, or when the collation name is cleared by binding a NULL comparison function pointer.

The collation name is case-insensitive. SQLite allows multiple C sorting functions to be registered under the same name, so long as they take different text representations. If more than one comparison function is available under the same name, SQLite will pick the one that requires the least amount of conversion. If you do register more than one function under the same name, their logical sorting behavior should be the same.

The format of the user-defined function pointers is given below.

int user_defined_collation_compare( void* udp, int lenA, const void *strA, int lenB, const void *strB )

This is the function type of a user-defined collation comparison function. The first parameter is the user-data pointer passed into sqlite3_create_collation_xxx() as the fourth parameter. The parameters that follow pass in the length and buffer pointers for two strings. The strings will be in whatever encoding was defined by the register function. You cannot assume the strings are null-terminated.

The return value should be negative if string A is less than string B (that is, A sorts before B), 0 if the strings are considered equal, and positive if string A is greater than B (A sorts after B). In essence, the return value is the ordering of A minus B.

void user_defined_collation_destroy( void *udp )

This is the function type of the user-defined collation destroy function. The only parameter is the user-data pointer passed in as the fourth parameter to sqlite3_create_collation_v2().

Although collation functions have access to a user-data pointer, they don’t have an sqlite3_context pointer. That means there is no way to communicate an error back to the SQLite engine. As such, if you have a complex collation function, you should try to eliminate as many error sources as you can. Specifically, that means it is a good idea to pre-allocate any working buffers you might need, as there is no way to abort a comparison if your memory allocations fail. Since the collation function is really just a simple comparison, it is expected to work and provide an answer every time.

Collations can also be dynamically registered on demand. See sqlite3_collation_needed() in Appendix G for more details.

Collation Example

Here is a simple example of a user-defined collation. In this example, we’re defining a STRINGNUM collation that can be used to sort string representations of numeric values.

Unless they’re the same length, string representations of numbers often sort in odd ways. For example, using standard text sorting rules, the string '485' will sort before the string '73' because the character '4' sorts before the character '7', just as the character 'D' sorts before the character 'G'. To be clear, these are text strings made up of characters that represent numeric digits, not actual numbers.

The collation attempts to convert these strings into a numeric representation and then use that numeric value for sorting. Using this collation, the string '485' will sort after '73'. To keep things simple, we’re only going to deal with integer values:

int col_str_num( void *udp,
        int lenA, const void *strA,
        int lenB, const void *strB )
{
    int valA = col_str_num_atoi_n( (const char*)strA, lenA );
    int valB = col_str_num_atoi_n( (const char*)strB, lenB );

    return valA - valB;
}

static int col_str_num_atoi_n( const char *str, int len )
{
    int total = 0, i;
    for ( i = 0; i < len; i++ ) {
        if ( ! isdigit( str[i] ) ) {
            break;
        }
        total *= 10;
        total += digittoint( str[i] );
    }
    return total;
}

The collation attempts to convert each string into an integer value using our custom col_str_num_atoi_n() function, and then compares the numeric results. The col_ str_ num_atoi_n() function is very similar to the C standard atoi() function, with the prime difference that it takes a maximum length parameter. That is required in this case, since the strings passed into our collations may not be null-terminated.

We would register this collation with SQLite like this:

sqlite3_create_collation( db, "STRINGNUM", SQLITE_UTF8, NULL, col_str_num );

Because the standard C function isdigit() is not Unicode aware, our collation sort function will only work with strings that are limited to 7-bit ASCII.

We can then have SQL that looks like this:

sqlite> CREATE TABLE t ( s TEXT );
sqlite> INSERT INTO t VALUES ( '485' );
sqlite> INSERT INTO t VALUES ( '73' );
sqlite> SELECT s FROM t ORDER BY s;
485
73
sqlite> SELECT s FROM t ORDER BY s COLLATE STRINGNUM;
73
485

It would also be possible to permanently associate our collation with a specific table column by including the collation in the table definition. See CREATE TABLE in Appendix C for more details.

Table of Contents for Using SQLite

Collation Functions

Registering a Collation

Collation Example

Table of Contents for
Using SQLite