Using SQLite

Advanced Example: weblog Module

Now that we’ve seen a very simple example of a virtual table module, you should have some idea of how they work. Although our dblist module was a good introduction to how virtual tables operate, it isn’t a very representative example.

To provide a more advanced and realistic example, we’re going to look at a second example module. This module is known as weblog, and is designed to parse Apache httpd server logs and present them to the database engine as a virtual table. It will parse the default Apache combine or common logfile formats, or any other logfile that matches this format. Apache logfiles are cross-platform and reasonably common. Many people have access to logfiles with a decent amount of interesting data, allowing this example to be a bit more hands-on.

Be warned that some of the code explanations will be a bit brief. Although the functions are larger, much of the code involves rather basic housekeeping-type tasks, such as string scanning. Rather than focus on these parts, most of the descriptions will focus on how the code interacts with the SQLite library. Many of the housekeeping details will be, as they say, left as an exercise for the reader.

The weblog module is designed as an external read-only module. The module gets all of its data directly from a web server logfile, making it dependent on external resources to provide data. The module does not let you modify those data sources, however.

A weblog virtual table would be created with an SQL command like this:

CREATE VIRTUAL TABLE current USING weblog( /var/log/httpd/access.log );

Notice that the filename has no single quotes and is not a string literal. Table parameters include everything between the commas (of which we have none, since there is only one argument), so if you need to reference a file with spaces, you can do something like this:

CREATE VIRTUAL TABLE log USING weblog( /var/log/httpd/access log file.txt );

The first example will create a table instance current, and bind it to the data found in the /var/log/httpd/access.log file. The second example will bind the SQL table log to the file access log file.txt in the same directory.

Briefly, the Apache common log format contains seven fields. The first field is the IP address of the client. In rare situations this might be a hostname, but most servers are configured to simply record the IP address in dot-decimal format. The second field is a legacy ident field. Most web servers do not support this, and record only a single dash. The third field records the username, if given. If not given, this field is also recorded as a single dash. The fourth field is a timestamp, surrounded by square brackets ( [ ] ). The fifth is the first line of the HTTP request, in double quotes. This contains the HTTP operation (such as GET or POST) as well as the URL. In the sixth column is the HTTP result code (e.g., 200 for OK, 404 for missing resource), with the number of payload bytes returned in the seventh field.

The combine file format adds two more fields. The eighth field is the referrer header, which contains a URL. The ninth field is the user-agent header, also in double quotes.

Count	Logfile field	Meaning
1	Client Address	IP or hostname of HTTP client
2	Ident	Legacy field, not used
3	Username	Client-provided username
4	Timestamp	Time of transaction
5	HTTP Request	HTTP operation and URL
6	Result Code	Result status of HTTP request
7	Bytes	Payload bytes
8	Referrer	URL of referrer page
9	User Agent	Client software identifier

The weblog module is designed to read the combine file format. However, if given a common logfile that lacks the last two fields, these extra fields will simply be NULL.

Although the logfile has seven or nine columns, the weblog virtual table will have more than nine columns. The virtual table adds a number of additional columns that present the same data in different ways.

For example, the IP address will be returned in one column as a text value that holds the traditional dotted notation. Another column will provide a raw integer representation. The text column is easier for humans to understand, but the integer column allows for faster searches, especially over ranges. The underlying data is the same: the two columns just return the data in different formats. Similarly, the timestamp column can return the string value from the logfile, or it can return separate integer values for the year, month, day, etc.

If this were a fully supported SQLite extension, it would likely include more than just the weblog module. Ideally, it would also include a number of utility functions, such as a function that converted text values containing dot-decimal IP addresses to and from integer values. (Then again, if this were a fully supported module, it would include decent error messages and other polish that this example lacks. I’m trying to keep the line counts as small as possible.) Some of these functions would reduce the need for extra columns, since you could just convert the data using SQL, but there are still times when having the extra columns is extremely useful.

Create and Connect

Since the weblog module is an external module, there isn’t any data to initialize. This means that, like the dblist, we can use the same function for both xCreate() and xConnect().

Before we get into the function, let’s have a quick look at our augmented vtab structure. Since this module does not use the table name for anything, the only data we need to keep around is the logfile filename:

typedef struct weblog_vtab_s {
    sqlite3_vtab   vtab;
    char           *filename;
} weblog_vtab;

The weblog create/connect function is a bit longer than the dblist version, but still fairly easy to follow. First, it verifies that we have exactly four arguments. Remember that the first three arguments are always the module name, the database name, and the table name. The fourth argument is the first user-provided argument, which in this case is the log filename. The function tries to open that file for read-only access, just to verify the file is there and can be opened it for reading. This test isn’t foolproof, but it is a nice check. The module then allocates the vtab structure, stashes a copy of the filename, and declares the table definition:

static int weblog_connect( sqlite3 *db, void *udp, int argc, 
        const char *const *argv, sqlite3_vtab **vtab, char **errmsg )
{
    weblog_vtab  *v = NULL;
    const char   *filename = argv[3];
    FILE         *ftest;

    if ( argc != 4 ) return SQLITE_ERROR;

    *vtab = NULL;
    *errmsg = NULL;

    /* test to see if filename is valid */
    ftest = fopen( filename, "r" );
    if ( ftest == NULL ) return SQLITE_ERROR;
    fclose( ftest );

    /* allocate structure and set data */
    v = sqlite3_malloc( sizeof( weblog_vtab ) );
    if ( v == NULL ) return SQLITE_NOMEM;
    ((sqlite3_vtab*)v)->zErrMsg = NULL; /* need to init this */

    v->filename = sqlite3_mprintf( "%s", filename );
    if ( v->filename == NULL ) {
        sqlite3_free( v );
        return SQLITE_NOMEM;
    }
    v->db = db;

    sqlite3_declare_vtab( db, weblog_sql );
    *vtab = (sqlite3_vtab*)v;
    return SQLITE_OK;
}

The table definition contains 20 columns total. The first 9 map directly to the fields within the logfile, while the extra 11 columns provide different representations of the same data. The last column represents the whole line of the logfile, without modifications:

const static char *weblog_sql = 
"    CREATE TABLE weblog (           "
"        ip_str       TEXT,          "  /*  0 */
"        login        TEXT HIDDEN,   "  /*  1 */
"        user         TEXT,          "  /*  2 */
"        time_str     TEXT,          "  /*  3 */
"        req          TEXT,          "  /*  4 */
"        result       INTEGER,       "  /*  5 */
"        bytes        INTEGER,       "  /*  6 */
"        ref          TEXT,          "  /*  7 */
"        agent        TEXT,          "  /*  8 */
#define TABLE_COLS_SCAN                9
"        ip_int       INTEGER,       "  /*  9 */
"        time_day     INTEGER,       "  /* 10 */
"        time_mon_s   TEXT,          "  /* 11 */
"        time_mon     INTEGER,       "  /* 12 */
"        time_year    INTEGER,       "  /* 13 */
"        time_hour    INTEGER,       "  /* 14 */
"        time_min     INTEGER,       "  /* 15 */
"        time_sec     INTEGER,       "  /* 16 */
"        req_op       TEXT,          "  /* 17 */
"        req_url      TEXT,          "  /* 18 */
"        line         TEXT HIDDEN    "  /* 19 */
"     );                             ";
#define TABLE_COLS                    20

You may have noticed a few of the columns have the keyword HIDDEN. This keyword is only valid for virtual table definitions. Any column marked HIDDEN will not be returned by SELECT * FROM... style queries. You can explicitly request the column, but it is not returned by default. This is very similar in behavior to the ROWID column found in standard tables. In our case, we’ve marked the login and line columns as HIDDEN. The login column almost never contains valid data, while the line column is redundant (and large). The columns are there if you need them, but in most cases people aren’t interested in seeing them. To keep the general output cleaner, I’ve chosen to hide them.

Disconnect and Destroy

As with xConnect() and xCreate(), the weblog xDisconnect() and xDestroy() functions share the same implementation:

static int weblog_disconnect( sqlite3_vtab *vtab )
{
    sqlite3_free( ((weblog_vtab*)vtab)->filename );
    sqlite3_free( vtab );
    return SQLITE_OK;
}

Free up the memory used for the filename, free up the memory used by the vtab structure, and return. Simple and easy.

Other Table Functions

The last set of table-level functions includes xBestIndex(), xFindFunction(), xRename(), and xUpdate(), as well as the four transactional functions, xBegin(), xSync(), xCommit(), and xRollback(). The xFindFunction() is optional, and the weblog module has no use for it, so there is no implementation of this function. Since this is a read-only module, same is true of xUpdate(). Similarly, the transactional functions are also optional and not required for read-only modules. For table-level functions, that leaves only xRename() and xBestIndex().

The xRename() function is required, but since the module makes no use of the virtual table instance name, it is basically a no-op:

static int weblog_rename( sqlite3_vtab *vtab, const char *newname )
{
    return SQLITE_OK;
}

In the case of the weblog module, once you set the name of the external logfile when creating a virtual table, there is no way to alter it, other than dropping and re-creating the table.

The last function, xBestIndex(), is required, but it isn’t actually returning any useful data:

static int weblog_bestindex( sqlite3_vtab *vtab, sqlite3_index_info *info )
{
    return SQLITE_OK;
}

Since the module has no indexing system, it can’t offer any optimized search patterns. The logfile is always scanned start to finish anyway, so every query is a full table scan.

Open and Close

We can now move on to the cursor functions. The first thing to look at is the weblog cursor structure. The weblog cursor is a bit more complex than the dblist example, as it needs to read and scan the data values from the logfile.

There are three basic sections to this structure. The first is the base sqlite3_vtab_cursor structure. As always, this must come first, and must be a full instance of the structure:

#define LINESIZE 4096

typedef struct weblog_cursor_s {
    sqlite3_vtab_cursor   cur;               /* this must be first */

    FILE           *fptr;                    /* used to scan file */
    sqlite_int64   row;                      /* current row count (ROWID) */
    int            eof;                      /* EOF flag */

    /* per-line info */
    char           line[LINESIZE];           /* line buffer */
    int            line_len;                 /* length of data in buffer */
    int            line_ptrs_valid;          /* flag for scan data */
    char           *(line_ptrs[TABLE_COLS]); /* array of pointers */
    int            line_size[TABLE_COLS];    /* length of data for each pointer */
} weblog_cursor;

The second block deals with the data we need to scan the logfile. The weblog module uses the standard C library f functions (such as fopen()) to open and scan the logfile. Each weblog cursor needs a unique FILE pointer, just as each dblist cursor required a unique statement structure. The module uses the FILE structure to keep track of its location within the file, so each cursor needs its own unique FILE structure. The cursor needs to keep track of the number of lines it has read from the file, as this value is used as the ROWID. Finally, the cursor needs an EOF flag to indicate when it has reached the end of the file.

Having a unique FILE pointer for each cursor means the module needs to reopen the file for each table scan. In the case of the weblog module, this is actually an advantage, as each table scan will reassociate itself with the correct file. This can be important in a web server environment, where logfiles may roll frequently.

The third section of the weblog_cursor structure holds everything the cursor needs to know about the current line. The cursor has a buffer to hold the text and length of the current line. There are also a series of pointers and length counters that are used to scan the line. Since scanning the line is fairly expensive, and must be done all at once, the module delays scanning the line until it’s sure the data is needed. Once scanned, the module will keep the scan data around until it reads a new line. To keep track of when a line has been scanned, the cursor contains a “valid” flag.

As we go through the rest of the module functions, you’ll see how these fields are used.

You might be thinking that a 4 KB line buffer seems a bit large, but frequently it is not enough. CGI scripts that use extensive query strings can generate very long logfile lines. Another issue is that many referrer URLs, especially those from search engines, can be extremely large. While most lines are only a hundred characters or so, it is best if the module can try to deal with the longer ones as well. Even with a 4 KB buffer, you’ll need to properly deal with potential buffer overflows.

Now that we’ve seen what the cursor looks like, let’s have a look at how it is opened and created. When the module needs to create a new cursor, it will first attempt to open the correct logfile. Assuming that succeeds, it will allocate the cursor structure and initialize the basic data:

static int weblog_open( sqlite3_vtab *vtab, sqlite3_vtab_cursor **cur )
{
    weblog_vtab     *v = (weblog_vtab*)vtab;
    weblog_cursor   *c;
    FILE            *fptr;

    *cur = NULL;

    fptr = fopen( v->filename, "r" );
    if ( fptr == NULL ) return SQLITE_ERROR;

    c = sqlite3_malloc( sizeof( weblog_cursor ) );
    if ( c == NULL ) {
        fclose( fptr );
        return SQLITE_NOMEM;
    }
    
    c->fptr = fptr;
    *cur = (sqlite3_vtab_cursor*)c;
    return SQLITE_OK;
}

The open function doesn’t need to initialize the line data, as this will all be reset when we read the first line from the data file.

The xClose() function is relatively simple:

static int weblog_close( sqlite3_vtab_cursor *cur )
{
    if ( ((weblog_cursor*)cur)->fptr != NULL ) {
        fclose( ((weblog_cursor*)cur)->fptr );
    }
    sqlite3_free( cur );
    return SQLITE_OK;
}

Close the file, release the memory.

Filter

Since the weblog module chooses to ignore the xBestIndex() function, it largely ignores xFilter() as well. The file is reset to the beginning, just to be sure, and the module reads the first line of data:

static int weblog_filter( sqlite3_vtab_cursor *cur,
        int idxnum, const char *idxstr,
        int argc, sqlite3_value **value )
{
    weblog_cursor   *c = (weblog_cursor*)cur;

    fseek( c->fptr, 0, SEEK_SET );
    c->row = 0;
    c->eof = 0;
    return weblog_get_line( (weblog_cursor*)cur );
}

The weblog_get_line() function reads in a single line from the logfile and copies it into our line buffer. It also verifies that it got a full line. If it didn’t get a full line, the function keeps reading (but discards the input) to make sure the file location is left at the beginning of the next valid line. We can reduce how often this happens by making the line buffer bigger, but no matter how big we make the buffer, it is always a good idea to make sure a whole line is consumed, even if the tail is discarded:

static int weblog_get_line( weblog_cursor *c )
{
    char   *cptr;
    int    rc = SQLITE_OK;

    c->row++;                          /* advance row (line) counter */
    c->line_ptrs_valid = 0;            /* reset scan flag */
    cptr = fgets( c->line, LINESIZE, c->fptr );
    if ( cptr == NULL ) {  /* found the end of the file/error */
        if ( feof( c->fptr ) ) {
            c->eof = 1;
        } else {
            rc = -1;
        }
        return rc;
    }
    /* find end of buffer and make sure it is the end a line... */
    cptr = c->line + strlen( c->line ) - 1;       /* find end of string */
    if ( ( *cptr != '\n' )&&( *cptr != '\r' ) ) { /* overflow? */
        char   buf[1024], *bufptr;
        /* ... if so, keep reading */
        while ( 1 ) {
            bufptr = fgets( buf, sizeof( buf ), c->fptr );
            if ( bufptr == NULL ) {  /* found the end of the file/error */
                if ( feof( c->fptr ) ) {
                    c->eof = 1;
                } else {
                    rc = -1;
                }
                break;
            }
            bufptr = &buf[ strlen( buf ) - 1 ];
            if ( ( *bufptr == '\n' )||( *bufptr == '\r' ) ) {
                break;               /* found the end of this line */
            }
        }
    }

    while ( ( *cptr == '\n' )||( *cptr == '\r' ) ) {
        *cptr-- = '\0';   /* trim new line characters off end of line */
    }
    c->line_len = ( cptr - c->line ) + 1;
    return rc;
}

Besides reading a full line, this function also resets the scan flag (to indicate the line buffer has not had the individual fields scanned) and adds one (1) to the line count. At the end, the function also trims off any trailing newline or carriage return characters.

Rows and Columns

We only have a few functions left. In specific, the module only needs to define the two row-handling functions, xNext() and xEof(). We also need the two column functions, xRowid() and xColumn().

Three of these four functions are quite simple. The xNext() function can call weblog_get_line(), just as the xFilter() function did. The xEof() and xRowid() functions return or pass back values that have already been calculated elsewhere:

static int weblog_next( sqlite3_vtab_cursor *cur )
{
    return weblog_get_line( (weblog_cursor*)cur );
}

static int weblog_eof( sqlite3_vtab_cursor *cur )
{
    return ((weblog_cursor*)cur)->eof;
}

static int weblog_rowid( sqlite3_vtab_cursor *cur, sqlite3_int64 *rowid )
{
    *rowid = ((weblog_cursor*)cur)->row;
    return SQLITE_OK;
}

The interesting function is the xColumn() function. If you’ll recall, in addition to the line buffer, the weblog_cursor structure also had an array of character pointers and length values. Each of these pointers and lengths corresponds to a column value in the defined table format. Before the module can extract those values, it needs to scan the input line and mark all the columns by setting the pointer and length values.

Using a length value means the module doesn’t need to insert termination characters into the original string buffer. That’s good, since several of the fields overlap. Using terminating characters would require making private copies of these data fields. In the end, a length value is quite useful anyway, as most of SQLite’s value-handling routines utilize length values.

The function that sets up all these pointers and length calculations is weblog_scanline(). We’ll work our way through this section by section. At the top are, of course, the variable definitions. The start and end pointers will be used to scan the line buffer, while the next value keeps track of the terminating character for the current field:

static int weblog_scanline( weblog_cursor *c )
{
    char   *start = c->line, *end = NULL, next = ' ';
    int    i;

    /* clear pointers */
    for ( i = 0; i < TABLE_COLS; i++ ) {
        c->line_ptrs[i] = NULL;
        c->line_size[i] = -1;
    }

With the variables declared, the first order of business is to reset all of the column pointers and sizes.

Next, the scan function loops over the native data fields in the line. This scans up to nine fields from the line buffer. These fields correspond to all the primary fields in a combine format logfile. If the logfile is a common format file (with only seven fields) or if the line buffer was clipped off, fewer fields are scanned. Any fields that are not properly scanned will eventually end up returning NULL SQL values:

    /* process actual fields */
    for ( i = 0; i < TABLE_COLS_SCAN; i++ ) {
        next = ' ';
        while ( *start == ' ' )  start++;     /* trim whitespace */
        if (*start == '\0' )  break;          /* found the end */
        if (*start == '"' ) {
            next = '"';  /* if we started with a quote, end with one */
            start++;
        }
        else if (*start == '[' ) {
            next = ']';  /* if we started with a bracket, end with one */
            start++;
        }
        end = strchr( start, next );    /* find end of this field */
        if ( end == NULL ) {            /* found the end of the line */
            int     len = strlen ( start );
            end = start + len;          /* end now points to '\0' */
        }
        c->line_ptrs[i] = start;        /* record start */
        c->line_size[i] = end - start;  /* record length */
        while ( ( *end != ' ' )&&( *end != '\0' ) )  end++;  /* find end */
        start = end;
    }

This loop attempts to scan one field at a time. The first half of the loop figures out the ending character of the field. In most cases it is a space, but it can also be a double-quote or square bracket. Once it knows what it’s looking for, the string is scanned for the next end marker. If the marker isn’t found, the rest of the string is used.

When this loop exits, the code has attempted to set up the first nine column pointers. These make up the native fields of the logfile. The next step is to set up pointers and lengths for the additional 11 columns that represent subfields and alternate representations. The first additional value is the IP address, returned as an integer. This function doesn’t do data conversions, so a direct copy of pointer and length from the first column can be made:

    /* process special fields */
    /* ip_int - just copy */
    c->line_ptrs[9] = c->line_ptrs[0];
    c->line_size[9] = c->line_size[0];

Next, all of the date field pointers and lengths are set up. This section of code makes some blatant assumptions about the format of the timestamp, but there isn’t much choice. The code could scan the individual fields, but it would still be forced to make assumptions about the ordering of the fields. In the end, it is easiest to just assume the format is consistent and hardcode the field lengths. This example ignores the time zone information:

    /* assumes: "DD/MMM/YYYY:HH:MM:SS zone" */
    /*     idx:  012345678901234567890...   */
    if (( c->line_ptrs[3] != NULL )&&( c->line_size[3] >= 20 )) {
        start = c->line_ptrs[3];
        c->line_ptrs[10] = &start[0];    c->line_size[10] = 2;
        c->line_ptrs[11] = &start[3];    c->line_size[11] = 3;
        c->line_ptrs[12] = &start[3];    c->line_size[12] = 3;
        c->line_ptrs[13] = &start[7];    c->line_size[13] = 4;
        c->line_ptrs[14] = &start[12];   c->line_size[14] = 2;
        c->line_ptrs[15] = &start[15];   c->line_size[15] = 2;
        c->line_ptrs[16] = &start[18];   c->line_size[16] = 2;
    }

After the date fields, the next step is to extract the HTTP operation and URL. These are extracted as the first two subfields of the HTTP Request log field. The code plays some games to be sure it doesn’t accidentally pass a NULL pointer into strchr(), but otherwise it just finds the first two spaces and considers those to be the ending of the two fields it is trying to extract:

    /* req_op, req_url */
    start = c->line_ptrs[4];
    end = ( start == NULL ? NULL : strchr( start, ' ' ) );
    if ( end != NULL ) {
        c->line_ptrs[17] = start;
        c->line_size[17] = end - start;
        start = end + 1;
    }
    end = ( start == NULL ? NULL : strchr( start, ' ' ) );
    if ( end != NULL ) {
        c->line_ptrs[18] = start;
        c->line_size[18] = end - start;
    }

The final column represents the full contents of the line buffer. We also need to set the valid flag to indicate the field pointers are valid and ready for use:

    /* line */
    c->line_ptrs[19] = c->line;
    c->line_size[19] = c->line_len;

    c->line_ptrs_valid = 1;
    return SQLITE_OK;
}

Once this function has been called, all the fields that could be scanned will have a valid pointer and length value. With the data scanned, this and subsequent calls to xColumn() can use the relevant values to pass back their database values. Let’s return to looking at xColumn().

The first thing the xColumn() code does is making sure the line has already been scanned. If not, the code calls weblog_scanline() to set up all the field pointers:

static int weblog_column( sqlite3_vtab_cursor *cur, sqlite3_context *ctx, int cidx )
{
    weblog_cursor    *c = (weblog_cursor*)cur;

    if ( c->line_ptrs_valid == 0 ) {
        weblog_scanline( c );         /* scan line, if required */
    }
    if ( c->line_size[cidx] < 0 ) {   /* field not scanned and set */
        sqlite3_result_null( ctx );
        return SQLITE_OK;
    }

Next, if the requested column doesn’t have a valid set of values, the module passes back an SQL NULL for the column.

The code then processes columns with specific conversion needs. Any column that needs special processing or conversion will be caught by this switch statement. The first specialized column is the integer version of the IP address. This block of code converts each octet of the IP address into an integer value. The only issue is that all integer values within SQLite are signed, so the code needs to be careful about constructing the value into a 64-bit integer. For maximum compatibility, it avoids using shift operations:

    switch( cidx ) {
    case 9: { /* convert IP address string to signed 64 bit integer */
        int            i;
        sqlite_int64   v = 0;
        char          *start = c->line_ptrs[cidx], *end, *oct[4];

        for ( i = 0; i < 4; i++ ) {
            oct[i] = start;
            end = ( start == NULL ? NULL : strchr( start, '.' ) );
            if ( end != NULL ) {
                start = end + 1;
            }
        }
        v += ( oct[3] == NULL ? 0 : atoi( oct[3] ) ); v *= 256;
        v += ( oct[2] == NULL ? 0 : atoi( oct[2] ) ); v *= 256;
        v += ( oct[1] == NULL ? 0 : atoi( oct[1] ) ); v *= 256;
        v += ( oct[0] == NULL ? 0 : atoi( oct[0] ) );
        sqlite3_result_int64( ctx, v );
        return SQLITE_OK;
    }

The next specialized column is one of the two month fields. In the logfile, the month value is given as a three-character abbreviation. One column returns this original text value, while another returns a numeric value. To convert from the abbreviation to the numeric value, the code simply looks for constants in the month string. If it can’t find a match, the code breaks out. As we’ll see, if the code breaks out it will eventually end up returning the text value:

    case 12: { 
        int m = 0;
             if ( strncmp( c->line_ptrs[cidx], "Jan", 3 ) == 0 ) m =  1;
        else if ( strncmp( c->line_ptrs[cidx], "Feb", 3 ) == 0 ) m =  2;
        else if ( strncmp( c->line_ptrs[cidx], "Mar", 3 ) == 0 ) m =  3;
        else if ( strncmp( c->line_ptrs[cidx], "Apr", 3 ) == 0 ) m =  4;
        else if ( strncmp( c->line_ptrs[cidx], "May", 3 ) == 0 ) m =  5;
        else if ( strncmp( c->line_ptrs[cidx], "Jun", 3 ) == 0 ) m =  6;
        else if ( strncmp( c->line_ptrs[cidx], "Jul", 3 ) == 0 ) m =  7;
        else if ( strncmp( c->line_ptrs[cidx], "Aug", 3 ) == 0 ) m =  8;
        else if ( strncmp( c->line_ptrs[cidx], "Sep", 3 ) == 0 ) m =  9;
        else if ( strncmp( c->line_ptrs[cidx], "Oct", 3 ) == 0 ) m = 10;
        else if ( strncmp( c->line_ptrs[cidx], "Nov", 3 ) == 0 ) m = 11;
        else if ( strncmp( c->line_ptrs[cidx], "Dec", 3 ) == 0 ) m = 12;
        else break;    /* give up, return text */
        sqlite3_result_int( ctx, m );
        return SQLITE_OK;
    }

There are a number of additional columns (including some of the “native” ones) that are returned as integers. None of these columns require special processing, other than the string-to-integer conversion. The standard atoi() function is used for this conversion. Although the string pointers are not null-terminated, the atoi() function will automatically return once it encounters a non-numeric character. Since all of these fields are bound by spaces or other characters, this works out exactly the way we want:

    case 5:    /* result code */
    case 6:    /* bytes transfered */
    case 10:   /* day-of-month */
    case 13:   /* year */
    case 14:   /* hour */
    case 15:   /* minute */
    case 16:   /* second */
        sqlite3_result_int( ctx, atoi( c->line_ptrs[cidx] ) );
        return SQLITE_OK;
    default:
        break;
    }
    sqlite3_result_text( ctx, c->line_ptrs[cidx],
                              c->line_size[cidx], SQLITE_STATIC );
    return SQLITE_OK;
}

Finally, any field that did not require special processing is returned as a text value. Although the line buffer will be overwritten when the next line is read, the data pointer passed into sqlite3_result_text() only needs to stay valid until the next call to xNext(). This allows the module to use the SQLITE_STATIC flag.

With that, we’ve defined all the required functions for our weblog module.

Register the Module

Now that we’ve seen how all the module functions are implemented, the last thing to do is register the weblog module as part of the extension initialization function:

static sqlite3_module weblog_mod = {
    1,                   /* iVersion        */
    weblog_connect,      /* xCreate()       */
    weblog_connect,      /* xConnect()      */
    weblog_bestindex,    /* xBestIndex()    */
    weblog_disconnect,   /* xDisconnect()   */
    weblog_disconnect,   /* xDestroy()      */
    weblog_open,         /* xOpen()         */
    weblog_close,        /* xClose()        */
    weblog_filter,       /* xFilter()       */
    weblog_next,         /* xNext()         */
    weblog_eof,          /* xEof()          */
    weblog_column,       /* xColumn()       */
    weblog_rowid,        /* xRowid()        */
    NULL,                /* xUpdate()       */
    NULL,                /* xBegin()        */
    NULL,                /* xSync()         */
    NULL,                /* xCommit()       */
    NULL,                /* xRollback()     */
    NULL,                /* xFindFunction() */
    weblog_rename        /* xRename()       */
};

int weblog_init( sqlite3 *db, char **error, const sqlite3_api_routines *api )
{
    SQLITE_EXTENSION_INIT2(api);
    return sqlite3_create_module( db, "weblog", &weblog_mod, NULL );
}

Since there is no attempt to create an instance of a weblog table, this initialization function is a bit simpler than the previous dblist example.

Example Usage

Now that we’ve worked through the whole example, let’s see what the code can do. Here are a few different examples that show off the power of the weblog module.

While doing these types of queries is not a big deal for people that are comfortable with SQL, realize that we can run all of these queries without having to first import the logfile data. Not only does that make the whole end-to-end process much faster, it means we can run these types of queries against active, “up to the second” logfiles.

To show off how this module works, the server administrators of http://oreilly.com/ were nice enough to provide me with some of their logfiles. The file referred to as oreilly.com_access.log is an Apache combine logfile with 100,000 lines of data. Once compiled and built into a loadable module, we can import the weblog module and create a virtual table that is bound to this file using these commands:

sqlite> .load weblog.sqlite3ext weblog_init
sqlite> CREATE VIRTUAL TABLE log USING weblog( oreilly.com_access.log );

We then issue queries to look at different aspects of the file. For example, if we want to know what the most common URL is, we run a query like this:

sqlite> SELECT count(*) AS Count, req_url AS URL FROM log
   ...>   GROUP BY 2 ORDER BY 1 DESC LIMIT 8;

Count  URL                                     
-----  ----------------------------------------
2490   /images/oreilly/button_cart.gif         
2480   /images/oreilly/button_acct.gif         
2442   /styles/all.css                         
2348   /images/oreilly/888-line.gif            
2233   /styles/chrome.css                      
2206   /favicon.ico                            
1975   /styles/home2.css                       
1941   /images/oreilly/satisfaction-icons.gif

It is fairly common to see favicon.ico very near the top, along with any site-wide CSS and image files. In the case of smaller sites that have a lot less traffic, it isn’t uncommon for the most requested URL to be /robots.txt, which is used by search engines.

We can also see what the most expensive items on the website are, in terms of bytes moved:

sqlite> SELECT sum(bytes) AS Bytes, count(*) AS Count, req_url AS URL
   ...>   FROM log WHERE result = 200 GROUP BY 3 ORDER BY 1 DESC LIMIT 8;

Bytes     Count  URL                                          
--------  -----  ---------------------------------------------
46502163  1137   /images/oreilly/mac_os_x_snow_leopard-148.jpg
40780252  695    /                                            
37171328  2384   /styles/all.css                              
35403200  2180   /styles/chrome.css                           
31728906  781    /catalog/assets/pwr/engine/js/full.js        
31180460  494    /catalog/9780596510046/index.html            
21573756  88     /windows/archive/PearPC.html                 
21560154  3      /catalog/dphotohdbk/chapter/ch03.pdf

We see that some of these items are not that large, but are requested frequently. Other items have only a small number of requests, but are big enough to make a noticeable contribution to the total number of served bytes.

Here is one final example. This shows what IP addresses are downloading the most number of unique items. Since this is from live data, I’ve altered the IP addresses:

sqlite> SELECT count(*) AS Uniq, sum(sub_count) AS Ttl,
   ...>        sum(sub_bytes) AS TtlBytes, sub_ip AS IP
   ...>   FROM (SELECT count(*) AS sub_count, sum(bytes) AS sub_bytes,
   ...>                ip_str AS sub_ip FROM log GROUP BY 3, req_url)
   ...>   GROUP BY 4 ORDER BY 1 DESC LIMIT 8;

Uniq  Ttl   TtlBytes    IP                
----  ----  ----------  ------------------
1295  1295  31790418    10.5.69.83
282   334   13571771    10.170.13.97
234   302   4234382     10.155.7.28
213   215   3089112     10.155.7.77
163   176   2550477     10.155.7.29
159   161   4279779     10.195.137.175
153   154   2292407     10.23.146.198
135   171   2272949     10.155.7.71

For each IP address, the first column is the number of unique URLs requested, while the second column is the total number of requests. The second column should always be greater than or equal to the first column. The third column is the total number of bytes, followed by the (altered) IP address in question. Exactly how this query works is left as an exercise for the reader.

There are countless other queries we could run. For anyone that has ever imported log data into an SQL database and played around with it, none of this is particularly inspiring. But consider this for a moment: the query time for the first two of these examples is a bit less than five seconds on an economy desktop system that is several years old. The third query was a bit closer to eight seconds.

Five seconds to scan a 100,000-row table might not be blazingly fast, but remember that those five seconds are the grand total for everything, including data “import.” Using the virtual table module allows us to go from a raw logfile with 100,000 lines to a query answer in just that amount of time—no data staging, no format conversions, no data imports. That’s important, since importing involves a lot of I/O and can be a slow process. For example, importing the same file into a standard SQLite table by more traditional means takes nearly a minute and that doesn’t even include any queries!

Now consider that we enable all this functionality with less than 400 lines of C code. Accessing the original data, rather than importing it into standard tables, allows the end-to-end data analysis process to be much faster, and allows you to query the data, as it is recorded by the web server, in real time. As an added bonus, the virtual table can also be used as an importer, by using the CREATE TABLE... AS or INSERT... SELECT SQL commands.

If you find yourself faced with the task of writing a script to analyze, search, or summarize some structured source of data, you might consider writing an SQLite module instead. A basic, read-only module is a fairly minor project, and once you’ve got that in place you have the complete power of the SQLite database engine at your disposal (plus an added data importer!). That makes it easy to write, test, and tune whatever queries you need in just a few lines of SQL.

Previous Chapter

Simple Example: dblist Module

Next Chapter

Best Index and Filter

Table of Contents for Using SQLite