We favor the simple expression of the complex thought.
...
We are for flat forms
Because they destroy illusion and reveal truth.

—Le Tigre, “Slideshow at Free University”
Here is the common format for a typical library, in C or in any other language: a few central data structures, plus many functions that act on those structures.
An XML library, for example, would have a structure representing an XML document and perhaps views of the document, plus lots of functions for going between the data structure and the XML file on disk, querying the structure for elements, et cetera. A database library would have a structure representing the state of communications with the database, and perhaps structures representing tables, plus lots of functions for talking to the database and dissecting the data it sends.
This is an eminently sensible way to organize a program or a library. It is the means by which an author can represent concepts with nouns and verbs that are appropriate to the problem at hand.
The first fun exercise in object-oriented programming (OOP) is defining the term, and although I won’t waste time (and invite flame wars) by giving a precise definition, the preceding description of an object-oriented library should give you a feel for what we are going after: a few central data structures, each with a set of functions that act on those central data structures. Building a structure like this is language-independent, unless you have a serious and firm belief in the Sapir-Whorf hypothesis.
Because OOP has so much baggage associated with it, the first part of this chapter is mostly concerned with what portion of the OOP world is really necessary (or even desirable) for structuring our writing. The discussion will then turn to what an object looks like in C, such as how we can have one structure inherit from another, or when to put struct-related methods inside the struct itself. The chapter will conclude with a few full-scale examples of objects with nontrivial problems, like the need for reference counting. Even with the limited syntactic tools we have, the framework works gracefully and is maintainable.
C is great for its simplicity, but with such a simple grammar, how are we to resolve the multiple virtual constructors one can be saddled with when instantiating an instance of a subclass derived via polymorphic inheritance?
And the simple answer, of course, is that we don’t.
You will find OOP proponents who insist that it’s not OOP without a message-passing infrastructure, object-level scoping constructs, operator overloading, class inheritance rules, et cetera; and for each of these features, you will find an OOP proponent who insists that the feature is an irrelevance to true OOP.[18]
It is so much easier to just stop worrying about it. To develop central structures and build functions that make use of them, C syntax easily provides the 10% of the edifice that creates 90% of the structure.
The scope of a variable is the range of the code over which it exists and can be used. The rule of thumb for sane programming is to keep the scope of a variable as small as practicable, because doing so limits the number of variables you have to keep in mind at any given point, and means lower risk that a variable will be changed by code you didn’t bear in mind.
OK, here goes: all of the rules for variable scope in C.
A variable never has scope in the code before it is declared. That would be silly.
If a variable is declared somewhere inside a pair of curly
braces, then at the closing curly brace, the variable goes out of
scope. Semi-exception: for loops
and functions may have variables declared in a set of parens just
before their opening curly brace; variables declared within the
parens have scope as if they were declared inside the curly
braces.
If a variable isn’t inside any curly braces, then it has scope from its declaration to the end of the file.
You’re done.
There is no class scope, prototype scope, friend scope, namespace
scope, runtime environment rebinding, or special scoping keywords or
operators (beyond those curly braces, and arguably the linkage
specifiers static and extern). Does dynamic scoping confuse you?
Don’t worry about it. If you know where the curly braces are, you can
determine which variables can be used where.
Everything else is a simple corollary. For example, if
code.c has a line that will #include <header.h>, then the full text
of header.h is pasted into
code.c, and variables therein have scope
accordingly.
Functions are just another example of curly-brace scope. Here is a sample function to sum all the integers up to the input number:
int sum (int max){
int total=0;
for (int i=0; i<= max; i++){
total += i;
}
return total;
}
Then max and total have scope inside the function, by the
curly-brace rule and the semi-exception about how variables in parens
just before the curly brace act as if they are inside the braces. The
same holds with the for loop, and how
i is born and dies with the curly
braces of the for loop. If you have a
one-line for loop, you don’t have to
write the curly braces, like for (int i=0; i <= max; i++)
total += i;, but the scope of i is still limited to the loop.
Summary paragraph: C is awesome for having such simple scoping
rules, which effectively consist of finding the end of the enclosing
curly braces or the end of the file. You can teach the whole scoping
system to a novice student in maybe 10 minutes. For the experienced
author, the rule is more general than just the curly braces for
functions and for loops, so you can
use them for occasional additional scoping restrictions in exceptional
situations, as in the macro tricks in Cultivate Robust and Flourishing Macros.
So we’re cathartically throwing out all the additional rules and keywords that support very fine-grained scope control.
Could we implement private struct elements without the extra keywords? In typical OOP usage, “private” data is not encrypted by the compiler or otherwise seriously hidden: if you have the address of the variable (e.g., if you have its offset in the struct), you can point to it, look at it in the debugger, and modify it. To give the data that level of opacity, we have the technology.
An object will typically be defined via two files: the .c file with the details and the .h file to be included in other writing that makes use of the object. It is not unreasonable to think of the .c file as the private segment and the .h file as the public. For example, say we are set on keeping some elements of an object private. The public header might be:
typedef struct a_box_s {
int public_size;
void *private;
} a_box_s;
The void pointer is basically
useless to other authors, because they don’t know what type to cast it
to. The private segment, a_box.c, would hold the
requisite typedef:
typedef struct private_box_s {
long double how_much_i_hate_my_boss;
char **coworkers_i_have_a_crush_on;
double fudge_factor;
} private_box_s;
//Given the typedef, we have no problem casting the private pointer to
//its desired type and making use here in a_box.c.
a_box_s *box_new(){
a_box_s *out = malloc(sizeof(a_box_s));
private_box_s *outp = malloc(sizeof(private_box_s));
*out = (a_box_s){.public_size=0, .private=outp};
return out;
}
void box_edit(a_box_s *in){
private_box_s *pb = in->private;
//now work with private variables, e.g.:
pb->fudge_factor *= 2;
}
So it’s not all that hard to implement a private segment of a C struct, but I rarely see it used in real-world libraries. Few C authors seem to think that there’s serious benefit to doing so.
Here’s a sample of the much more common means of putting a private element within a public struct:
typedef struct {
int pub_a, pub_b;
int private_a, private_b; //Private: please do not use these.
} public_s;
That is, document when something should not be used, and trust your users to not cheat. If your colleagues won’t follow an instruction as simple as this, then chain the coffee maker to the wall, because you’ve got problems bigger than a compiler can solve.
Functions are especially easy to make private: don’t put their
declaration in a header. Optionally, put the static keyword in front of the definition so
that readers know that the function is private.
My impression is that most folks think of integer division—that
3/2==1—as an annoyance. If I type in
3/2, I expect 1.5, darn it, not 1.
Indeed, this is an annoying gotcha in C and other integer-arithmetic languages, and more broadly, it shows us the dangers of operator overloading, which is when an operator, such as /, does something different depending on the types involved. For two integer types, the slash effectively does a divide-and-truncate operation; for anything else, it performs the usual division.
Recall the rule from Pointers Without malloc that
things that behave differently should look different. That’s the failure
of 3/2: integer division and
floating-point division behave differently, but look identical.
Confusion and bugs ensue.
Human language is redundant, which is a good thing, partly because it allows error correction. When Nina Simone says “ne me quitte pas” (which would translate word-for-word as “don’t leave me no”), it’s OK if you space out at the beginning, because “… me quitte pas” has the pas to indicate negation, and it’s OK if you space out at the end, because “ne me quitte …” has the ne to indicate negation.
Grammatical gender typically doesn’t have much real-world meaning, and sometimes objects will change depending on word choice. My favorite example is in Spanish, where el pene and la polla both refer to the same object, but the first is masculine and the second feminine. The real value to the gender is that it provides redundancy, forcing parts of the sentence to match, and thus adding clarity.
Programming languages avoid redundancy. We express negation
exactly once, typically with only one character (!). But programming languages do have genders,
where they’re called types. Generally, your verbs and your nouns need to
agree in type (as in Arabic, Hebrew, and Russian, among other
languages). With this added redundancy, you’d need matrix_multiply(a, b) when you have two
matrices, and complex_multiply(a, b)
when you have two complex numbers.
Operator overloading is about eliminating redundancy, writing
a * b whether you have a pair of
matrices, complex numbers, natural numbers, or sets. Here’s a snippet
from an excellent essay on the cost of that reduced redundancy: “When
you see the code i = j * 5; in C you
know, at least, that j is being
multiplied by five and the results stored in i. But if you see that same snippet of code in
C++, you don’t know anything. Nothing.”[19] The problem is that you don’t know what * means until you look up the type for
j, look through the inheritance tree
for j’s type to determine
which version of * you mean, and then you can start over with
identifying i and how that relates to
=, given the type of j.
So there’s the trade-off to operator overloading: you’ve saved
space on the page, and didn’t have to type much of anything, but have
lost all redundant hints and checks that b is actually a list and not the vector you
thought it was.
The C custom is to closely follow the sort of gender-agreement rules just described, for example:
//add two vectors in the GNU Scientific Library
gsl_vector *v1, *v2;
gsl_vector_add(v1, v2);
//Open a GLib I/O channel for reading at a given filename.
GError *e;
GIOChannel *f = g_io_channel_new_file("indata.csv", "r", &e);
It’s more typing, and when you have 10 lines acting on the same structure, things start to look repetitive. Later, we’ll have some means of slightly reducing this redundancy.
C provides limited overloading support via the C11 _Generic keyword. The keyword evaluates to a
value based on the type of its input, which lets you write macros that
consolidate some types together.
We need type-generic functions when we have a proliferation of types. Some systems provide a voluminous number of precise types, but every new type is another moving part that we have to support. For example, the GNU Scientific Library provides a complex number type, a complex vector type, and a vector type—and then there’s the C complex type. One could reasonably multiply any of those four types together, which theoretically means we need sixteen functions. Example 11-1 lists several of these functions; if you are not a complex vector aficionado, it would be entirely reasonable to recognize this example as a hairy mess and move on to the part where we clean it up.
#include "cplx.h" //gsl_cplx_from_c99; see below.
#include <gsl/gsl_blas.h> //gsl_blas_ddot
#include <gsl/gsl_complex_math.h> //gsl_complex_mul(_real)
gsl_vector_complex *cvec_dot_gslcplx(gsl_vector_complex *v, gsl_complex x){
gsl_vector_complex *out = gsl_vector_complex_alloc(v->size);
for (int i=0; i< v->size; i++)
gsl_vector_complex_set(out, i,
gsl_complex_mul(x, gsl_vector_complex_get(v, i)));
return out;
}
gsl_vector_complex *vec_dot_gslcplx(gsl_vector *v, gsl_complex x){
gsl_vector_complex *out = gsl_vector_complex_alloc(v->size);
for (int i=0; i< v->size; i++)
gsl_vector_complex_set(out, i,
gsl_complex_mul_real(x, gsl_vector_get(v, i)));
return out;
}
gsl_vector_complex *cvec_dot_c(gsl_vector_complex *v, complex double x){
return cvec_dot_gslcplx(v, gsl_cplx_from_c99(x));
}
gsl_vector_complex *vec_dot_c(gsl_vector *v, complex double x){
return vec_dot_gslcplx(v, gsl_cplx_from_c99(x));
}
complex double ddot (complex double x, complex double y){return x*y;}
void gsl_vector_complex_print(gsl_vector_complex *v){
for (int i=0; i< v->size; i++) {
gsl_complex x = gsl_vector_complex_get(v, i);
printf("%4g+%4gi%c", GSL_REAL(x), GSL_IMAG(x), i < v->size-1 ? '\t' : '\n');
}
}
The cleanup happens in the header, Example 11-2. It
uses _Generic to select one of the functions from
Example 11-1 based on the input types. The first
argument (the ‟controlling expression”) is not evaluated, but is
simply checked for its type, and the value of the
_Generic statement is selected based on that type.
We want to select a function based on two types, so the first macro
picks which of the second or third macros to use.
Example 11-2. _Generic to provide a simple front-end to the mess (cplx.h)
#include <complex.h> //nice names for C’s complex types
#include <gsl/gsl_vector.h> //gsl_vector_complex
gsl_vector_complex *cvec_dot_gslcplx(gsl_vector_complex *v, gsl_complex x);
gsl_vector_complex *vec_dot_gslcplx(gsl_vector *v, gsl_complex x);
gsl_vector_complex *cvec_dot_c(gsl_vector_complex *v, complex double x);
gsl_vector_complex *vec_dot_c(gsl_vector *v, complex double x);
void gsl_vector_complex_print(gsl_vector_complex *v);
#define gsl_cplx_from_c99(x) (gsl_complex){.dat= {creal(x), cimag(x)}}
complex double ddot (complex double x, complex double y);
#define dot(x,y) _Generic((x), \
gsl_vector*: dot_given_vec(y), \
gsl_vector_complex*: dot_given_cplx_vec(y), \
default: ddot)((x),(y))
#define dot_given_vec(y) _Generic((y), \
gsl_complex: vec_dot_gslcplx, \
default: vec_dot_c)
#define dot_given_cplx_vec(y) _Generic((y), \
gsl_complex: cvec_dot_gslcplx, \
default: cvec_dot_c)
gsl_complex and C99 complex double are both two-element arrays consisting of a real double followed by an
imaginary double [see the GSL manual and C99 & C11
§6.2.5(13)]. All we have to do is build the appropriate struct—and
a compound literal is the perfect way to build a struct on the
fly.
The first use of x is not actually
evaluated, just checked for its type. That means that a call like
dot(x++, y) would increment
x only once.
In Example 11-3, life is (mostly) easy again: we
can use dot to find the product of a
gsl_vector times a gsl_complex,
a gsl_vector_complex times a C complex, and so on
for a great many combinations. Of course, you still need to know the
output type, because the return value of a scalar times a scalar is a
scalar, not a vector, so the use of the output depends on the input
types. The proliferation of types is a fundamental problem, but the
_Generic facility at least provides a
band-aid.
Example 11-3. dot (almost) regardless of input types (simple_cplx.c)
#include <stdio.h>
#include "cplx.h"
int main(){
int complex a = 1+2I;
complex double b = 2+I;
gsl_complex c = gsl_cplx_from_c99(a);
gsl_vector *v = gsl_vector_alloc(8);
for (int i=0; i< v->size; i++) gsl_vector_set(v, i, i/8.);
complex double adotb = dot(a, b);
printf("(1+2i) dot (2+i): %g + %gi\n", creal(adotb), cimag(adotb));
printf("v dot 2:\n");
double d = 2;
gsl_vector_complex_print(dot(v, d));
printf("v dot (1+2i):\n");
gsl_vector_complex *vc = dot(v, a);
gsl_vector_complex_print(vc);
printf("v dot (1+2i) again:\n");
gsl_vector_complex_print(dot(v, c));
}
Declarations with complex are a bit like
declarations with const: both complex
int and int complex are
valid.
Finally, the payoff: this function will use the
dot function four times, each with different
input types.
Here are the C-native means of getting the real and imaginary parts of a complex number.
Here’s my own rule of thumb for overloading, via _Generic or whatever other means:
if users have no idea what the input type is, will they
still get the right answer? Observe that the overloading of
absolute value for int, float, and double work just fine with this
rule. Following the example, it might also make sense to overload
functions using the
gsl_complex and
the native C complex double.
Moving on from the syntactic tricks, let us get to the core problem of structuring our data.
In 1936, in response to a formal mathematical question (the Entscheidungsproblem), Alonzo Church developed the lambda calculus, a formal means of describing functions and variables. In 1937, in response to the same question, Alan Turing described a formal language in the form of a machine with a tape holding data and a head that can be shifted along the tape to read and write to the tape. Later, Church’s lambda calculus and Turing’s machine were shown to be equivalent: any calculation you could express in one, you could express in the other. It’s been the same divide ever since, and Church’s and Turing’s constructions continue to be the root of how we structure our data.
The lambda calculus relies heavily on named lists; in lambda-inspired pseudocode, we might express a person’s information as:
(person (
(name "Sinead")
(age 28)
(height 173)
))
With Turing’s machine, we would have a block of the tape set aside for the structure. The first few blocks would be a name, the next few would hold the age, and so on. Almost a century later, Turing’s tape is still a not-bad description of computer memory: recall from All the Pointer Arithmetic You Need to Know that this base-plus-offset form is exactly how C treats structures. We would write
typedef struct {
char * name;
double age, height;
} person;
person sinead = {.name="Sinead", .age=28, .height=173};
and sinead would name a block
of memory, and sinead.height would
point to the tape immediately after name and age
(and after any padding for alignment purposes).
Here are some differences between the list approach and the block-of-memory approach:
Telling the computer to go to a certain offset from a
certain address is still among the fastest operations a machine can
execute. Your C compiler even does the translation from labels to
offsets during compile time. Conversely, finding something in the list
requires a lookup: given the label "age", which element in the list corresponds
and where is its data in memory? Every system has techniques to make
this as fast a lookup as possible, but a lookup will always be more
work than a simple base-plus-offset.
Adding a new element to a list is a much easier process than adding to a struct, which is basically fixed at compile time.
A C compiler can tell you at compile time that hieght is a typo, because it can look in the
struct’s definition and see that there is no such element. Because a
list is extensible, we won’t know that there is no hieght element until the program runs and checks on the list.
Those last two items demonstrate a direct tension: we want extensibility, wherein we can add elements to a structure; we want registration, wherein things that are not in the structure are flagged as errors. That’s a balance that has to be struck, and everybody implements controlled extension of an existing list differently.
C++, Java, and their siblings have a syntax for producing a new
type that is an instance of the type to be extended, but that inherits
the old type’s elements. You still get base-plus-offset speed, and
compile-time checking, but at the price of voluminous paperwork; where C
has struct and its absurdly simple
scoping rules from Scope, Java has implements, extends, final, instanceof, class, this, interface, private, public, protected.
Perl, Python, and many LISP-inspired languages are based on named lists, so that is a natural means of implementing a structure. Extend the list by just adding elements to it. Pros: fully extensible by just adding a new named item. Cons: as previously, we don’t get registration, and although you can improve the name search via various tricks, you’re a long way from the speed of a single base-plus-offset step. Many languages in this family have a class definition system, so that you can register a certain set of list items and thus check whether future uses conform to the definition, which, when done right, provides a nice compromise between checking and ease of extension.
Getting back to plain old C, its structs are the fastest way to access a structure’s elements, and we get compile-time checking at the price of runtime extensibility. If you want a flexible list that can grow as the runtime need arises, you will need a list structure, such as the GLib’s hashes, or the sample dictionary described later.
All the machinery you have in C for extending a structure is to wrap it in another structure. Say that we have a type defined via a form such as:
typedef struct {
...
} list_element_s;
which is already packaged and cannot be changed, but we’d like to add a type marker. Then we’ll need a new structure:
typedef struct {
list_element_s elmt;
char typemarker;
} list_element_w_type_s;
Pros: this is so stupid easy, and you still get the speed bonus.
Cons: Now, every time you want to refer to the name of the element,
you’ll need to write out the full path, your_typed_list->elmt.name, instead of
what you’d get via a C++/Java-like extension: your_typed_list->name. Add a few layers to
this and it starts to get annoying. You already saw in Pointers Without malloc how using aliases can help here.
C11 allows us to include anonymous elements of a structure,
which make structs within structs easier to use. Although this got
added to the standard in December of 2011, it was a Microsoft
extension for a long time before then, and gcc already allows it given the -fms-extensions flag on the command line.
The syntax: include another struct type somewhere in the
declaration of the new structure, as per the point struct in Example 11-4, without a name for the element. Example 11-4 uses a bare structure type name, struct point, whereas a named declaration
would be something like struct
point ptelement. All of the
elements of the referred-to structure are included in the new
structure as if they were declared in place.
Example 11-4 extends a 2D point into a 3D
point. So far, it is only notable because the threepoint struct extends the point seamlessly, to the point where users
of the threepoint won’t even know
that its definition is based on another struct.
#include <stdio.h>
#include <math.h>
typedef struct point {
double x, y;
} point;
typedef struct {
struct point;
double z;
} threepoint;
double threelength (threepoint p){
return sqrt(p.x*p.x + p.y*p.y + p.z*p.z);
}
int main(){
threepoint p = {.x=3, .y=0, .z=4};
printf("p is %g units from the origin\n", threelength(p));
}
This is anonymous. The not-anonymous version would also have
a name like struct point
twopt.
The x and y elements of the point structure look and behave exactly
like the additional z element
of the threepoint.
Even the declaration gives no hint that x and y were inherited from an existing
structure.
It’s not in the standard to use a typedef in the nested
anonymous declaration. The standards committee explicitly rejected
this, thus going out of its way to produce one of the few points
in the C language where a typedef for a struct cannot substitute for
the struct definition.[20] Though, if you use the naming convention from Typedefs Save the Day, this just means that you need to put the word
struct in front of the name of
the structure type.
Now for the trick that really makes this useful. The original
object, the point, was probably
accompanied by several interface functions that are still potentially
useful, like a length function
measuring the distance between zero and the given point. How are we
going to use that function, now that we don’t have a name for that
subpart of the larger structure?
The solution is to use an anonymous union of a named point and an unnamed point. Being the union of two identical
structures, the two structures share absolutely everything, and the
only distinction is in the naming: use the named version when you need
to call functions that use the original struct as an input, and use
the anonymous version for seamless merging into the larger struct.
Example 11-5 rewrites Example 11-4
using this trick.
#include <stdio.h>
#include <math.h>
typedef struct point {
double x, y;
} point;
typedef struct {
union {
struct point;
point p2;
};
double z;
} threepoint;
double length (point p){
return sqrt(p.x*p.x + p.y*p.y);
}
double threelength (threepoint p){
return sqrt(p.x*p.x + p.y*p.y + p.z*p.z);
}
int main(){
threepoint p = {.x=3, .y=0, .z=4};
printf("p is %g units from the origin\n", threelength(p));
double xylength = length(p.p2);
printf("Its projection onto the XY plane is %g units from the origin\n", xylength);
}
This is an anonymous structure.
This is a named structure. Being part of a union, it is identical to the anonymous structure, differing only in having a name.
The point structure is
still seamlessly included in the threepoint structure, but...
...the p2 element is a
named element like it always was, so we can use it to call the
interface functions written around the original struct.
After the declaration threepoint
p, we can refer to the X coordinate via
p.x (because of the anonymous
struct) or via p.p2.x (because of
the named struct). The last line of the example shows the length when
projecting onto the XY plane, and does so using
length(p.p2).
From here, the possibilities span ℝ³. You can extend any structure and still use all of the functions associated with the original.
Did you notice this is the first time I’ve used the union keyword in this book? Unions are
another one of those things where the explanation takes a
paragraph—it’s like a struct, but
all of the elements occupy the same space—and then the caveats about
how to not hang yourself take up several pages. Memory is cheap, and
for writing applications, we don’t have to care about memory
alignment, so sticking to structs will reduce the possibility of
errors, even when only one element is used at a time.
Inheriting from multiple structures with this method is risky.
Pick one structure to be the base of the extension using the union
trick, and let the others be extended via the plain vanilla
substructure. For example, the GNU Scientific Library has matrix and
vector types (where struct
_gsl_matrix is typedeffed as gsl_matrix). Let us say that we want to put
both into a single structure:
typedef struct { //Alas, this will fail.
    struct _gsl_vector;
    struct _gsl_matrix;
} data_set;
data_set d;
This looks innocuous, until you find out that the _gsl_vector and the _gsl_matrix both have an element named
data. When we refer to d.data, are we referring to the data element of the matrix or the vector? We
have no syntax for selective inclusion or renaming struct elements, so
the best we can do is pick the matrix or vector as primary and the
other as secondary, callable only by its subelement name:
typedef struct{ //A vector with supporting matrix.
struct _gsl_vector; //anonymous and seamless.
struct _gsl_matrix matrix; //named
} data_set;
data_set d;
Earlier, I mentioned differences between a C struct and a
LISP-like dictionary of named elements. But a dictionary is an easy
structure to generate, given what we have in struct-based C. Doing so is
a fine chance to try building some objects. Please note, however, that
fleshing this out and making it bulletproof has already been done by
other authors; see the GLib’s keyed data tables or GHashTable, for example. The point here is
simply that having compound structs plus simple arrays equals a short
hop to a dictionary object.
We’re going to start with a simple key/value pair. Its mechanism will be in keyval.c. The header in Example 11-6 lists the structure and its interface functions.
typedef struct keyval{
    char *key;
    void *value;
} keyval;

keyval *keyval_new(char *key, void *value);
keyval *keyval_copy(keyval const *in);
void keyval_free(keyval *in);
int keyval_matches(keyval const *in, char const *key);
Those of you with experience in traditional object-oriented
programming languages will find this to be very familiar. The first
instinct when establishing a new object is to write down new/copy/free
functions, and that is what the example does. After that, there are
typically a few structure-specific functions, such as the keyval_matches function to check whether the
key in a keyval pair matches the
input string.
Having new/copy/free functions means that your memory-management worries are brief: in the new and copy functions, allocate the memory with malloc; in the free function, remove the structure with free. Having set up these functions, code that uses the object never calls malloc or free on it directly, but trusts that keyval_new, keyval_copy, and keyval_free do all the memory management correctly.
#include <stdlib.h> //malloc
#include <strings.h> //strcasecmp
#include "keyval.h"
keyval *keyval_new(char *key, void *value){
keyval *out = malloc(sizeof(keyval));
*out = (keyval){.key = key, .value=value};
return out;
}
/** Copy a key/value pair. The new pair has pointers to
the values in the old pair, not copies of their data. */
keyval *keyval_copy(keyval const *in){
keyval *out = malloc(sizeof(keyval));
*out = *in;
return out;
}
void keyval_free(keyval *in){ free(in); }
int keyval_matches(keyval const *in, char const *key){
return !strcasecmp(in->key, key);
}
Now that we have an object representing a single key/value pair, we can move on to establishing a dictionary as a list of these. Example 11-8 provides the header.
#include "keyval.h"

extern void *dictionary_not_found;

typedef struct dictionary{
    keyval **pairs;
    int length;
} dictionary;

dictionary *dictionary_new (void);
dictionary *dictionary_copy(dictionary *in);
void dictionary_free(dictionary *in);
void dictionary_add(dictionary *in, char *key, void *value);
void *dictionary_find(dictionary const *in, char const *key);
As you can see, you get the same new/copy/free functions, plus a few other dictionary-specific functions, and a marker to be described later. Example 11-9 provides the private implementation.
#include <stdio.h>
#include <stdlib.h>
#include "dict.h"
void *dictionary_not_found;
dictionary *dictionary_new (void){
static int dnf;
if (!dictionary_not_found) dictionary_not_found = &dnf;
dictionary *out= malloc(sizeof(dictionary));
*out= (dictionary){0};
return out;
}
static void dictionary_add_keyval(dictionary *in, keyval *kv){
in->length++;
in->pairs = realloc(in->pairs, sizeof(keyval*)*in->length);
in->pairs[in->length-1] = kv;
}
void dictionary_add(dictionary *in, char *key, void *value){
if (!key){fprintf(stderr, "NULL is not a valid key.\n"); abort();}
dictionary_add_keyval(in, keyval_new(key, value));
}
void *dictionary_find(dictionary const *in, char const *key){
for (int i=0; i< in->length; i++)
if (keyval_matches(in->pairs[i], key))
return in->pairs[i]->value;
return dictionary_not_found;
}
dictionary *dictionary_copy(dictionary *in){
dictionary *out = dictionary_new();
for (int i=0; i< in->length; i++)
dictionary_add_keyval(out, keyval_copy(in->pairs[i]));
return out;
}
void dictionary_free(dictionary *in){
for (int i=0; i< in->length; i++)
keyval_free(in->pairs[i]);
free(in);
}
It is reasonable to have a NULL value in the key/value table, so we
need a unique marker to indicate a missing value.
Recall that a function marked as static cannot be used outside the file, so
this is one more reminder that the function is private to the
implementation.
I cheated again: using abort like this is bad form. It would be
better to use a macro like the one in stopif.h
(Example 10-2). I did it this way
to demonstrate a feature of the test harness in Unit Testing.
Now that we have a dictionary, Example 11-10 can use it without thinking about memory management, which the new/copy/free/add functions take care of, and without making reference to key/value pairs, because that is one level too low for our purposes.
#include <stdio.h>
#include "dict.h"

int main(){
    int zero = 0;
    float one = 1.0;
    char two[] = "two";
    dictionary *d = dictionary_new();
    dictionary_add(d, "an int", &zero);
    dictionary_add(d, "a float", &one);
    dictionary_add(d, "a string", &two);
    printf("The integer I recorded was: %i\n", *(int*)dictionary_find(d, "an int"));
    printf("The string was: %s\n", (char*)dictionary_find(d, "a string"));
    dictionary_free(d);
}
So writing a struct and its new/copy/free and other associated functions was enough to give us the right level of encapsulation: the dictionary didn’t have to care about the internals of the key/value pair, and the application didn’t have to worry about dictionary internals.
The boilerplate code is not as bad as it is in some languages, but there is certainly some repetition to the new/copy/free functions. And as the examples continue, you’ll see this boilerplate several times more.
At some point, I even wrote myself macros to autogenerate these. For example, the copy functions differ only in dealing with internal pointers, so we could write a macro to automate all the boilerplate not about internal pointers:
#define def_object_copy(tname, ...) \
void * tname##_copy(tname *in) { \
tname *out = malloc(sizeof(tname)); \
*out = *in; \
__VA_ARGS__; \
return out; \
}
def_object_copy(keyval) //Expands to the previous declaration of keyval_copy.

But the redundancy is nothing to worry about all that much. Despite our mathematical æsthetic of minimizing repetition and text on the page, sometimes having more code really does make the program more readable and robust.
Why did I base all of these things on pointers to data structures, instead of just passing around data structures directly? If you use a plain struct, the new/copy/free functions write themselves:

new
Use designated initializers on the first line where you need a struct. As an added plus, structures can be declared at compile time, so they are immediately available to users without an initial call to a setup function.

copy
The equals sign does this.

free
Don’t bother; it’ll go out of scope soon enough.
So we’re making things more difficult for ourselves with pointers. Yet from what I’ve seen, there’s consensus on using pointers to objects as the base of our designs.
Pros to using pointers:
Copying a single pointer is cheaper than copying a full structure, so you save a microsecond on every function call with a struct as an input. Of course, this only adds up after a few billion function calls.
Data structure libraries (your trees and linked lists, for example) are all written around hooks for a pointer.
Now that you’re filling a tree or a list, having the system automatically free the struct at the end of the scope in which it was created might not be what you want.
Many of your functions that take in a struct will modify the struct’s contents, meaning that you’ve got to pass a pointer to the struct anyway. Having some functions that take in the struct itself and some that take in a pointer to struct is confusing (I have written an interface like this and I regret it), so you might as well just send a pointer every time.
If the contents of the struct include a pointer to data elsewhere, then the convenience bonus from using a plain struct evaporates anyway: if you want a deep copy (wherein the data pointed to is copied, not just the pointer) then you need a copy function, and you will probably want a free function to make sure the internal data is eliminated.
There’s no one-size-fits-all set of rules for using structs. As your project gets bigger, and a throwaway struct grows into a core of how your data is organized, the benefits to pointers wax and the benefits to nonpointers wane.
A struct can include pointers to functions among its member elements as easily as it can hold typical variables.
Given a struct object_s and a function
fn that is somehow related to it, we can go the
usual route of leaving the function outside of the structure but give it a
name that indicates a relationship,
object_fn(...), or we could let
fn be an element of an
object_s, so that when we declare
object_s xo, we would call the function via
xo.fn(...).
This is, for the most part, a stylistic choice, affecting how we
look up functions in the documentation and how the code looks on the page.
The documentation issue, by the way, is why I prefer the
object_fn naming scheme over the
fn_object scheme: with the first form, the
documentation’s index lists all of object_s’s
associated functions in one place.
The real advantage of the element-of-struct form is that you can
more easily change the function associated with every instance of the
object: given the declaration object_s xo, yo and a function element of the struct
named add, then xo.add and yo.add could be entirely different functions. On
the plus side, you can send xo and
yo to a routine with a header do_math(object_s
in), and it can call in.add somewhere in there without any regard to
what add actually does. On the minus
side, we run the risk of once again breaking the rule that things that do
different things should look different: if the function xo.add has subtly different side effects from
yo.add, you have no warnings.
This is why I generally prefer the
object_fn form for functions related to
objects. If there are two or three similar but distinct types of operation
on the same object, I can give them different names (like dictionary_add and dictionary_add_keyval in Example 11-9). I reserve the
xo.fn form for when I expect that almost every
instance of object_s will have a different
version of fn.
It’s hard to come by situations where every instance of an object
has different methods associated, which is the point where having methods
inside of an object really starts to make a difference. I’ve had the good
fortune of running across such a situation, when writing a library of
statistical models, named Apophenia (see The GNU Scientific Library). This segment will present a
model_s object along similar lines, and
demonstrate the details of hooking different procedures to every object.
I’ve written this section so that if you read all of the mathematics and
statistical jargon as blahblahblah, then you should
be able to follow along just fine.
Broadly, statistical models are all the same: they have some procedure, like the equation for a bell curve or a story about a line of best fit, and some parameters that fine-tune the procedure to the data. For example, we might specify that the data is from a bell-curve Normal distribution, whose tuning parameters are a mean (μ) and a standard deviation (σ). For the input data [1, 2, 3, 4, 5], the mean is three (because it’s symmetric around three), and if you do the math, you’ll find that the sample σ is about 1.58. The black box representing the estimation procedure would take in that data and spit out (μ, σ) = (3, 1.58). If I give the black box [100, 100.2, 100.8, 100.7, 100.4], then it will spit out a larger μ and smaller σ: (μ, σ) = (100.42, 0.33).
You might be familiar with long tail distributions, like how a handful of books sell millions of copies but thousands of books only sell a few dozen copies. This is hardly a bell curve where everything drifts around an average sales figure, and the calculations for estimating the parameter for a Zipf distribution (one of the more common formalizations of the long tail idea) have nothing to do with the math for a bell curve. The black box has a similar form, though: give it [1, 2, 3, 4, 5], and it spits out β=1.7; give it [100, 100.2, 100.8, 100.7, 100.4] and it spits out β=1.2.
So we’ve hit heterogeneity: the model struct should have elements
named parameters and data, and there should be an estimate function to go from data to parameters,
but it will be a different function for every model.
The implementation of this is not much of a stretch from the typical
structure implementation. Assume a typedef for data_s appropriate to describe a data set; then
we have enough to declare Normal and Zipf objects. Here is a mock-up of
the key parts of the process:
typedef model_s * (*data_to_estimated_model)(model_s *, data_s *);

typedef struct model_s {
    data_s *parameters, *data;
    data_to_estimated_model estimate;
} model_s;

//Takes in a copy of the input model with NULL parameters.
//Returns a copy of the input model with the parameters calculated.
static model_s *normal_est(model_s *in, data_s *d){
    model_s *outmodel = model_copy(in);
    //math goes here to calculate outmodel->parameters;
    return outmodel;
}

static model_s *zipf_est(model_s *in, data_s *d){
    model_s *outmodel = model_copy(in);
    //entirely different math goes here to set outmodel->parameters;
    return outmodel;
}

model_s *normal_model = &(model_s){.estimate = normal_est};
model_s *zipf_model = &(model_s){.estimate = zipf_est};

data_s *data = text_to_data("infile");
model_s *norm_estimated = normal_model->estimate(normal_model, data);
model_s *zipf_estimated = zipf_model->estimate(zipf_model, data);
As per Typedef as a teaching tool, a good typedef makes life with function pointers much more pleasant.
We can put a pointer to a function into a struct as easily as we include any other pointer. Now all that’s left to do is establish what function any given function pointer will point to.
I assume that model_copy is defined
elsewhere. The copy function is the same for every model, so there’s
no major benefit to putting it inside the structure.
Here, two model_s structures are initialized,
via designated initializers. The estimate pointers
in the two models each point to a different function.
An imaginary sample use of the two models to produce versions with parameters estimated.
By the last two lines, we are on our way to having a uniform
interface to entirely distinct functions. You could picture a function
that takes in data and a model_s, names
it m, and calls m.estimate(m, data).
Note the use of the static
keyword, which indicates that outside of this file, no code will be able
to call normal_est or zipf_est by those names. They will, however, be
able to use the names normal_model->estimate and zipf_model->estimate to call them.
There are a few bells and whistles that we’d like to add. First,
zipf_model.estimate(zipf_model, data)
is a redundant form, with a repetition of zipf_model. It would be nice to be able to write
zipf_model.estimate(data) and let the
system just know that the first argument should be the object making the
call. The function might see a special variable named this or self,
or we could add a special-case rule that
object.fn(x) gets reshuffled to
fn(object, x).
Sorry, but it’s not going to happen in C.
C doesn’t define magic variables for you, and it is always honest
and transparent about what parameters get sent in to a function. Normally,
if we want to futz around with the parameters of a function, we do it with
the preprocessor, which will gladly rewrite f(anything) to f(anything else). However, all of the
transformations happen to what goes on inside of the parens. There’s no
way to get the preprocessor to transform the text s.prob(d) to s.prob(s,
d). If you don’t want to slavishly imitate C++-type syntax, you
can write macros like:
#define Estimate(in, ...) (in).estimate((in), __VA_ARGS__)
#define Copy(in, ...) (in).copy((in), __VA_ARGS__)
#define Free(in, ...) (in).free((in), __VA_ARGS__)
But now you’ve cluttered up the global namespace with the Estimate, Copy, and Free symbols. Maybe it’s worth it to you (especially given that every object should have associated copy and free functions).
You could keep the namespace organized and prevent name collisions by naming your macros appropriately:
#define Model_estimate(in, ...) (in).estimate((in), __VA_ARGS__)
#define Model_copy(in, ...) (in).copy((in), __VA_ARGS__)
My preferred alternative is a dispatch function, a thin wrapper
function that solves this redundancy annoyance and provides one more
benefit over the simple macros shown: defaults. Let us say that we have no
estimation routine for a given model. Given a log likelihood function we
could use maximum likelihood techniques to estimate the parameters (that
is, blahblahblah). The default function would look
much like the specific estimation routines earlier. Here is a mock-up,
making use of a presumed log_likelihood
method added to the struct:
model_s *default_estimation(model_s *in, data_s *d){
    model_s *outmodel = model_copy(in);
    //math making use of the additional in->log_likelihood element here
    return outmodel;
}

Now for the dispatch function:
model_s * model_estimate(model_s *p, data_s *d){
//Place error checking here.
if (p->estimate) return p->estimate(p, d);
else return default_estimation(p, d);
}

Usage:
model_s normal_model = {.estimate = normal_est};
model_s ad_hoc_model = {.log_likelihood = ad_hoc_ll};

model_s *norm_est = model_estimate(&normal_model, data);
model_s *adhoc_est = model_estimate(&ad_hoc_model, data);
We have achieved a homogeneous form out of still more heterogeneous
parts—a model structure’s estimate function could be NULL and we can still call the same model_estimate function,
though the default will need to have a .log_likelihood element.
So dispatch functions gave us default routines, solved the annoyance
of not having a magic this or self variable, and did so in a manner that looks
similar to the usual interface functions like model_copy or model_free.
There are other ways to do it. Earlier, I used designated
initializers to set up the functions, so unspecified elements are NULL and a dispatch function makes sense. If we
required that users always use a model_new function, then we could set the
default functions there. Then eliminating the redundancy of
mymodel.estimate(mymodel,
data) can be done by a simple macro, as
previously.
Once again, you’ve got options. We already have more than enough syntactic tools to uniformly call diverse functions for diverse objects. That just leaves the hard part of writing those diverse functions so that calling them in a uniform manner always behaves as expected.
The remainder of this chapter presents a few more examples of building objects, and how we can modify the boilerplate new/copy/free functions to handle nontrivial situations. As already noted, the examples are going to be more involved, taking into account more real-world considerations and doing something interesting.
The first example presents a small library that has one structure to speak of, which is intended to read an entire file into a single string. Having all of Moby Dick in a single string in memory is not a big deal at all, but having a thousand copies of it floating around starts to be wasteful. So instead of copying the potentially very long data string, we’ll have views that just mark different start and end points.
Now that we have several views of the string, we need to free the string exactly once, when the string no longer has any views attached. Thanks to the object framework, it’s easy to make this happen.
The second example, an agent-based microsimulation of group formation, has a similar problem: the groups should exist as long as they have members, and need to be freed if and when the last member leaves.
The trick to having a lot of objects pointing to the same string is to add a reference count element to the structure. Modify the four boilerplate elements as follows:
The type definition includes a pointer-to-integer named
refs. It will be set up only once
(via the new function), and all
copies (made via the copy
function) will share the string and this reference counter.
The new function sets up
the refs pointer and sets
*refs = 1.
The copy function copies
the original struct into the output copy and increments the
reference count.
The free function
decrements the reference count and, if it has hit zero, frees the
shared string.
Example 11-11 provides the header for the string manipulation example, fstr.h, which introduces the key structure representing a segment of a string and an auxiliary structure representing a list of these string segments.
#include <stdio.h>
#include <stdlib.h>
#include <glib.h>
typedef struct {
char *data;
size_t start, end;
int* refs;
} fstr_s;
fstr_s *fstr_new(char const *filename);
fstr_s *fstr_copy(fstr_s const *in, size_t start, size_t len);
void fstr_show(fstr_s const *fstr);
void fstr_free(fstr_s *in);
typedef struct {
fstr_s **strings;
int count;
} fstr_list;
fstr_list fstr_split (fstr_s const *in, gchar const *start_pattern);
void fstr_list_free(fstr_list in);
I hope these headers are starting to get boring for you. It’s just the same typedef/new/copy/free over and over again.
The fstr_list struct had
originally been intended to be a throwaway, not quite a full object,
but I found myself using it to organize much of the code. Such
accidental falling upon structure is a good thing, and we should
encourage it. Notice that the fstr_split function returns the list, not
a pointer to the list.
Example 11-12 shows the library,
fstr.c. It uses GLib to read in the text file and
for Perl-compatible regular expression parsing. The numbered callouts
focus on the steps at the head of this section, so you can follow them
to trace the use of the refs element
to implement reference counting.
#include "fstr.h"
#include "string_utilities.h"
fstr_s *fstr_new(char const *filename){
fstr_s *out = malloc(sizeof(fstr_s));
*out = (fstr_s){.start=0, .refs=malloc(sizeof(int))};
out->data = string_from_file(filename);
out->end = out->data ? strlen(out->data) : 0;
*out->refs = 1;
return out;
}
/** Make a new fstr_s that is a substring of the input fstr_s.
\param in The parent string.
\param start The offset in the original string where the substring starts.
\param len The length of the substring. If longer than the available text,
the substring will only go to the end of the parent string.
*/
fstr_s *fstr_copy(fstr_s const *in, size_t start, size_t len){
fstr_s *out = malloc(sizeof(fstr_s));
*out=*in;
out->start += start;
if (in->end > out->start + len)
out->end = out->start + len;
(*out->refs)++;
return out;
}
void fstr_free(fstr_s *in){
(*in->refs)--;
if (!*in->refs) {
free(in->data);
free(in->refs);
}
free(in);
}
/** Split an input string into a sequence of substrings
\param in The input string to split.
\param start_pattern The regex marking the beginning of a new substring.
\return A list of substrings.
*/
fstr_list fstr_split (fstr_s const *in, gchar const *start_pattern){
fstr_s **out=malloc(sizeof(fstr_s*));
int outlen = 1;
out[0] = fstr_copy(in, 0, in->end);
GRegex *start_regex = g_regex_new (start_pattern, 0, 0, NULL);
gint mstart=0, mend=0;
fstr_s *remaining = fstr_copy(in, 0, in->end);
do {
GMatchInfo *start_info;
g_regex_match(start_regex, &remaining->data[remaining->start],
0, &start_info);
g_match_info_fetch_pos(start_info, 0, &mstart, &mend);
g_match_info_free(start_info);
if (mend > 0 && mend < remaining->end - remaining->start){
out = realloc(out, ++outlen * sizeof(fstr_s*));
out[outlen-1] = fstr_copy(remaining, mend, remaining->end-mend);
out[outlen-2]->end = remaining->start + mstart;
remaining->start += mend;
} else break;
} while (1);
fstr_free(remaining);
g_regex_unref(start_regex);
return (fstr_list){.strings=out, .count=outlen};
}
void fstr_list_free(fstr_list in){
for (int i=0; i < in.count; i++) fstr_free(in.strings[i]);
free(in.strings);
}
void fstr_show(fstr_s const *fstr){
printf("%.*s", (int)fstr->end-fstr->start, &fstr->data[fstr->start]);
}
For a new fstr_s, the
reference count is set to one. Otherwise, this function is a
boilerplate new object function.
The copy function copies the fstr_s sent in, and sets the start and
endpoints to the substring given (making sure that the endpoint
doesn’t go past the endpoint of the input fstr_s).
Here’s where the reference count gets incremented.
Here’s where the reference count gets used, to determine whether the base data should be freed or not.
Else, no match or out of bounds.
And finally, an application. To make this work, you’ll need a copy of Moby Dick, or the Whale, by Herman Melville. If you don’t have a copy on your drive, try Example 11-13 to download one from Project Gutenberg.
if [ ! -e moby ]; then
  curl -A "Mozilla/4.0" http://www.gutenberg.org/cache/epub/2701/pg2701.txt \
  | sed -e '1,/START OF THIS PROJECT GUTENBERG/d'                           \
  | sed -e '/End of Project Gutenberg/,$d'                                  \
  > moby
fi
Now that you have a copy of the book, Example 11-15 splits it into chapters and uses the same
splitting function to count the uses of the words
whale(s) and I in each
chapter. Notice that the fstr structs
can be used as opaque objects at this point, using only the new, copy,
free, show, and split functions.
The program requires GLib, fstr.c, and the string utilities from earlier in the book, so the basic makefile is now as in Example 11-14.
P=cetology
CFLAGS=`pkg-config --cflags glib-2.0` -g -Wall -std=gnu99 -O3
LDLIBS=`pkg-config --libs glib-2.0`
objects=fstr.o string_utilities.o

$(P): $(objects)
#include "fstr.h"

int main(){
    fstr_s *fstr = fstr_new("moby");
    fstr_list chapters = fstr_split(fstr, "\nCHAPTER");
    for (int i=0; i< chapters.count; i++){
        fstr_list for_the_title = fstr_split(chapters.strings[i], "\\.");
        fstr_show(for_the_title.strings[1]);
        fstr_list me = fstr_split(chapters.strings[i], "\\WI\\W");
        fstr_list whales = fstr_split(chapters.strings[i], "whale(s|)");
        fstr_list words = fstr_split(chapters.strings[i], "\\W");
        printf("\nch %i, words: %i.\tIs: %i\twhales: %i\n", i,
                words.count-1, me.count-1, whales.count-1);
        fstr_list_free(for_the_title);
        fstr_list_free(me);
        fstr_list_free(whales);
        fstr_list_free(words);
    }
    fstr_list_free(chapters);
    fstr_free(fstr);
}
To give you incentive to try the program, I won’t reprint the results in detail. But I will give some notes, which generally point to how hard it would be for Mr. Melville to publish or even blog the book here in the modern day:
Chapter lengths range by an order of magnitude.
Whales don’t get discussed all that much until around Chapter 30.
The narrator decidedly has a voice. Even in the famed cetology chapter, he uses the first person singular 60 times, personalizing what would otherwise be an encyclopedia chapter.
GLib’s regex parser is a little slower than I’d hoped it’d be.
This example is an agent-based model of group membership. Agents are on a two-dimensional preference space (because we’ll plot the groups) in the square between (-1, -1) and (1, 1). At each round, each agent joins the group offering the best utility to that agent. An agent’s utility from a group is -(distance to the group’s mean position) + M*(number of members). The group’s mean position is the mean of the positions of the group’s members (excluding the agent querying the group), and M is a constant that scales how much the agents care about being in a large group relative to how much they care about the group’s mean position: if M is near zero, then the size of the group is basically irrelevant, and agents care only about proximity; as M goes to infinity, position becomes irrelevant, and only group size matters.
With some random odds, the agent will originate a new group. However, because agents are picking a new group every period, the agent may abandon that newly originated group in the next period.
The problem of reference counting is similar, and the process is roughly similar for this case:
The type definition includes an integer named counter.
The new function sets
counter = 1.
The copy function sets
counter++.
The free function queries if(--counter==0), and if yes, then
free all shared data; or else,
just leave everything as is, because we know there are still
references to the structure.
Again, as long as your changes to the structure are entirely via its interface functions, you don’t have to think about memory allocation when using the object at all.
The simulation takes almost 125 lines of code, and because I used CWEB to document it, the code files total almost double that length (where I gave some tips on reading and writing CWEB in Literate Code with CWEB). Given the literate coding style, this should be very readable; even if you’re in the habit of skipping big blocks of code, maybe give it a skim. If you have CWEB on hand, you can generate the PDF documentation and try reading it in that format.
The output from this program is intended to be piped to Gnuplot, a
plotting program that stands out for being easy to automate. Here is a
command-line script that uses a here document to pipe the given text to
Gnuplot, including a series of data points (with an e to mark the end of the series).
cat << "------" | gnuplot --persist
set xlabel "Year"
set ylabel "U.S. Presidential elections"
set yrange [0:5]
set key off
plot '-' with boxes
2000, 1
2001, 0
2002, 0
2003, 0
2004, 1
2005, 0
e
------
You can probably already picture producing commands to Gnuplot
programmatically, via a printf or two
for the plot settings, and a for loop
to output the data set. Further, sending a series of plots to Gnuplot
generates an animation sequence.
The simulation below produces an animation like this, so you can
run the simulation via ./groups |
gnuplot to display the animation on-screen. It’s hard to print
an animation, so you’ll have to run it yourself. You will see that, even
though such behavior was not programmed into the simulation, new groups
cause nearby groups to shift, producing an evenly-spaced, uniform
distribution of group positions. Political scientists have often
observed similar behavior in the space of political party positions:
when new parties enter, existing parties adjust their positions
accordingly.
Now for the header. What I call the join and exit functions might
more commonly be read as the copy and free functions. The group_s structure has a size element, which is the number of group
members—the reference count. You can see that I use Apophenia and GLib.
Notably, the groups are held in a linked list, private to the
groups.c file; maintaining that list will require
fully two lines of code, including a call to g_list_append and g_list_remove (Example 11-16).
The group_s object (groups.h):

#include <apop.h>
#include <glib.h>

typedef struct {
    gsl_vector *position;
    int id, size;
} group_s;

group_s *group_new(gsl_vector *position);
group_s *group_join(group_s *joinme, gsl_vector *position);
void group_exit(group_s *leaveme, gsl_vector *position);
group_s *group_closest(gsl_vector *position, double mb);
void print_groups();
Now for the file defining the details of the group object (shown in Example 11-17).
The group_s object (groups.w):

@ Here in the introductory material, we include the header and specify the global list of groups that the program makes use of. We'll need new/copy/free functions for each group.

@c
#include "groups.h"
GList *group_list;

@<new group@>
@<copy group@>
@<free group@>

@ The new group method is boilerplate: we |malloc| some space, fill the struct using designated initializers, and append the newly-formed group to the list.

@<new group@>=
group_s *group_new(gsl_vector *position){
    static int id=0;
    group_s *out = malloc(sizeof(group_s));
    *out = (group_s){.position=apop_vector_copy(position), .id=id++, .size=1};
    group_list = g_list_append(group_list, out);
    return out;
}

@ When an agent joins a group, the group is `copied' to the agent, but there isn't any memory being copied: the group is simply modified to accommodate the new person. We have to increment the reference count, which is easy enough, and then modify the mean position. If the mean position without the $n$th person is $P_{n-1}$, and the $n$th person is at position $p$, then the new mean position with the person, $P_n$, is the weighted sum
$$P_n = \left((n-1)P_{n-1}/n\right) + p/n.$$
We calculate that for each dimension.

@<copy group@>=
group_s *group_join(group_s *joinme, gsl_vector *position){
    int n = ++joinme->size; //increment the reference count
    for (int i=0; i< joinme->position->size; i++){
        joinme->position->data[i] *= (n-1.)/n;
        joinme->position->data[i] += position->data[i]/n;
    }
    return joinme;
}

@ The `free' function really only frees the group when the reference count is zero. When it isn't, then we need to run the data-augmenting formula for the mean in reverse to remove a person.

@<free group@>=
void group_exit(group_s *leaveme, gsl_vector *position){
    int n = leaveme->size--; //lower the reference count
    for (int i=0; i< leaveme->position->size; i++){
        leaveme->position->data[i] -= position->data[i]/n;
        leaveme->position->data[i] *= n/(n-1.);
    }
    if (leaveme->size == 0){ //garbage collect?
        gsl_vector_free(leaveme->position);
        group_list = g_list_remove(group_list, leaveme);
        free(leaveme);
    }
}

@ I played around a lot with different rules for how exactly people evaluate the distance to the groups. In the end, I wound up using the $L_3$ norm. The standard distance is the $L_2$ norm, aka Euclidian distance, meaning that the distance between $(x_1, y_1)$ and $(x_2, y_2)$ is $\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$. This is $L_3$, $\sqrt[3]{(x_1-x_2)^3+(y_1-y_2)^3}$. This and the call to |apop_vector_copy| above are the only calls to the Apophenia library; you could write around them if you don't have that library on hand.

@<distance@>=
apop_vector_distance(g->position, position, .metric='L', .norm=3)

@ By `closest', I mean the group that provides the greatest benefit, by having the smallest distance minus weighted size. Given the utility function represented by the |dist| line, this is just a simple |for| loop to find the smallest distance.

@c
group_s *group_closest(gsl_vector *position, double mass_benefit){
    group_s *fave = NULL;
    double smallest_dist = GSL_POSINF;
    for (GList *gl = group_list; gl != NULL; gl = gl->next){
        group_s *g = gl->data;
        double dist = @<distance@> - mass_benefit*g->size;
        if (dist < smallest_dist){
            smallest_dist = dist;
            fave = g;
        }
    }
    return fave;
}

@ Gnuplot is automation-friendly. Here we get an animated simulation with four lines of plotting code. The header |plot '-'| tells the system to plot the data to follow, then we print the $(X, Y)$ positions, one to a line. The final |e| indicates the end of the data set. The main program will set some initial Gnuplot settings.

@c
void print_groups(){
    printf("plot '-' with points pointtype 6\n");
    for (GList *gl = group_list; gl != NULL; gl = gl->next)
        apop_vector_print(((group_s*)gl->data)->position);
    printf("e\n");
}
Now that we have a group object and interface functions to add, join, and leave groups, the program file can focus on the simulation procedure: defining the array of persons followed by the main loop of rechecking memberships and printing out (Example 11-18).
The simulation procedure (groupabm.w):

@* Initializations.

@ This is the part of the agent-based model with the handlers for the |people| structures and the procedure itself. At this point all interface with the groups happens via the new/join/exit functions from |groups.cweb.c|. Thus, there is zero memory management code in this file--the reference counting guarantees us that when the last member exits a group, the group will be freed.

@c
#include "groups.h"

int pop=2000, periods=200, dimension=2;

@ In |main|, we'll initialize a few constants that we can't have as static variables because they require math.

@<set up more constants@>=
double new_group_odds = 1./pop, mass_benefit = .7/pop;
gsl_rng *r = apop_rng_alloc(1234);

@* The |person_s| structure.

@ The people in this simulation are pretty boring: they do not die, and do not move. So the struct that represents them is simple, with just |position| and a pointer to the group of which the agent is currently a member.

@c
typedef struct {
    gsl_vector *position;
    group_s *group;
} person_s;

@ The setup routine is also boring, and consists of allocating a uniform random vector in two dimensions.

@c
person_s person_setup(gsl_rng *r){
    gsl_vector *posn = gsl_vector_alloc(dimension);
    for (int i=0; i< dimension; i++)
        gsl_vector_set(posn, i, 2*gsl_rng_uniform(r)-1);
    return (person_s){.position=posn};
}

@* Group membership.

@ At the outset of this function, the person leaves its group. Then, the decision is only whether to form a new group or join an existing one.

@c
void check_membership(person_s *p, gsl_rng *r,
                      double mass_benefit, double new_group_odds){
    group_exit(p->group, p->position);
    p->group = (gsl_rng_uniform(r) < new_group_odds)
                ? @<form a new group@>
                : @<join the closest group@>;
}

@ @<form a new group@>=
group_new(p->position)

@ @<join the closest group@>=
group_join(group_closest(p->position, mass_benefit), p->position)

@* Setting up.

@ The initialization of the population. Using CWEB's macros, it is at this point self-documenting.

@c
void init(person_s *people, int pop, gsl_rng *r){
    @<position everybody@>
    @<start with ten groups@>
    @<everybody joins a group@>
}

@ @<position everybody@>=
for (int i=0; i< pop; i++)
    people[i] = person_setup(r);

@ The first ten people in our list form new groups, but because everybody's position is random, this is assigning the ten groups at random.

@<start with ten groups@>=
for (int i=0; i< 10; i++)
    people[i].group = group_new(people[i].position);

@ @<everybody joins a group@>=
for (int i=10; i< pop; i++)
    people[i].group = group_join(people[i%10].group, people[i].position);

@* Plotting with Gnuplot.

@ This is the header for Gnuplot. I arrived at it by playing around on Gnuplot's command line, then writing down my final picks for settings here.

@<the Gnuplot header@>=
printf("unset key;set xrange [-1:1]\nset yrange [-1:1]\n");

@ Gnuplot animation simply consists of sending a sequence of plot statements.

@<plot one animation frame@>=
print_groups();

@* |main|.

@ The |main| routine consists of a few setup steps, and a simple loop: calculate a new state, then plot it.

@c
int main(){
    @<set up more constants@>
    person_s people[pop];
    init(people, pop, r);
    @<the Gnuplot header@>
    for (int t=0; t< periods; t++){
        for (int i=0; i< pop; i++)
            check_membership(&people[i], r, mass_benefit, new_group_odds);
        @<plot one animation frame@>
    }
}
[18] “I once attended a Java user group meeting where James Gosling (Java’s inventor) was the featured speaker. During the memorable Q&A session, someone asked him: ‘If you could do Java over again, what would you change?’ ‘I’d leave out classes,’ he replied.” —Allen Holub, Why extends is evil
[19] Originally at http://www.joelonsoftware.com/articles/Wrong.html; reprinted in [Spolsky 2008].
[20] The discussion is from David Keaton, “Clarifications to Anonymous Structures and Unions”, WG14/N1549, 22 December 2010, voted on and adopted by the Committee in March of 2011.