We favor the simple expression of the complex thought.
...
We are for flat forms
Because they destroy illusion and reveal truth.

—Le Tigre, “Slideshow at Free University”
Here is the common format for a typical library, in C or in any other language: a few central data structures, plus many functions that act on those structures.
An XML library, for example, would have a structure representing an XML document and perhaps views of the document, plus lots of functions for going between the data structure and the XML file on disk, querying the structure for elements, et cetera. A database library would have a structure representing the state of communications with the database, and perhaps structures representing tables, plus lots of functions for talking to the database and dissecting the data it sends.
This is an eminently sensible way to organize a program or a library. It is the means by which an author can represent concepts with nouns and verbs that are appropriate to the problem at hand.
The first fun exercise in object-oriented programming (OOP) is defining the term, and although I won’t waste time (and invite flame wars) by giving a precise definition, the preceding description of an object-oriented library should give you a feel for what we are going after: a few central data structures, each with a set of functions that act on those central data structures. Building a structure like this is language-independent, unless you have a serious and firm belief in the Sapir-Whorf hypothesis.
Because OOP has so much baggage associated with it, the first part of this chapter is mostly concerned with what portion of the OOP world is really necessary (or even desirable) for structuring our writing. The discussion will then turn to what an object looks like in C, such as how we can have one structure inherit from another, or when to put struct-related methods inside the struct itself. The chapter will conclude with a few full-scale examples of objects with nontrivial problems, like the need for reference counting. Even with the limited syntactic tools we have, the framework works gracefully and is maintainable.
C is great for its simplicity, but with such a simple grammar, how are we to resolve the multiple virtual constructors one can be saddled with when instantiating an instance of a subclass derived via polymorphic inheritance?
And the simple answer, of course, is that we don’t.
You will find OOP proponents who insist that it’s not OOP without a message-passing infrastructure, object-level scoping constructs, operator overloading, class inheritance rules, et cetera; and for each of these features, you will find an OOP proponent who insists that the feature is an irrelevance to true OOP.[18]
It is so much easier to just stop worrying about it. To develop central structures and build functions that make use of them, C syntax easily provides the 10% of the edifice that creates 90% of the structure.
The scope of a variable is the range of the code over which it exists and can be used. The rule of thumb for sane programming is to keep the scope of a variable as small as practicable, because doing so limits the number of variables you have to keep in mind at any given point, and means lower risk that a variable will be changed by code you didn’t bear in mind.
OK, here goes: all of the rules for variable scope in C.
A variable never has scope in the code before it is declared. That would be silly.
If a variable is declared somewhere inside a pair of curly
braces, then at the closing curly brace, the variable goes out of
scope. Semi-exception: for loops
and functions may have variables declared in a set of parens just
before their opening curly brace; variables declared within the
parens have scope as if they were declared inside the curly
braces.
If a variable isn’t inside any curly braces, then it has scope from its declaration to the end of the file.
You’re done.
There is no class scope, prototype scope, friend scope, namespace
scope, runtime environment rebinding, or special scoping keywords or
operators (beyond those curly braces, and arguably the linkage
specifiers static and extern). Does dynamic scoping confuse you?
Don’t worry about it. If you know where the curly braces are, you can
determine which variables can be used where.
Everything else is a simple corollary. For example, if
code.c has a line that will #include <header.h>, then the full text
of header.h is pasted into
code.c, and variables therein have scope
accordingly.
Functions are just another example of curly-brace scope. Here is a sample function to sum all the integers up to the input number:
int sum (int max){
int total=0;
for (int i=0; i<= max; i++){
total += i;
}
return total;
}
Then max and total have scope inside the function, by the
curly-brace rule and the semi-exception about how variables in parens
just before the curly brace act as if they are inside the braces. The
same holds with the for loop, and how
i is born and dies with the curly
braces of the for loop. If you have a
one-line for loop, you don’t have to
write the curly braces, like for (int i=0; i <= max; i++)
total += i;, but the scope of i is still limited to the loop.
Summary paragraph: C is awesome for having such simple scoping
rules, which effectively consist of finding the end of the enclosing
curly braces or the end of the file. You can teach the whole scoping
system to a novice student in maybe 10 minutes. For the experienced
author, the rule is more general than just the curly braces for
functions and for loops, so you can
use them for occasional additional scoping restrictions in exceptional
situations, as in the macro tricks in Cultivate Robust and Flourishing Macros.
So we’re cathartically throwing out all the additional rules and keywords that support very fine-grained scope control.
Could we implement private struct elements without the extra keywords? In typical OOP usage, “private” data is not encrypted by the compiler or otherwise seriously hidden: if you have the address of the variable (e.g., if you have its offset in the struct), you can point to it, look at it in the debugger, and modify it. To give the data that level of opacity, we have the technology.
An object will typically be defined via two files: the .c file with the details and the .h file to be included in other writing that makes use of the object. It is not unreasonable to think of the .c file as the private segment and the .h file as the public. For example, say we are set on keeping some elements of an object private. The public header might be:
typedef struct a_box_s {
int public_size;
void *private;
} a_box_s;
The void pointer is basically
useless to other authors, because they don’t know what type to cast it
to. The private segment, a_box.c, would hold the
requisite typedef:
typedef struct private_box_s {
long double how_much_i_hate_my_boss;
char **coworkers_i_have_a_crush_on;
double fudge_factor;
} private_box_s;
//Given the typedef, we have no problem casting the private pointer to
//its desired type and making use here in a_box.c.
a_box_s *box_new(){
a_box_s *out = malloc(sizeof(a_box_s));
private_box_s *outp = malloc(sizeof(private_box_s));
*out = (a_box_s){.public_size=0, .private=outp};
return out;
}
void box_edit(a_box_s *in){
private_box_s *pb = in->private;
//now work with private variables, e.g.:
pb->fudge_factor *= 2;
}
So it’s not all that hard to implement a private segment of a C struct, but I rarely see it used in real-world libraries. Few C authors seem to think that there’s serious benefit to doing so.
Here’s a sample of the much more common means of putting a private element within a public struct:
typedef struct {
int pub_a, pub_b;
int private_a, private_b; //Private: please do not use these.
} public_s;
That is, document when something should not be used, and trust your users to not cheat. If your colleagues won’t follow an instruction as simple as this, then chain the coffee maker to the wall, because you’ve got problems bigger than a compiler can solve.
Functions are especially easy to make private: don’t put their
declaration in a header. Optionally, put the static keyword in front of the definition so
that readers know that the function is private.
My impression is that most folks think of integer division—that
3/2==1—as an annoyance. If I type in
3/2, I expect 1.5, darn it, not 1.
Indeed, this is an annoying gotcha in C and other integer-arithmetic languages, and more broadly, it shows us the dangers of operator overloading, which is when an operator, such as /, does something different depending on the types involved. For two integer types, the slash effectively does a divide-and-truncate operation; for anything else, it performs the usual division.
Recall the rule from Pointers Without malloc that
things that behave differently should look different. That’s the failure
of 3/2: integer division and
floating-point division behave differently, but look identical.
Confusion and bugs ensue.
Human language is redundant, which is a good thing, partly because it allows error correction. When Nina Simone says “ne me quitte pas” (which would translate word-for-word as “don’t leave me no”), it’s OK if you space out at the beginning, because “… me quitte pas” has the pas to indicate negation, and it’s OK if you space out at the end, because “ne me quitte …” has the ne to indicate negation.
Grammatical gender typically doesn’t have much real-world meaning, and sometimes objects will change depending on word choice. My favorite example is in Spanish, where el pene and la polla both refer to the same object, but the first is masculine and the second feminine. The real value to the gender is that it provides redundancy, forcing parts of the sentence to match, and thus adding clarity.
Programming languages avoid redundancy. We express negation
exactly once, typically with only one character (!). But programming languages do have genders,
where they’re called types. Generally, your verbs and your nouns need to
agree in type (as in Arabic, Hebrew, and Russian, among other
languages). With this added redundancy, you’d need matrix_multiply(a, b) when you have two
matrices, and complex_multiply(a, b)
when you have two complex numbers.
Operator overloading is about eliminating redundancy, writing
a * b whether you have a pair of
matrices, complex numbers, natural numbers, or sets. Here’s a snippet
from an excellent essay on the cost of that reduced redundancy: “When
you see the code i = j * 5; in C you
know, at least, that j is being
multiplied by five and the results stored in i. But if you see that same snippet of code in
C++, you don’t know anything. Nothing.”[19] The problem is that you don’t know what * means until you look up the type for
j, look through the inheritance tree
for j’s type to determine
which version of * you mean, and then you can start over with
identifying i and how that relates to
=, given the type of j.
So there’s the trade-off to operator overloading: you’ve saved
space on the page, and didn’t have to type much of anything, but have
lost all redundant hints and checks that b is actually a list and not the vector you
thought it was.
The C custom is to closely follow the sort of gender-agreement rules just described, for example:
//add two vectors in the GNU Scientific Library
gsl_vector *v1, *v2;
gsl_vector_add(v1, v2);
//Open a GLib I/O channel for reading at a given filename.
GError *e;
GIOChannel *f = g_io_channel_new_file("indata.csv", "r", &e);
It’s more typing, and when you have 10 lines acting on the same structure, things start to look repetitive. Later, we’ll have some means of slightly reducing this redundancy.
C provides limited overloading support via the C11 _Generic keyword. The keyword evaluates to a
value based on the type of its input, which lets you write macros that
consolidate some types together.
We need type-generic functions when we have a proliferation of types. Some systems provide a voluminous number of precise types, but every new type is another moving part that we have to support. For example, the GNU Scientific Library provides a complex number type, a complex vector type, and a vector type—and then there’s the C complex type. One could reasonably multiply any of those four types together, which theoretically means we need sixteen functions. Example 11-1 lists several of these functions; if you are not a complex vector aficionado, it would be entirely reasonable to recognize this example as a hairy mess and move on to the part where we clean it up.
#include "cplx.h" //gsl_cplx_from_c99; see below.
#include <gsl/gsl_blas.h> //gsl_blas_ddot
#include <gsl/gsl_complex_math.h> //gsl_complex_mul(_real)
gsl_vector_complex *cvec_dot_gslcplx(gsl_vector_complex *v, gsl_complex x){
gsl_vector_complex *out = gsl_vector_complex_alloc(v->size);
for (int i=0; i< v->size; i++)
gsl_vector_complex_set(out, i,
gsl_complex_mul(x, gsl_vector_complex_get(v, i)));
return out;
}
gsl_vector_complex *vec_dot_gslcplx(gsl_vector *v, gsl_complex x){
gsl_vector_complex *out = gsl_vector_complex_alloc(v->size);
for (int i=0; i< v->size; i++)
gsl_vector_complex_set(out, i,
gsl_complex_mul_real(x, gsl_vector_get(v, i)));
return out;
}
gsl_vector_complex *cvec_dot_c(gsl_vector_complex *v, complex double x){
return cvec_dot_gslcplx(v, gsl_cplx_from_c99(x));
}
gsl_vector_complex *vec_dot_c(gsl_vector *v, complex double x){
return vec_dot_gslcplx(v, gsl_cplx_from_c99(x));
}
complex double ddot (complex double x, complex double y){return x*y;}
void gsl_vector_complex_print(gsl_vector_complex *v){
for (int i=0; i< v->size; i++) {
gsl_complex x = gsl_vector_complex_get(v, i);
printf("%4g+%4gi%c", GSL_REAL(x), GSL_IMAG(x), i < v->size-1 ? '\t' : '\n');
}
}
The cleanup happens in the header, Example 11-2. It
uses _Generic to select one of the functions from
Example 11-1 based on the input types. The first
argument (the ‟controlling expression”) is not evaluated, but is
simply checked for its type, and the value of the
_Generic statement is selected based on that type.
We want to select a function based on two types, so the first macro
picks which of the second or third macros to use.
Example 11-2. _Generic to provide a simple front-end to the mess (cplx.h)
#include <complex.h> //nice names for C’s complex types
#include <gsl/gsl_vector.h> //gsl_vector_complex
gsl_vector_complex *cvec_dot_gslcplx(gsl_vector_complex *v, gsl_complex x);
gsl_vector_complex *vec_dot_gslcplx(gsl_vector *v, gsl_complex x);
gsl_vector_complex *cvec_dot_c(gsl_vector_complex *v, complex double x);
gsl_vector_complex *vec_dot_c(gsl_vector *v, complex double x);
void gsl_vector_complex_print(gsl_vector_complex *v);
#define gsl_cplx_from_c99(x) (gsl_complex){.dat= {creal(x), cimag(x)}}
complex double ddot (complex double x, complex double y);
#define dot(x,y) _Generic((x), \
gsl_vector*: dot_given_vec(y), \
gsl_vector_complex*: dot_given_cplx_vec(y), \
default: ddot)((x),(y))
#define dot_given_vec(y) _Generic((y), \
gsl_complex: vec_dot_gslcplx, \
default: vec_dot_c)
#define dot_given_cplx_vec(y) _Generic((y), \
gsl_complex: cvec_dot_gslcplx, \
default: cvec_dot_c)
gsl_complex and C99 complex double are both two-element arrays consisting of a real double followed by an
imaginary double [see the GSL manual and C99 & C11
§6.2.5(13)]. All we have to do is build the appropriate struct—and
a compound literal is the perfect way to build a struct on the
fly.
The first use of x is not actually
evaluated, just checked for its type. That means that a call like
dot(x++, y) would increment
x only once.
In Example 11-3, life is (mostly) easy again: we
can use dot to find the product of a
gsl_vector times a gsl_complex,
a gsl_vector_complex times a C complex, and so on
for a great many combinations. Of course, you still need to know the
output type, because the return value of a scalar times a scalar is a
scalar, not a vector, so the use of the output depends on the input
types. The proliferation of types is a fundamental problem, but the
_Generic facility at least provides a
band-aid.
Example 11-3. dot (almost) regardless of input types (simple_cplx.c)
#include <stdio.h>
#include "cplx.h"
int main(){
int complex a = 1+2I;
complex double b = 2+I;
gsl_complex c = gsl_cplx_from_c99(a);
gsl_vector *v = gsl_vector_alloc(8);
for (int i=0; i< v->size; i++) gsl_vector_set(v, i, i/8.);
complex double adotb = dot(a, b);
printf("(1+2i) dot (2+i): %g + %gi\n", creal(adotb), cimag(adotb));
printf("v dot 2:\n");
double d = 2;
gsl_vector_complex_print(dot(v, d));
printf("v dot (1+2i):\n");
gsl_vector_complex *vc = dot(v, a);
gsl_vector_complex_print(vc);
printf("v dot (1+2i) again:\n");
gsl_vector_complex_print(dot(v, c));
}
Declarations with complex are a bit like
declarations with const: both complex
int and int complex are
valid.
Finally, the payoff: this function will use the
dot function four times, each with different
input types.
Here are the C-native means of getting the real and imaginary parts of a complex number.
Here’s my own rule of thumb for overloading, via _Generic or whatever other means:
if users have no idea what the input type is, will they
still get the right answer? Observe that the overloading of
absolute value for int, float, and double work just fine with this
rule. Following the example, it might also make sense to overload
functions using the
gsl_complex and
the native C complex double.
Moving on from the syntactic tricks, let us get to the core problem of structuring our data.
In 1936, in response to a formal mathematical question (the Entscheidungsproblem), Alonzo Church developed the lambda calculus, a formal means of describing functions and variables. In 1937, in response to the same question, Alan Turing described a formal language in the form of a machine with a tape holding data and a head that can be shifted along the tape to read and write to the tape. Later, Church’s lambda calculus and Turing’s machine were shown to be equivalent: any calculation you could express in one, you could express in the other. It’s been the same divide ever since, and Church’s and Turing’s constructions continue to be the root of how we structure our data.
The lambda calculus relies heavily on named lists; in lambda-inspired pseudocode, we might express a person’s information as:
(person (
(name "Sinead")
(age 28)
(height 173)
))
With Turing’s machine, we would have a block of the tape set aside for the structure. The first few blocks would be a name, the next few would hold the age, and so on. Almost a century later, Turing’s tape is still a not-bad description of computer memory: recall from All the Pointer Arithmetic You Need to Know that this base-plus-offset form is exactly how C treats structures. We would write
typedef struct {
char * name;
double age, height;
} person;
person sinead = {.name="Sinead", .age=28, .height=173};
and sinead would name a block
of memory, and sinead.height would
point to the tape immediately after name and age
(and after any padding for alignment purposes).
Here are some differences between the list approach and the block-of-memory approach:
Telling the computer to go to a certain offset from a
certain address is still among the fastest operations a machine can
execute. Your C compiler even does the translation from labels to
offsets during compile time. Conversely, finding something in the list
requires a lookup: given the label "age", which element in the list corresponds
and where is its data in memory? Every system has techniques to make
this as fast a lookup as possible, but a lookup will always be more
work than a simple base-plus-offset.
Adding a new element to a list is a much easier process than adding to a struct, which is basically fixed at compile time.
A C compiler can tell you at compile time that hieght is a typo, because it can look in the
struct’s definition and see that there is no such element. Because a
list is extensible, we won’t know that there is no hieght element until the program runs and checks on the list.
Those last two items demonstrate a direct tension: we want extensibility, wherein we can add elements to a structure; we want registration, wherein things that are not in the structure are flagged as errors. That’s a balance that has to be struck, and everybody implements controlled extension of an existing list differently.
C++, Java, and their siblings have a syntax for producing a new
type that is an instance of the type to be extended, but that inherits
the old type’s elements. You still get base-plus-offset speed, and
compile-time checking, but at the price of voluminous paperwork; where C
has struct and its absurdly simple
scoping rules from Scope, Java has implements, extends, final, instanceof, class, this, interface, private, public, protected.
Perl, Python, and many LISP-inspired languages are based on named lists, so that is a natural means of implementing a structure. Extend the list by just adding elements to it. Pros: fully extensible by just adding a new named item. Cons: as previously, we don’t get registration, and although you can improve the name search via various tricks, you’re a long way from the speed of a single base-plus-offset step. Many languages in this family have a class definition system, so that you can register a certain set of list items and thus check whether future uses conform to the definition, which, when done right, provides a nice compromise between checking and ease of extension.
Getting back to plain old C, its structs are the fastest way to access a structure’s elements, and we get compile-time checking at the price of runtime extensibility. If you want a flexible list that can grow as the runtime need arises, you will need a list structure, such as the GLib’s hashes, or the sample dictionary described later.
All the machinery you have in C for extending a structure is to wrap it in another structure. Say that we have a type defined via a form such as:
typedef struct {
...
} list_element_s;
which is already packaged and cannot be changed, but we’d like to add a type marker. Then we’ll need a new structure:
typedef struct {
list_element_s elmt;
char typemarker;
} list_element_w_type_s;
Pros: this is so stupid easy, and you still get the speed bonus.
Cons: Now, every time you want to refer to the name of the element,
you’ll need to write out the full path, your_typed_list->elmt.name, instead of
what you’d get via a C++/Java-like extension: your_typed_list->name. Add a few layers to
this and it starts to get annoying. You already saw in Pointers Without malloc how using aliases can help here.
C11 allows us to include anonymous elements of a structure,
which make structs within structs easier to use. Although this got
added to the standard in December of 2011, it was a Microsoft
extension for a long time before then, and gcc already allows it given the -fms-extensions flag on the command line.
The syntax: include another struct type somewhere in the
declaration of the new structure, as per the point struct in Example 11-4, without a name for the element. Example 11-4 uses a bare structure type name, struct point, whereas a named declaration
would be something like struct
point ptelement. All of the
elements of the referred-to structure are included in the new
structure as if they were declared in place.
Example 11-4 extends a 2D point into a 3D
point. So far, it is only notable because the threepoint struct extends the point seamlessly, to the point where users
of the threepoint won’t even know
that its definition is based on another struct.
#include <stdio.h>
#include <math.h>
typedef struct point {
double x, y;
} point;
typedef struct {
struct point;
double z;
} threepoint;
double threelength (threepoint p){
return sqrt(p.x*p.x + p.y*p.y + p.z*p.z);
}
int main(){
threepoint p = {.x=3, .y=0, .z=4};
printf("p is %g units from the origin\n", threelength(p));
}
This is anonymous. The not-anonymous version would also have
a name like struct point
twopt.
The x and y elements of the point structure look and behave exactly
like the additional z element
of the threepoint.
Even the declaration gives no hint that x and y were inherited from an existing
structure.
It’s not in the standard to use a typedef in the nested
anonymous declaration. The standards committee explicitly rejected
this, thus going out of its way to produce one of the few points
in the C language where a typedef for a struct cannot substitute for
the struct definition.[20] Though, if you use the naming convention from Typedefs Save the Day, this just means that you need to put the word
struct in front of the name of
the structure type.
Now for the trick that really makes this useful. The original
object, the point, was probably
accompanied by several interface functions that are still potentially
useful, like a length function
measuring the distance between zero and the given point. How are we
going to use that function, now that we don’t have a name for that
subpart of the larger structure?
The solution is to use an anonymous union of a named point and an unnamed point. Being the union of two identical
structures, the two structures share absolutely everything, and the
only distinction is in the naming: use the named version when you need
to call functions that use the original struct as an input, and use
the anonymous version for seamless merging into the larger struct.
Example 11-5 rewrites Example 11-4
using this trick.
#include <stdio.h>
#include <math.h>
typedef struct point {
double x, y;
} point;
typedef struct {
union {
struct point;
point p2;
};
double z;
} threepoint;
double length (point p){
return sqrt(p.x*p.x + p.y*p.y);
}
double threelength (threepoint p){
return sqrt(p.x*p.x + p.y*p.y + p.z*p.z);
}
int main(){
threepoint p = {.x=3, .y=0, .z=4};
printf("p is %g units from the origin\n", threelength(p));
double xylength = length(p.p2);
printf("Its projection onto the XY plane is %g units from the origin\n", xylength);
}
This is an anonymous structure.
This is a named structure. Being part of a union, it is identical to the anonymous structure, differing only in having a name.
The point structure is
still seamlessly included in the threepoint structure, but...
...the p2 element is a
named element like it always was, so we can use it to call the
interface functions written around the original struct.
After the declaration threepoint
p, we can refer to the X coordinate via
p.x (because of the anonymous
struct) or via p.p2.x (because of
the named struct). The last line of the example shows the length when
projecting onto the XY plane, and does so using
length(p.p2).
From here, the possibilities span ℝ³. You can extend any structure and still use all of the functions associated with the original.
Did you notice this is the first time I’ve used the union keyword in this book? Unions are
another one of those things where the explanation takes a
paragraph—it’s like a struct, but
all of the elements occupy the same space—and then the caveats about
how to not hang yourself take up several pages. Memory is cheap, and
for writing applications, we don’t have to care about memory
alignment, so sticking to structs will reduce the possibility of
errors, even when only one element is used at a time.
Inheriting from multiple structures with this method is risky.
Pick one structure to be the base of the extension using the union
trick, and let the others be extended via the plain vanilla
substructure. For example, the GNU Scientific Library has matrix and
vector types (where struct
_gsl_matrix is typedeffed as gsl_matrix). Let us say that we want to put
both into a single structure:
typedef struct { //Alas, this will fail.
    struct _gsl_vector;
    struct _gsl_matrix;
} data_set;
data_set d;
This looks innocuous, until you find out that the _gsl_vector and the _gsl_matrix both have an element named
data. When we refer to d.data, are we referring to the data element of the matrix or the vector? We
have no syntax for selective inclusion or renaming struct elements, so
the best we can do is pick the matrix or vector as primary and the
other as secondary, callable only by its subelement name:
typedef struct{ //A vector with supporting matrix.
struct _gsl_vector; //anonymous and seamless.
struct _gsl_matrix matrix; //named
} data_set;
data_set d;
Earlier, I mentioned differences between a C struct and a
LISP-like dictionary of named elements. But a dictionary is an easy
structure to generate, given what we have in struct-based C. Doing so is
a fine chance to try building some objects. Please note, however, that
fleshing this out and making it bulletproof has already been done by
other authors; see the GLib’s keyed data tables or GHashTable, for example. The point here is
simply that having compound structs plus simple arrays equals a short
hop to a dictionary object.
We’re going to start with a simple key/value pair. Its mechanism will be in keyval.c. The header in Example 11-6 lists the structure and its interface functions.
typedef struct keyval{
    char *key;
    void *value;
} keyval;

keyval *keyval_new(char *key, void *value);
keyval *keyval_copy(keyval const *in);
void keyval_free(keyval *in);
int keyval_matches(keyval const *in, char const *key);
Those of you with experience in traditional object-oriented
programming languages will find this to be very familiar. The first
instinct when establishing a new object is to write down new/copy/free
functions, and that is what the example does. After that, there are
typically a few structure-specific functions, such as the keyval_matches function to check whether the
key in a keyval pair matches the
input string.
Having new/copy/free functions means that your memory-management worries are brief: in the new and copy functions, allocate the memory with malloc; in the free function, remove the structure with free. Having set up these functions, code that uses the object never calls malloc or free on it directly, but trusts that keyval_new, keyval_copy, and keyval_free do all the memory management correctly.
#include <stdlib.h> //malloc
#include <strings.h> //strcasecmp
#include "keyval.h"
keyval *keyval_new(char *key, void *value){
keyval *out = malloc(sizeof(keyval));
*out = (keyval){.key = key, .value=value};
return out;
}
/** Copy a key/value pair. The new pair has pointers to
the values in the old pair, not copies of their data. */
keyval *keyval_copy(keyval const *in){
keyval *out = malloc(sizeof(keyval));
*out = *in;
return out;
}
void keyval_free(keyval *in){ free(in); }
int keyval_matches(keyval const *in, char const *key){
return !strcasecmp(in->key, key);
}
Now that we have an object representing a single key/value pair, we can move on to establishing a dictionary as a list of these. Example 11-8 provides the header.
#include "keyval.h"

extern void *dictionary_not_found;

typedef struct dictionary{
    keyval **pairs;
    int length;
} dictionary;

dictionary *dictionary_new (void);
dictionary *dictionary_copy(dictionary *in);
void dictionary_free(dictionary *in);
void dictionary_add(dictionary *in, char *key, void *value);
void *dictionary_find(dictionary const *in, char const *key);
As you can see, you get the same new/copy/free functions, plus a few other dictionary-specific functions, and a marker to be described later. Example 11-9 provides the private implementation.
#include <stdio.h>
#include <stdlib.h>
#include "dict.h"
void *dictionary_not_found;
dictionary *dictionary_new (void){
static int dnf;
if (!dictionary_not_found) dictionary_not_found = &dnf;
dictionary *out= malloc(sizeof(dictionary));
*out= (dictionary){0};
return out;
}
static void dictionary_add_keyval(dictionary *in, keyval *kv){
in->length++;
in->pairs = realloc(in->pairs, sizeof(keyval*)*in->length);
in->pairs[in->length-1] = kv;
}
void dictionary_add(dictionary *in, char *key, void *value){
if (!key){fprintf(stderr, "NULL is not a valid key.\n"); abort();}
dictionary_add_keyval(in, keyval_new(key, value));
}
void *dictionary_find(dictionary const *in, char const *key){
for (int i=0; i< in->length; i++)
if (keyval_matches(in->pairs[i], key))
return in->pairs[i]->value;
return dictionary_not_found;
}
dictionary *dictionary_copy(dictionary *in){
dictionary *out = dictionary_new();
for (int i=0; i< in->length; i++)
dictionary_add_keyval(out, keyval_copy(in->pairs[i]));
return out;
}
void dictionary_free(dictionary *in){
for (int i=0; i< in->length; i++)
keyval_free(in->pairs[i]);
free(in);
}
It is reasonable to have a NULL value in the key/value table, so we
need a unique marker to indicate a missing value.
Recall that a function marked as static cannot be used outside the file, so
this is one more reminder that the function is private to the
implementation.
I cheated again: using abort like this is bad form. It would be
better to use a macro like the one in stopif.h
(Example 10-2). I did it this way
to demonstrate a feature of the test harness in Unit Testing.
Now that we have a dictionary, Example 11-10 can use it without thinking about memory management, which the new/copy/free/add functions take care of, and without making reference to key/value pairs, because that is one level too low for our purposes.
#include <stdio.h>
#include "dict.h"

int main(){
    int zero = 0;
    float one = 1.0;
    char two[] = "two";
    dictionary *d = dictionary_new();
    dictionary_add(d, "an int", &zero);
    dictionary_add(d, "a float", &one);
    dictionary_add(d, "a string", &two);
    printf("The integer I recorded was: %i\n", *(int*)dictionary_find(d, "an int"));
    printf("The string was: %s\n", (char*)dictionary_find(d, "a string"));
    dictionary_free(d);
}
So writing a struct and its new/copy/free and other associated functions was enough to give us the right level of encapsulation: the dictionary didn’t have to care about the internals of the key/value pair, and the application didn’t have to worry about dictionary internals.
The boilerplate code is not as bad as it is in some languages, but there is certainly some repetition to the new/copy/free functions. And as the examples continue, you’ll see this boilerplate several times more.
At some point, I even wrote myself macros to autogenerate these. For example, the copy functions differ only in dealing with internal pointers, so we could write a macro to automate all the boilerplate not about internal pointers:
#define def_object_copy(tname, ...) \
void * tname##_copy(tname *in) { \
tname *out = malloc(sizeof(tname)); \
*out = *in; \
__VA_ARGS__; \
return out; \
}
def_object_copy(keyval) //Expands to the previous declaration of keyval_copy.

But the redundancy is nothing to worry about all that much. Despite our mathematical æsthetic of minimizing repetition and text on the page, sometimes having more code really does make the program more readable and robust.
Why did I base all of these things on pointers to data structures, instead of just passing around data structures directly? If you use a plain struct, the new/copy/free functions write themselves:

new
Use designated initializers on the first line where you need a struct. As an added plus, structures can be declared at compile time, so they are immediately available to users without an initial call to a setup function.

copy
The equals sign does this.

free
Don’t bother; it’ll go out of scope soon enough.
So we’re making things more difficult for ourselves with pointers. Yet from what I’ve seen, there’s consensus on using pointers to objects as the base of our designs.
Pros to using pointers:
Copying a single pointer is cheaper than copying a full structure, so you save a microsecond on every function call with a struct as an input. Of course, this only adds up after a few billion function calls.
Data structure libraries (your trees and linked lists, for example) are all written around hooks for a pointer.
Now that you’re filling a tree or a list, having the system automatically free the struct at the end of the scope in which it was created might not be what you want.
Many of your functions that take in a struct will modify the struct’s contents, meaning that you’ve got to pass a pointer to the struct anyway. Having some functions that take in the struct itself and some that take in a pointer to struct is confusing (I have written an interface like this and I regret it), so you might as well just send a pointer every time.
If the contents of the struct include a pointer to data elsewhere, then the convenience bonus from using a plain struct evaporates anyway: if you want a deep copy (wherein the data pointed to is copied, not just the pointer) then you need a copy function, and you will probably want a free function to make sure the internal data is eliminated.
There’s no one-size-fits-all set of rules for using structs. As your project gets bigger, and a throwaway struct grows into a core of how your data is organized, the benefits to pointers wax and the benefits to nonpointers wane.
A struct can include pointers to functions among its member elements as easily as it can hold typical variables.
Given a struct object_s and a function
fn that is somehow related to it, we can go the
usual route of leaving the function outside of the structure but give it a
name that indicates a relationship,
object_fn(...), or we could let
fn be an element of an
object_s, so that when we declare
object_s xo, we would call the function via
xo.fn(...).
This is, for the most part, a stylistic choice, affecting how we
look up functions in the documentation and how the code looks on the page.
The documentation issue, by the way, is why I prefer the
object_fn naming scheme over the
fn_object scheme: with the first form, the
documentation’s index lists all of object_s’s
associated functions in one place.
The real advantage of the element-of-struct form is that you can
more easily change the function associated with every instance of the
object: given the declaration object_s xo, yo and a function element of the struct
named add, then xo.add and yo.add could be entirely different functions. On
the plus side, you can send xo and
yo to a routine with a header do_math(object_s
in), and it can call in.add somewhere in there without any regard to
what add actually does. On the minus
side, we run the risk of once again breaking the rule that things that do
different things should look different: if the function xo.add has subtly different side effects from
yo.add, you have no warnings.
This is why I generally prefer the
object_fn form for functions related to
objects. If there are two or three similar but distinct types of operation
on the same object, I can give them different names (like dictionary_add and dictionary_add_keyval in Example 11-9). I reserve the
xo.fn form for when I expect that almost every
instance of object_s will have a different
version of fn.
It’s hard to come by situations where every instance of an object
has different methods associated, which is the point where having methods
inside of an object really starts to make a difference. I’ve had the good
fortune of running across such a situation, when writing a library of
statistical models, named Apophenia (see The GNU Scientific Library). This segment will present a
model_s object along similar lines, and
demonstrate the details of hooking different procedures to every object.
I’ve written this section so that if you read all of the mathematics and
statistical jargon as blahblahblah, then you should
be able to follow along just fine.
Broadly, statistical models are all the same: they have some procedure, like the equation for a bell curve or a story about a line of best fit, and some parameters that fine-tune the procedure to the data. For example, we might specify that the data is from a bell-curve Normal distribution, whose tuning parameters are a mean (μ) and a standard deviation (σ). For the input data [1, 2, 3, 4, 5], the mean is three (because it’s symmetric around three), and if you do the math, you’ll find that the sample σ is about 1.58. The black box representing the estimation procedure would take in that data and spit out (μ, σ) = (3, 1.58). If I give the black box [100, 100.2, 100.8, 100.7, 100.4], then it will spit out a larger μ and smaller σ: (μ, σ) = (100.42, 0.33).
You might be familiar with long tail distributions, like how a handful of books sell millions of copies but thousands of books only sell a few dozen copies. This is hardly a bell curve where everything drifts around an average sales figure, and the calculations for estimating the parameter for a Zipf distribution (one of the more common formalizations of the long tail idea) have nothing to do with the math for a bell curve. The black box has a similar form, though: give it [1, 2, 3, 4, 5], and it spits out β=1.7; give it [100, 100.2, 100.8, 100.7, 100.4] and it spits out β=1.2.
So we’ve hit heterogeneity: the model struct should have elements
named parameters and data, and there should be an estimate function to go from data to parameters,
but it will be a different function for every model.
The implementation of this is not much of a stretch from the typical
structure implementation. Assume a typedef for data_s appropriate to describe a data set; then
we have enough to declare Normal and Zipf objects. Here is a mock-up of
the key parts of the process:
typedef model_s * (*data_to_estimated_model)(model_s *, data_s *);

typedef struct model_s {
    data_s *parameters, *data;
    data_to_estimated_model estimate;
} model_s;

//Takes in a copy of the input model with NULL parameters.
//Returns a copy of the input model with the parameters calculated.
static model_s *normal_est(model_s *in, data_s *d){
    model_s *outmodel = model_copy(in);
    //math goes here to calculate outmodel->parameters;
    return outmodel;
}

static model_s *zipf_est(model_s *in, data_s *d){
    model_s *outmodel = model_copy(in);
    //entirely different math goes here to set outmodel->parameters;
    return outmodel;
}

model_s *normal_model = &(model_s){.estimate = normal_est};
model_s *zipf_model = &(model_s){.estimate = zipf_est};

data_s *data = text_to_data("infile");
model_s *norm_estimated = normal_model->estimate(normal_model, data);
model_s *zipf_estimated = zipf_model->estimate(zipf_model, data);
As per Typedef as a teaching tool, a good typedef makes life with function pointers much more pleasant.
We can put a pointer to a function into a struct as easily as we include any other pointer. Now all that’s left to do is establish what function any given function pointer will point to.
I assume that model_copy is defined
elsewhere. The copy function is the same for every model, so there’s
no major benefit to putting it inside the structure.
Here, two model_s structures are initialized,
via designated initializers. The estimate pointers
in the two models each point to a different function.
An imaginary sample use of the two models to produce versions with parameters estimated.
By the last two lines, we are on our way to having a uniform
interface to entirely distinct functions. You could picture a function
that takes in data and a model_s, names
it m, and calls m.estimate(m, data).
Note the use of the static
keyword, which indicates that outside of this file, no code will be able
to call normal_est or zipf_est by those names. They will, however, be
able to use the names normal_model->estimate and zipf_model->estimate to call them.
There are a few bells and whistles that we’d like to add. First,
zipf_model.estimate(zipf_model, data)
is a redundant form, with a repetition of zipf_model. It would be nice to be able to write
zipf_model.estimate(data) and let the
system just know that the first argument should be the object making the
call. The function might see a special variable named this or self,
or we could add a special-case rule that
object.fn(x) gets reshuffled to
fn(object, x).
Sorry, but it’s not going to happen in C.
C doesn’t define magic variables for you, and it is always honest
and transparent about what parameters get sent in to a function. Normally,
if we want to futz around with the parameters of a function, we do it with
the preprocessor, which will gladly rewrite f(anything) to f(anything else). However, all of the
transformations happen to what goes on inside of the parens. There’s no
way to get the preprocessor to transform the text s.prob(d) to s.prob(s,
d). If you don’t want to slavishly imitate C++-type syntax, you
can write macros like:
#define Estimate(in, ...) (in).estimate((in), __VA_ARGS__)
#define Copy(in, ...) (in).copy((in), __VA_ARGS__)
#define Free(in, ...) (in).free((in), __VA_ARGS__)
But now you’ve cluttered up the global namespace with the Estimate, Copy, and Free symbols. Maybe it’s worth it to you (especially given that every object should have associated copy and free functions).
You could keep the namespace organized and prevent name collisions by naming your macros appropriately:
#define Model_estimate(in, ...) (in).estimate((in), __VA_ARGS__)
#define Model_copy(in, ...) (in).copy((in), __VA_ARGS__)
My preferred alternative is a dispatch function, a thin wrapper
function that solves this redundancy annoyance and provides one more
benefit over the simple macros shown: defaults. Let us say that we have no
estimation routine for a given model. Given a log likelihood function we
could use maximum likelihood techniques to estimate the parameters (that
is, blahblahblah). The default function would look
much like the specific estimation routines earlier. Here is a mock-up,
making use of a presumed log_likelihood
method added to the struct:
model_s *default_estimation(model_s *in, data_s *d){
    model_s *outmodel = model_copy(in);
    //math making use of the additional in->log_likelihood element here
    return outmodel;
}

Now for the dispatch function:
model_s * model_estimate(model_s *p, data_s *d){
//Place error checking here.
if (p->estimate) return p->estimate(p, d);
else return default_estimation(p, d);
}

Usage:
model_s normal_model = {.estimate = normal_est};
model_s ad_hoc_model = {.log_likelihood = ad_hoc_ll};

model_s *norm_est = model_estimate(&normal_model, data);
model_s *adhoc_est = model_estimate(&ad_hoc_model, data);
We have achieved a homogeneous form out of still more heterogeneous
parts—a model structure’s estimate function could be NULL and we can still call the same model_estimate function,
though the default will need to have a .log_likelihood element.
So dispatch functions gave us default routines, solved the annoyance
of not having a magic this or self variable, and did so in a manner that looks
similar to the usual interface functions like model_copy or model_free.
There are other ways to do it. Earlier, I used designated
initializers to set up the functions, so unspecified elements are NULL and a dispatch function makes sense. If we
required that users always use a model_new function, then we could set the
default functions there. Then eliminating the redundancy of
mymodel.estimate(mymodel,
data) can be done by a simple macro, as
previously.
Once again, you’ve got options. We already have more than enough syntactic tools to uniformly call diverse functions for diverse objects. That just leaves the hard part of writing those diverse functions so that calling them in a uniform manner always behaves as expected.
The remainder of this chapter presents a few more examples of building objects, and how we can modify the boilerplate new/copy/free functions to handle nontrivial situations. As already noted, the examples are going to be more involved, taking into account more real-world considerations and doing something interesting.
The first example presents a small library that has one structure to speak of, which is intended to read an entire file into a single string. Having all of Moby Dick in a single string in memory is not a big deal at all, but having a thousand copies of it floating around starts to be wasteful. So instead of copying the potentially very long data string, we’ll have views that just mark different start and end points.
Now that we have several views of the string, we need to free the string exactly once, when the string no longer has any views attached. Thanks to the object framework, it’s easy to make this happen.
The second example, an agent-based microsimulation of group formation, has a similar problem: the groups should exist as long as they have members, and need to be freed if and when the last member leaves.
The trick to having a lot of objects pointing to the same string is to add a reference count element to the structure. Modify the four boilerplate elements as follows:
The type definition includes a pointer-to-integer named
refs. It will be set up only once
(via the new function), and all
copies (made via the copy
function) will share the string and this reference counter.
The new function sets up
the refs pointer and sets
*refs = 1.
The copy function copies
the original struct into the output copy and increments the
reference count.
The free function
decrements the reference count and, if it has hit zero, frees the
shared string.
Example 11-11 provides the header for the string manipulation example, fstr.h, which introduces the key structure representing a segment of a string and an auxiliary structure representing a list of these string segments.
#include <stdio.h>
#include <stdlib.h>
#include <glib.h>
typedef struct {
char *data;
size_t start, end;
int* refs;
} fstr_s;
fstr_s *fstr_new(char const *filename);
fstr_s *fstr_copy(fstr_s const *in, size_t start, size_t len);
void fstr_show(fstr_s const *fstr);
void fstr_free(fstr_s *in);
typedef struct {
fstr_s **strings;
int count;
} fstr_list;
fstr_list fstr_split (fstr_s const *in, gchar const *start_pattern);
void fstr_list_free(fstr_list in);
I hope these headers are starting to get boring for you. It’s just the same typedef/new/copy/free over and over again.
The fstr_list struct had
originally been intended to be a throwaway, not quite a full object,
but I found myself using it to organize much of the code. Such
accidental falling upon structure is a good thing, and we should
encourage it. Notice that the fstr_split function returns the list, not
a pointer to the list.
Example 11-12 shows the library,
fstr.c. It uses GLib to read in the text file and
for Perl-compatible regular expression parsing. The numbered callouts
focus on the steps at the head of this section, so you can follow them
to trace the use of the refs element
to implement reference counting.
#include "fstr.h"
#include "string_utilities.h"
fstr_s *fstr_new(char const *filename){
fstr_s *out = malloc(sizeof(fstr_s));
*out = (fstr_s){.start=0, .refs=malloc(sizeof(int))};
out->data = string_from_file(filename);
out->end = out->data ? strlen(out->data) : 0;
*out->refs = 1;
return out;
}
/** Make a new fstr_s that is a substring of the input fstr_s.
\param in The parent string.
\param start The offset in the original string where the substring starts.
\param len The length of the substring. If longer than the available text,
the substring will only go to the end of the parent string.
*/
fstr_s *fstr_copy(fstr_s const *in, size_t start, size_t len){
fstr_s *out = malloc(sizeof(fstr_s));
*out=*in;
out->start += start;
if (in->end > out->start + len)
out->end = out->start + len;
(*out->refs)++;
return out;
}
void fstr_free(fstr_s *in){
(*in->refs)--;
if (!*in->refs) {
free(in->data);
free(in->refs);
}
free(in);
}
/** Split an input string into a sequence of substrings
\param in The input string to split.
\param start_pattern The regex marking the beginning of a new substring.
\return A list of substrings.
*/
fstr_list fstr_split (fstr_s const *in, gchar const *start_pattern){
fstr_s **out=malloc(sizeof(fstr_s*));
int outlen = 1;
out[0] = fstr_copy(in, 0, in->end);
GRegex *start_regex = g_regex_new (start_pattern, 0, 0, NULL);
gint mstart=0, mend=0;
fstr_s *remaining = fstr_copy(in, 0, in->end);
do {
GMatchInfo *start_info;
g_regex_match(start_regex, &remaining->data[remaining->start],
0, &start_info);
g_match_info_fetch_pos(start_info, 0, &mstart, &mend);
g_match_info_free(start_info);
if (mend > 0 && mend < remaining->end - remaining->start){
out = realloc(out, ++outlen * sizeof(fstr_s*));
out[outlen-1] = fstr_copy(remaining, mend, remaining->end-mend);
out[outlen-2]->end = remaining->start + mstart;
remaining->start += mend;
} else break;
} while (1);
fstr_free(remaining);
g_regex_unref(start_regex);
return (fstr_list){.strings=out, .count=outlen};
}
void fstr_list_free(fstr_list in){
for (int i=0; i < in.count; i++) fstr_free(in.strings[i]);
free(in.strings);
}
void fstr_show(fstr_s const *fstr){
printf("%.*s", (int)fstr->end-fstr->start, &fstr->data[fstr->start]);
}
For a new fstr_s, the
reference count is set to one. Otherwise, this function is a
boilerplate new object function.
The copy function copies the fstr_s sent in, and sets the start and
endpoints to the substring given (making sure that the endpoint
doesn’t go past the endpoint of the input fstr_s).
Here’s where the reference count gets incremented.
Here’s where the reference count gets used, to determine whether the base data should be freed or not.
Else, no match or out of bounds.
And finally, an application. To make this work, you’ll need a copy of Moby Dick, or the Whale, by Herman Melville. If you don’t have a copy on your drive, try Example 11-13 to download one from Project Gutenberg.
if [ ! -e moby ]; then
  curl -A "Mozilla/4.0" http://www.gutenberg.org/cache/epub/2701/pg2701.txt \
  | sed -e '1,/START OF THIS PROJECT GUTENBERG/d'                           \
  | sed -e '/End of Project Gutenberg/,$d'                                  \
  > moby
fi
Now that you have a copy of the book, Example 11-15 splits it into chapters and uses the same
splitting function to count the uses of the words
whale(s) and I in each
chapter. Notice that the fstr structs
can be used as opaque objects at this point, using only the new, copy,
free, show, and split functions.
The program requires GLib, fstr.c, and the string utilities from earlier in the book, so the basic makefile is now as in Example 11-14.
P=cetology
CFLAGS=`pkg-config --cflags glib-2.0` -g -Wall -std=gnu99 -O3
LDLIBS=`pkg-config --libs glib-2.0`
objects=fstr.o string_utilities.o

$(P): $(objects)
#include "fstr.h"

int main(){
    fstr_s *fstr = fstr_new("moby");
    fstr_list chapters = fstr_split(fstr, "\nCHAPTER");
    for (int i=0; i< chapters.count; i++){
        fstr_list for_the_title = fstr_split(chapters.strings[i], "\\.");
        fstr_show(for_the_title.strings[1]);
        fstr_list me = fstr_split(chapters.strings[i], "\\WI\\W");
        fstr_list whales = fstr_split(chapters.strings[i], "whale(s|)");
        fstr_list words = fstr_split(chapters.strings[i], "\\W");
        printf("\nch %i, words: %i.\tIs: %i\twhales: %i\n", i,
                words.count-1, me.count-1, whales.count-1);
        fstr_list_free(for_the_title);
        fstr_list_free(me);
        fstr_list_free(whales);
        fstr_list_free(words);
    }
    fstr_list_free(chapters);
    fstr_free(fstr);
}
To give you incentive to try the program, I won’t reprint the results in detail. But I will give some notes, which generally point to how hard it would be for Mr. Melville to publish or even blog the book here in the modern day:
Chapter lengths range by an order of magnitude.
Whales don’t get discussed all that much until around Chapter 30.
The narrator decidedly has a voice. Even in the famed cetology chapter, he uses the first person singular 60 times, personalizing what would otherwise be an encyclopedia chapter.
GLib’s regex parser is a little slower than I’d hoped it’d be.
This example is an agent-based model of group membership. Agents are on a two-dimensional preference space (because we’ll plot the groups) in the square between (-1, -1) and (1, 1). At each round, each agent joins the group offering the best utility to that agent. An agent’s utility from a group is -(distance to the group’s mean position) + M*(number of members). The group’s mean position is the mean of the positions of the group’s members (excluding the agent querying the group), and M is a constant that scales how much the agents care about being in a large group relative to how much they care about the group’s mean position: if M is near zero, then the size of the group is basically irrelevant, and agents care only about proximity; as M goes to infinity, position becomes irrelevant, and only group size matters.
With some random odds, the agent will originate a new group. However, because agents are picking a new group every period, the agent may abandon that newly originated group in the next period.
The problem of reference counting is similar, and the process is roughly similar for this case:
The type definition includes an integer named counter.
The new function sets
counter = 1.
The copy function sets
counter++.
The free function queries if(--counter==0), and if yes, then
free all shared data; or else,
just leave everything as is, because we know there are still
references to the structure.
Again, as long as your changes to the structure are entirely via its interface functions, you don’t have to think about memory allocation when using the object at all.
The simulation takes almost 125 lines of code, and because I used CWEB to document it, the code files total almost double that length (where I gave some tips on reading and writing CWEB in Literate Code with CWEB). Given the literate coding style, this should be very readable; even if you’re in the habit of skipping big blocks of code, maybe give it a skim. If you have CWEB on hand, you can generate the PDF documentation and try reading it in that format.
The output from this program is intended to be piped to Gnuplot, a
plotting program that stands out for being easy to automate. Here is a
command-line script that uses a here document to pipe the given text to
Gnuplot, including a series of data points (with an e to mark the end of the series).
cat << "------" | gnuplot --persist
set xlabel "Year"
set ylabel "U.S. Presidential elections"
set yrange [0:5]
set key off
plot '-' with boxes
2000, 1
2001, 0
2002, 0
2003, 0
2004, 1
2005, 0
e
------
You can probably already picture producing commands to Gnuplot
programmatically, via a printf or two
for the plot settings, and a for loop
to output the data set. Further, sending a series of plots to Gnuplot
generates an animation sequence.
The simulation below produces an animation like this, so you can
run the simulation via ./groups |
gnuplot to display the animation on-screen. It’s hard to print
an animation, so you’ll have to run it yourself. You will see that, even
though such behavior was not programmed into the simulation, new groups
cause nearby groups to shift, producing an evenly-spaced, uniform
distribution of group positions. Political scientists have often
observed similar behavior in the space of political party positions:
when new parties enter, existing parties adjust their positions
accordingly.
Now for the header. What I call the join and exit functions might
more commonly be read as the copy and free functions. The group_s structure has a size element, which is the number of group
members—the reference count. You can see that I use Apophenia and GLib.
Notably, the groups are held in a linked list, private to the
groups.c file; maintaining that list will require
fully two lines of code, including a call to g_list_append and g_list_remove (Example 11-16).
The group_s object (groups.h):

#include <apop.h>
#include <glib.h>

typedef struct {
    gsl_vector *position;
    int id, size;
} group_s;

group_s *group_new(gsl_vector *position);
group_s *group_join(group_s *joinme, gsl_vector *position);
void group_exit(group_s *leaveme, gsl_vector *position);
group_s *group_closest(gsl_vector *position, double mb);
void print_groups();
Now for the file defining the details of the group object (shown in Example 11-17).
The group_s object (groups.w):

@ Here in the introductory material, we include the header and specify the global list of groups that the program makes use of. We'll need new/copy/free functions for each group.

@c
#include "groups.h"
GList *group_list;

@<new group@>
@<copy group@>
@<free group@>

@ The new group method is boilerplate: we |malloc| some space, fill the struct using designated initializers, and append the newly-formed group to the list.

@<new group@>=
group_s *group_new(gsl_vector *position){
    static int id=0;
    group_s *out = malloc(sizeof(group_s));
    *out = (group_s){.position=apop_vector_copy(position), .id=id++, .size=1};
    group_list = g_list_append(group_list, out);
    return out;
}

@ When an agent joins a group, the group is `copied' to the agent, but there isn't any memory being copied: the group is simply modified to accommodate the new person. We have to increment the reference count, which is easy enough, and then modify the mean position. If the mean position without the $n$th person is $P_{n-1}$, and the $n$th person is at position $p$, then the new mean position with the person, $P_n$, is the weighted sum
$$P_n = \left((n-1)P_{n-1}/n\right) + p/n.$$
We calculate that for each dimension.

@<copy group@>=
group_s *group_join(group_s *joinme, gsl_vector *position){
    int n = ++joinme->size; //increment the reference count
    for (int i=0; i< joinme->position->size; i++){
        joinme->position->data[i] *= (n-1.)/n;
        joinme->position->data[i] += position->data[i]/n;
    }
    return joinme;
}

@ The `free' function really only frees the group when the reference count is zero. When it isn't, then we need to run the data-augmenting formula for the mean in reverse to remove a person.

@<free group@>=
void group_exit(group_s *leaveme, gsl_vector *position){
    int n = leaveme->size--; //lower the reference count
    for (int i=0; i< leaveme->position->size; i++){
        leaveme->position->data[i] -= position->data[i]/n;
        leaveme->position->data[i] *= n/(n-1.);
    }
    if (leaveme->size == 0){ //garbage collect?
        gsl_vector_free(leaveme->position);
        group_list = g_list_remove(group_list, leaveme);
        free(leaveme);
    }
}

@ I played around a lot with different rules for how exactly people evaluate the distance to the groups. In the end, I wound up using the $L_3$ norm. The standard distance is the $L_2$ norm, aka Euclidian distance, meaning that the distance between $(x_1, y_1)$ and $(x_2, y_2)$ is $\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$. This is $L_3$, $\sqrt[3]{(x_1-x_2)^3+(y_1-y_2)^3}$. This and the call to |apop_vector_copy| above are the only calls to the Apophenia library; you could write around them if you don't have that library on hand.

@<distance@>=
apop_vector_distance(g->position, position, .metric='L', .norm=3)

@ By `closest', I mean the group that provides the greatest benefit, by having the smallest distance minus weighted size. Given the utility function represented by the |dist| line, this is just a simple |for| loop to find the smallest distance.

@c
group_s *group_closest(gsl_vector *position, double mass_benefit){
    group_s *fave = NULL;
    double smallest_dist = GSL_POSINF;
    for (GList *gl = group_list; gl != NULL; gl = gl->next){
        group_s *g = gl->data;
        double dist = @<distance@> - mass_benefit*g->size;
        if (dist < smallest_dist){
            smallest_dist = dist;
            fave = g;
        }
    }
    return fave;
}

@ Gnuplot is automation-friendly. Here we get an animated simulation with four lines of plotting code. The header |plot '-'| tells the system to plot the data to follow, then we print the $(X, Y)$ positions, one to a line. The final |e| indicates the end of the data set. The main program will set some initial Gnuplot settings.

@c
void print_groups(){
    printf("plot '-' with points pointtype 6\n");
    for (GList *gl = group_list; gl != NULL; gl = gl->next)
        apop_vector_print(((group_s*)gl->data)->position);
    printf("e\n");
}
Now that we have a group object and interface functions to add, join, and leave groups, the program file can focus on the simulation procedure: defining the array of persons followed by the main loop of rechecking memberships and printing out (Example 11-18).
The simulation procedure (groupabm.w):

@* Initializations.

@ This is the part of the agent-based model with the handlers for the |people| structures and the procedure itself. At this point all interface with the groups happens via the new/join/exit functions from |groups.cweb.c|. Thus, there is zero memory management code in this file--the reference counting guarantees us that when the last member exits a group, the group will be freed.

@c
#include "groups.h"

int pop=2000, periods=200, dimension=2;

@ In |main|, we'll initialize a few constants that we can't have as static variables because they require math.

@<set up more constants@>=
double new_group_odds = 1./pop, mass_benefit = .7/pop;
gsl_rng *r = apop_rng_alloc(1234);

@* The |person_s| structure.

@ The people in this simulation are pretty boring: they do not die, and do not move. So the struct that represents them is simple, with just |position| and a pointer to the group of which the agent is currently a member.

@c
typedef struct {
    gsl_vector *position;
    group_s *group;
} person_s;

@ The setup routine is also boring, and consists of allocating a uniform random vector in two dimensions.

@c
person_s person_setup(gsl_rng *r){
    gsl_vector *posn = gsl_vector_alloc(dimension);
    for (int i=0; i< dimension; i++)
        gsl_vector_set(posn, i, 2*gsl_rng_uniform(r)-1);
    return (person_s){.position=posn};
}

@* Group membership.

@ At the outset of this function, the person leaves its group. Then, the decision is only whether to form a new group or join an existing one.

@c
void check_membership(person_s *p, gsl_rng *r,
                      double mass_benefit, double new_group_odds){
    group_exit(p->group, p->position);
    p->group = (gsl_rng_uniform(r) < new_group_odds)
                ? @<form a new group@>
                : @<join the closest group@>;
}

@ @<form a new group@>=
group_new(p->position)

@ @<join the closest group@>=
group_join(group_closest(p->position, mass_benefit), p->position)

@* Setting up.

@ The initialization of the population. Using CWEB's macros, it is at this point self-documenting.

@c
void init(person_s *people, int pop, gsl_rng *r){
    @<position everybody@>
    @<start with ten groups@>
    @<everybody joins a group@>
}

@ @<position everybody@>=
for (int i=0; i< pop; i++)
    people[i] = person_setup(r);

@ The first ten people in our list form new groups, but because everybody's position is random, this is assigning the ten groups at random.

@<start with ten groups@>=
for (int i=0; i< 10; i++)
    people[i].group = group_new(people[i].position);

@ @<everybody joins a group@>=
for (int i=10; i< pop; i++)
    people[i].group = group_join(people[i%10].group, people[i].position);

@* Plotting with Gnuplot.

@ This is the header for Gnuplot. I arrived at it by playing around on Gnuplot's command line, then writing down my final picks for settings here.

@<the Gnuplot header@>=
printf("unset key;set xrange [-1:1]\nset yrange [-1:1]\n");

@ Gnuplot animation simply consists of sending a sequence of plot statements.

@<plot one animation frame@>=
print_groups();

@* |main|.

@ The |main| routine consists of a few setup steps, and a simple loop: calculate a new state, then plot it.

@c
int main(){
    @<set up more constants@>
    person_s people[pop];
    init(people, pop, r);
    @<the Gnuplot header@>
    for (int t=0; t< periods; t++){
        for (int i=0; i< pop; i++)
            check_membership(&people[i], r, mass_benefit, new_group_odds);
        @<plot one animation frame@>
    }
}
[18] “I once attended a Java user group meeting where James Gosling (Java’s inventor) was the featured speaker. During the memorable Q&A session, someone asked him: ‘If you could do Java over again, what would you change?’ ‘I’d leave out classes,’ he replied.” —Allen Holub, Why extends is evil
[19] Originally at http://www.joelonsoftware.com/articles/Wrong.html; reprinted in [Spolsky 2008].
[20] The discussion is from David Keaton, “Clarifications to Anonymous Structures and Unions”, WG14/N1549, 22 December 2010, voted on and adopted by the Committee in March of 2011.