Node.js 8 the Right Way

Extracting Data from XML with Cheerio

At this point, you should have two successfully passing tests in test/parse-rdf-test.js that are powered by your module in lib/parse-rdf.js. In this section, we’ll expand the tests to cover all of our requirements for parsing Project Gutenberg RDF files, and implement the library code to make them pass.

To extract the data attributes we desire, we’ll need to parse the RDF (XML) file. As with everything in the Node.js ecosystem, there are multiple valid approaches to parsing, navigating, and querying XML files.

Let’s discuss some of the options, then move on to installing and using Cheerio.

Considering XML Data Extraction Options

In this chapter, we will be treating the RDF files like regular, undifferentiated XML files for parsing and for data extraction. The benefit to you (as opposed to addressing them specifically as RDF/XML) is that the skills and techniques you learn will transfer to parsing other kinds of XML and HTML documents.

For situations like this, where the documents are relatively small, I prefer to use Cheerio, a fast Node.js module that provides a jQuery-like API for working with HTML and XML documents.^[36] Cheerio’s big advantage is that it offers a convenient way to use CSS selectors to dig into the document, without the overhead of setting up a browser-like environment.

Cheerio isn’t the only DOM-like XML parser for Node.js—far from it. Other popular options include xmldom and jsdom,^[37] ^[38] both of which are based on the W3C’s DOM specification.

In your own projects, if the XML files that you’re working with are quite large, then you’re probably going to want a streaming SAX parser instead. SAX, which stands for Simple API for XML, treats XML as a stream of tokens that your program digests in sequence. Unlike a DOM parser, which considers the document as a whole, a SAX parser operates on only a small piece at a time.

Compared to DOM parsers, SAX parsers can be quite fast and memory-efficient. But the downside of using a SAX parser is that your program will have to keep track of the document structure in flight. I’ve had good experiences using the sax Node.js module for parsing large XML files.^[39]
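
To make "keeping track of the document structure in flight" concrete, here is a minimal sketch. It does not use the sax module: the events array is a hand-written, hypothetical stand-in for the open-tag, text, and close-tag callbacks a streaming parser would fire, and collectNames is an illustrative helper, not part of any library.

```javascript
'use strict';

// Hypothetical stand-in for the events a streaming parser would emit
// while reading two <pgterms:agent> elements.
const events = [
  { type: 'open', name: 'pgterms:agent' },
  { type: 'open', name: 'pgterms:name' },
  { type: 'text', value: 'Sunzi, active 6th century B.C.' },
  { type: 'close', name: 'pgterms:name' },
  { type: 'close', name: 'pgterms:agent' },
  { type: 'open', name: 'pgterms:agent' },
  { type: 'open', name: 'pgterms:name' },
  { type: 'text', value: 'Giles, Lionel' },
  { type: 'close', name: 'pgterms:name' },
  { type: 'close', name: 'pgterms:agent' },
];

// Collect the text of every <pgterms:name> nested inside a <pgterms:agent>.
const collectNames = parserEvents => {
  const stack = []; // Names of the currently open tags, outermost first.
  const names = [];

  for (const event of parserEvents) {
    if (event.type === 'open') {
      stack.push(event.name);
    } else if (event.type === 'close') {
      stack.pop();
    } else if (event.type === 'text') {
      const inName = stack[stack.length - 1] === 'pgterms:name';
      const inAgent = stack.includes('pgterms:agent');
      if (inName && inAgent) {
        names.push(event.value);
      }
    }
  }

  return names;
};

const saxNames = collectNames(events);
console.log(saxNames);
```

The stack is the bookkeeping that a DOM parser would otherwise do for you. The parser itself never tells you where in the tree a piece of text lives.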

Speaking of RDF/XML in particular, it’s a rich data format for which custom tooling is available. If you find yourself working with linked data in the wild, you may find it more convenient to convert it to JSON for Linked Data (JSON-LD) and then perform additional operations from there.

JSON-LD is to JSON as RDF is to XML.^[40] With JSON-LD, you can express relationships between entities, not just a hierarchical structure like JSON allows. The jsonld module would be a good place to start for this.^[41]

Which of these approaches is best for you really comes down to your use case and personal taste. If your documents are large, then you’ll probably want a SAX parser. If you need to preserve the structured relationships in the data, then JSON-LD may be best. Do you need to fetch remote documents? Some modules have this capability built in (Cheerio does not).

Our task at hand is to extract a small amount of data from relatively small files that are readily available locally. I find Cheerio to be an excellent fit for this particular kind of task, and I hope you will too!

Getting Started with Cheerio

To get started with Cheerio, install it with npm and save the dependency.

$ npm install --save --save-exact cheerio@0.22.0

Please be careful with the version number here. Cheerio has not historically followed the semantic versioning convention, introducing breaking changes in minor releases. If you install any version other than 0.22.0, the examples in this book may not work.

Before we start using Cheerio, let’s create some BDD tests that we can make pass by doing so. If Mocha is not already running continuously, open a terminal to your databases project directory and run the following:

$ npm run test:watch

It should clear the screen and report two passing tests:

2 passing (44ms)

Great; now let’s require that the book object returned by parseRDF has the correct numeric ID for The Art of War. Open your parse-rdf-test.js file and expand the second test by adding a check that the book object has an id property containing the number 132.

databases/test/parse-rdf-test.js

	it('should parse RDF content', () => {
	  const book = parseRDF(rdf);
	  expect(book).to.be.an('object');
	  expect(book).to.have.a.property('id', 132);
	});

This code takes advantage of Chai’s sentence-like BDD API, which we’ll use in increasing doses as we add more tests.

Since we have not yet implemented the code to include the id in the returned book object, as soon as you save the file, your Mocha terminal should report this:

	1 passing (4ms)
	1 failing

	1) parseRDF should parse RDF content:
	AssertionError: expected {} to have a property 'id'
	at Context.it (test/parse-rdf-test.js:32:28)

Good! The test is failing exactly as we expect it should.

Now it’s time to use Cheerio to pull out the four fields we want: the book’s ID, the title, the authors, and the subjects.

Reading Data from an Attribute

The first piece of information we hope to extract using Cheerio is the book’s ID. Recall that we’re trying to grab the number 132 out of this XML tag:

<pgterms:ebook rdf:about="ebooks/132">

Open your lib/parse-rdf.js file and make it look like the following:

databases/lib/parse-rdf.js

	'use strict';
	const cheerio = require('cheerio');

	module.exports = rdf => {
	  const $ = cheerio.load(rdf);

	  const book = {};

	  book.id = +$('pgterms\\:ebook').attr('rdf:about').replace('ebooks/', '');

	  return book;
	};

This code adds three things to the version listed in Enabling Continuous Testing with Mocha:

At the top, we now require Cheerio.
Inside the exported function, we use Cheerio’s load method to parse the rdf content. The $ function that’s returned is very much like jQuery’s $ function.
Using Cheerio’s APIs, we extract the book’s ID and, finally, format it.

The line where we set book.id is fairly dense, so let’s break it down. Here’s the same line, but split out and commented so we can dissect it:

	book.id =                     // Set the book's id.
	  +                           // Unary plus casts the result as a number.
	  $('pgterms\\:ebook')        // Query for the <pgterms:ebook> tag.
	    .attr('rdf:about')        // Get the value of the rdf:about attribute.
	    .replace('ebooks/', '');  // Strip off the leading 'ebooks/' substring.

In CSS, the colon character (:) has special meaning: it introduces pseudo selectors like :hover for links that are hovered over. In our case, we need a literal colon character for the <pgterms:ebook> tag name, so we have to escape it with a backslash. But since the backslash is a special character in JavaScript string literals, that too needs to be escaped. Thus, our query selector for finding the tag is pgterms\\:ebook.
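
If the double backslash looks suspicious, a quick experiment in plain JavaScript (no Cheerio needed) shows what the selector engine actually receives. The variable names here are only for illustration.

```javascript
'use strict';

// The JavaScript source contains two backslashes...
const selector = 'pgterms\\:ebook';

// ...but the resulting string holds only one, which is what the CSS
// selector engine sees.
console.log(selector);        // pgterms\:ebook
console.log(selector.length); // 14

// String.raw leaves backslashes alone, so a single one suffices here.
const rawSelector = String.raw`pgterms\:ebook`;
console.log(rawSelector === selector); // true
```

Whether you write the escaped literal or use String.raw is a matter of taste. The examples in this chapter stick with the escaped literal.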

Once we have selected the pgterms:ebook tag, we pull out the rdf:about attribute value and strip off the leading ebooks/ substring, leaving only the string "132". The leading unary plus (+) at the start of the line ensures that this gets cast as a number.
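
You can try the string-handling steps on a plain string, without Cheerio. The sketch below also shows a caveat: Cheerio's attr method returns undefined when the attribute is missing, and calling replace on undefined throws a TypeError. The parseEbookId helper is hypothetical, one possible way to guard against that, and not part of this chapter's parse-rdf.js.

```javascript
'use strict';

const about = 'ebooks/132';

// Stripping the prefix still leaves a string.
const stripped = about.replace('ebooks/', '');
console.log(typeof stripped, stripped); // string 132

// Unary plus converts that string to a number.
const id = +stripped;
console.log(typeof id, id); // number 132

// Hypothetical guard: return null instead of throwing or yielding NaN
// when the attribute value is missing or malformed.
const parseEbookId = value => {
  if (typeof value !== 'string') {
    return null;
  }
  const parsed = Number(value.replace('ebooks/', ''));
  return Number.isInteger(parsed) ? parsed : null;
};

console.log(parseEbookId('ebooks/132')); // 132
console.log(parseEbookId(undefined));    // null
console.log(parseEbookId('ebooks/abc')); // null
```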

If all has gone well so far, your terminal running Mocha’s continuous testing should again read 2 passing.

Reading the Text of a Node

Next, let’s add a test for the title of the book. Insert the following code right after the test for the book’s ID.

databases/test/parse-rdf-test.js

expect(book).to.have.a.property('title', 'The Art of War');

Your continuous testing terminal should read as follows:

	1 passing (3ms)
	1 failing

	1) parseRDF should parse RDF content:
	AssertionError: expected { id: 132 } to have a property 'title'
	at Context.it (test/parse-rdf-test.js:35:28)

Now let’s grab the title and add it to the returned book object. Recall that the XML containing the title looks like this:

<dcterms:title>The Art of War</dcterms:title>

Getting this content is even easier than extracting the ID. Add the following to your parse-rdf.js file, after the line where we set book.id:

databases/lib/parse-rdf.js

book.title = $('dcterms\\:title').text();

Using Cheerio, we select the tag named dcterms:title and save its text content to the book.title property. Once you save this file, your tests should pass again.

Collecting an Array of Values

Moving on, let’s add tests for the array of book authors. Open your parse-rdf-test.js file and add these lines:

databases/test/parse-rdf-test.js

	expect(book).to.have.a.property('authors')
	  .that.is.an('array').with.lengthOf(2)
	  .and.contains('Sunzi, active 6th century B.C.')
	  .and.contains('Giles, Lionel');

Here we really start to see the expressive power of Chai assertions. This line of code reads almost like an English sentence.

Expect book to have a property called authors that is an array of length two and contains “Sunzi, active 6th century B.C.” and “Giles, Lionel”.

In Chai’s language-chaining model, words like and, that, and which are largely interchangeable. This lets you write clauses like .and.contains('X') or .that.contains('X'), depending on which version reads better in your test case.

Once you save this change, your continuous testing terminal should again report a test failure:

	1 passing (11ms)
	1 failing

	1) parseRDF should parse RDF content:
	AssertionError: expected { id: 132, title: 'The Art of War' } to have a
	property 'authors'
	at Context.it (test/parse-rdf-test.js:39:28)

To make the test pass, recall that we will need to pull out the content from these tags:

	<pgterms:agent rdf:about="2009/agents/4349">
	  <pgterms:name>Sunzi, active 6th century B.C.</pgterms:name>
	</pgterms:agent>
	<pgterms:agent rdf:about="2009/agents/5101">
	  <pgterms:name>Giles, Lionel</pgterms:name>
	</pgterms:agent>

We’re looking to extract the text of each <pgterms:name> tag that’s nested inside a <pgterms:agent>. The CSS descendant selector pgterms:agent pgterms:name finds the elements we need (with the colons escaped, as before), so we can start with this:

$('pgterms\\:agent pgterms\\:name')

You might be tempted to grab the text straight away like this:

book.authors = $('pgterms\\:agent pgterms\\:name').text();

But unfortunately, this won’t give us what we want, because Cheerio’s text method returns a single string and we need an array of strings. Instead, add the following code to your parse-rdf.js file, after the book.title piece, to correctly extract the authors:

databases/lib/parse-rdf.js

	book.authors = $('pgterms\\:agent pgterms\\:name')
	  .toArray().map(elem => $(elem).text());

Calling Cheerio’s .toArray method converts the collection object into a true JavaScript Array. This allows us to use the native map method to create a new array by calling the provided function on each element and grabbing the returned value.

Unfortunately, the collection of objects that comes out of toArray doesn’t consist of Cheerio-wrapped objects, but rather document nodes. To extract the text using Cheerio’s text, we need to wrap each node with the $ function, then call text on it. The resulting mapping function is elem => $(elem).text().

Traversing the Document

Finally, we’re down to just one more piece of information we wanted to pull from the RDF file—the list of subjects.

	<dcterms:subject>
	  <rdf:Description rdf:nodeID="N26bb21da0c924e5abcd5809a47f231e7">
	    <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
	    <rdf:value>Military art and science -- Early works to 1800</rdf:value>
	  </rdf:Description>
	</dcterms:subject>

	<dcterms:subject>
	  <rdf:Description rdf:nodeID="N269948d6ecf64b6caf1c15139afd375b">
	    <rdf:value>War -- Early works to 1800</rdf:value>
	    <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
	  </rdf:Description>
	</dcterms:subject>

As with previous examples, let’s start by adding a test. Insert the following code into your parse-rdf-test.js after the other tests.

databases/test/parse-rdf-test.js

	expect(book).to.have.a.property('subjects')
	  .that.is.an('array').with.lengthOf(2)
	  .and.contains('Military art and science -- Early works to 1800')
	  .and.contains('War -- Early works to 1800');

Unfortunately, these subjects are a little trickier to pull out than the authors were. It would be nice if we could use the tag structure to craft a simple CSS selector like this:

$('dcterms\\:subject rdf\\:value')

However, this selector would match another tag in the document, which we don’t want.

	<dcterms:subject>
	  <rdf:Description rdf:nodeID="Nfb797557d91f44c9b0cb80a0d207eaa5">
	    <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCC"/>
	    <rdf:value>U</rdf:value>
	  </rdf:Description>
	</dcterms:subject>

To spot the difference, look at the <dcam:memberOf> tags’ rdf:resource URLs. The ones we want end in LCSH, which stands for Library of Congress Subject Headings.^[42] These headings are a collection of rich indexing terms used in bibliographic records.

Contrast that with the tag we don’t want to match, which ends in LCC. This stands for Library of Congress Classification.^[43] These are codes that divide all knowledge into 21 top-level classes (like U for Military Science) with many subclasses. These could be interesting in the future, but right now we only want the Subject Headings.

With your continuous test still failing, here’s the code you can add to your parse-rdf.js to make it pass:

databases/lib/parse-rdf.js

	book.subjects = $('[rdf\\:resource$="/LCSH"]')
	  .parent().find('rdf\\:value')
	  .toArray().map(elem => $(elem).text());

Let’s break this down. First, we select the <dcam:memberOf> tags of interest with the CSS selector [rdf\:resource$="/LCSH"]. The brackets introduce a CSS attribute selector, and the $= indicates that we want elements whose rdf:resource attribute ends with /LCSH.

Next, we use Cheerio’s .parent method to traverse up to our currently selected elements’ parents. In this case, those are the <rdf:Description> tags. Then we traverse back down using .find to locate all of their <rdf:value> tags.

Lastly, just like with the book authors, we convert the Cheerio selection object into a true Array and use .map to get each element’s text. And that’s it! At this point your tests should be passing, meaning your parseRDF function is correctly extracting the data we want.
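
The ends-with matching in the first step of that breakdown behaves much like JavaScript's own String.prototype.endsWith. As a rough plain-JavaScript analogy (not how Cheerio implements it), here is the same test applied to the two rdf:resource URLs we saw above:

```javascript
'use strict';

// The rdf:resource values that appear on the <dcam:memberOf> tags.
const resources = [
  'http://purl.org/dc/terms/LCSH',
  'http://purl.org/dc/terms/LCC',
];

// [rdf\:resource$="/LCSH"] keeps only attribute values ending in /LCSH.
const matches = resources.filter(url => url.endsWith('/LCSH'));
console.log(matches); // [ 'http://purl.org/dc/terms/LCSH' ]
```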

Anticipating Format Changes

One quick note before we move on—an older version of the Project Gutenberg RDF format had its subjects listed like this:

	<dcterms:subject>
	  <rdf:Description>
	    <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
	    <rdf:value>Military art and science -- Early works to 1800</rdf:value>
	    <rdf:value>War -- Early works to 1800</rdf:value>
	  </rdf:Description>
	</dcterms:subject>

Instead of finding each subject’s <rdf:value> living in its own <dcterms:subject> tag, we find them bunched together under a single one. Now consider the traversal code we just wrote. By finding the /LCSH tag, going up to its parent <rdf:Description>, and then searching down for <rdf:value> tags, our code would work with both this earlier data format and the current one (at the time of this writing, anyway).

Whenever you work with third-party data, there’s a chance that it could change over time. When it does, your code may or may not continue to work as expected. There’s no hard and fast rule to tell you when to be more or less specific with your data-processing code, but I encourage you to stay vigilant to these kinds of issues in your work.

The beauty of testing in these scenarios is that when a data format changes, you can add more tests. This gives you confidence that you’re meeting the new demands of the updated data format while still honoring past data.

Recapping Data Extraction with Cheerio

After all of the incremental additions of the last several sections, here’s what your final parse-rdf-test.js should look like:

databases/test/parse-rdf-test.js

	'use strict';

	const fs = require('fs');
	const expect = require('chai').expect;
	const parseRDF = require('../lib/parse-rdf.js');

	const rdf = fs.readFileSync(`${__dirname}/pg132.rdf`);

	describe('parseRDF', () => {
	  it('should be a function', () => {
	    expect(parseRDF).to.be.a('function');
	  });

	  it('should parse RDF content', () => {
	    const book = parseRDF(rdf);

	    expect(book).to.be.an('object');
	    expect(book).to.have.a.property('id', 132);
	    expect(book).to.have.a.property('title', 'The Art of War');

	    expect(book).to.have.a.property('authors')
	      .that.is.an('array').with.lengthOf(2)
	      .and.contains('Sunzi, active 6th century B.C.')
	      .and.contains('Giles, Lionel');

	    expect(book).to.have.a.property('subjects')
	      .that.is.an('array').with.lengthOf(2)
	      .and.contains('Military art and science -- Early works to 1800')
	      .and.contains('War -- Early works to 1800');
	  });
	});

And here’s the parse-rdf.js itself:

databases/lib/parse-rdf.js

	'use strict';
	const cheerio = require('cheerio');

	module.exports = rdf => {
	  const $ = cheerio.load(rdf);

	  const book = {};

	  book.id = +$('pgterms\\:ebook').attr('rdf:about').replace('ebooks/', '');

	  book.title = $('dcterms\\:title').text();

	  book.authors = $('pgterms\\:agent pgterms\\:name')
	    .toArray().map(elem => $(elem).text());

	  book.subjects = $('[rdf\\:resource$="/LCSH"]')
	    .parent().find('rdf\\:value')
	    .toArray().map(elem => $(elem).text());

	  return book;
	};

Using this, we can now quickly put together a command-line program to explore some of the other RDF files. Open your editor and enter this:

databases/rdf-to-json.js

	#!/usr/bin/env node
	const fs = require('fs');
	const parseRDF = require('./lib/parse-rdf.js');
	const rdf = fs.readFileSync(process.argv[2]);
	const book = parseRDF(rdf);
	console.log(JSON.stringify(book, null, ' '));

Save this file as rdf-to-json.js in your databases project directory. This program simply takes the name of an RDF file, reads its contents, parses them, and then prints the resulting JSON to standard output.

Previously when calling JSON.stringify, we passed only one argument, the object to be serialized. Here we’re passing three arguments to get a prettier output. The second argument (null) is an optional replacer that can be used for filtering (this is almost never used in practice). The last argument (' ') is the string used to indent each level of nesting, making the output more human-readable.
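
Here is a short comparison of the argument forms, using a trimmed-down book object invented for the example:

```javascript
'use strict';

const book = { id: 11, authors: ['Carroll, Lewis'] };

// One argument: compact, single-line output.
const compact = JSON.stringify(book);
console.log(compact); // {"id":11,"authors":["Carroll, Lewis"]}

// Three arguments: no replacer, one space of indentation per level.
const pretty = JSON.stringify(book, null, ' ');
console.log(pretty);

// An array replacer lists the property names to keep.
const idOnly = JSON.stringify(book, ['id']);
console.log(idOnly); // {"id":11}
```

A number also works as the third argument. For example, 2 indents each level by two spaces.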

Let’s try it! Open a terminal to your databases project directory and run this:

	$ node rdf-to-json.js ../data/cache/epub/11/pg11.rdf
	{
	 "id": 11,
	 "title": "Alice's Adventures in Wonderland",
	 "authors": [
	  "Carroll, Lewis"
	 ],
	 "subjects": [
	  "Fantasy"
	 ]
	}

If you see this, great! It’s time to start performing these conversions in bulk.
