By now your lib/parse-rdf.js is a robust module that can reliably convert RDF content into JSON documents. All that remains is to walk through the Project Gutenberg catalog directory and collect all the JSON documents.
More concretely, we need to find every RDF file under the catalog directory, run each one through parseRDF, and write out the resulting JSON documents in a format the database can import in bulk.
The NoSQL database we’ll be using is Elasticsearch, a document datastore that indexes JSON objects. Soon, in Chapter 6, Commanding Databases, we’ll dive deep into Elasticsearch and how to effectively use it with Node.js. You’ll learn how to install it, configure it, and make the most of its HTTP-based APIs.
For now, though, our focus is just on transforming the Gutenberg data into an intermediate form for bulk import.
Conveniently, Elasticsearch has a bulk API that lets you pull in many records at once. Although we could insert documents one at a time, using the bulk API is significantly faster.
The format of the file we need to create is described on Elasticsearch’s Bulk API page.[44] It’s an LDJ file consisting of actions and the source objects on which to perform each action.
In our case, we’re performing index operations—that is, inserting new documents into an index. Each source object is the book object returned by parseRDF. Here’s an example of an action followed by its source object:
| | {"index":{"_id":"pg11"}} |
| | {"id":11,"title":"Alice's Adventures in Wonderland","authors":...} |
And here’s another one:
| | {"index":{"_id":"pg132"}} |
| | {"id":132,"title":"The Art of War","authors":...} |
In each case, an action is a JSON object on a line by itself, and the source object is another JSON object on the next line. Elasticsearch’s bulk API allows you to chain any number of these together like so:
| | {"index":{"_id":"pg11"}} |
| | {"id":11,"title":"Alice's Adventures in Wonderland","authors":...} |
| | {"index":{"_id":"pg132"}} |
| | {"id":132,"title":"The Art of War","authors":...} |
The _id field of each index operation is the unique identifier that Elasticsearch will use for the document. Here I’ve chosen to use the string pg followed by the Project Gutenberg ID. This way, if we ever want to store documents from another source in the same index, their IDs shouldn’t collide with the Project Gutenberg book data.
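To make the convention concrete, here’s a minimal sketch of how one action/source pair could be produced, where doc stands for the object returned by parseRDF:
| | // Sketch only: emit the action line, then the source document, for one book. |
| | const action = { index: { _id: `pg${doc.id}` } }; |
| | console.log(JSON.stringify(action)); // e.g. {"index":{"_id":"pg11"}} |
| | console.log(JSON.stringify(doc)); // The source object goes on the next line. |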
To find and open each of the RDF files under the data/cache/epub directory, we’ll use a module called node-dir. Install and save it as usual:
| | $ npm install --save --save-exact node-dir@0.1.16 |
This module comes with a handful of useful methods for walking a directory tree. The method we’ll use is readFiles, which sequentially operates on files as it encounters them while walking a directory tree.
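To get a feel for readFiles before we use it, here’s a minimal, hypothetical sketch; the directory name and match pattern are made up for illustration. The file callback can optionally receive the filename, and an optional final callback runs once the walk completes:
| | 'use strict'; |
| | const dir = require('node-dir'); |
| | |
| | // Sketch only: visit every matching file under a (hypothetical) directory. |
| | dir.readFiles('./some-directory', { match: /\.txt$/ }, |
| |   (err, content, filename, next) => { |
| |     if (err) throw err; |
| |     console.log(`${filename}: ${content.length} characters`); |
| |     next(); // Ask node-dir to move on to the next file. |
| |   }, |
| |   (err, files) => { |
| |     if (err) throw err; |
| |     console.log(`Done. Visited ${files.length} files.`); |
| |   }); |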
Let’s use this method to find all the RDF files and send them through our RDF parser. Open a text editor and enter this:
| | 'use strict'; |
| | |
| | const dir = require('node-dir'); |
| | const parseRDF = require('./lib/parse-rdf.js'); |
| | |
| | const dirname = process.argv[2]; |
| | |
| | const options = { |
| |   match: /\.rdf$/, // Match file names that end in '.rdf'. |
| |   exclude: ['pg0.rdf'], // Ignore the template RDF file (ID = 0). |
| | }; |
| | |
| | dir.readFiles(dirname, options, (err, content, next) => { |
| |   if (err) throw err; |
| |   const doc = parseRDF(content); |
| |   console.log(JSON.stringify({ index: { _id: `pg${doc.id}` } })); |
| |   console.log(JSON.stringify(doc)); |
| |   next(); |
| | }); |
Save the file as rdf-to-bulk.js in your databases project directory. This short program walks the provided directory tree looking for files whose names end in .rdf, skipping the template RDF file, pg0.rdf.
As the program reads each file’s content, it runs it through the RDF parser and then writes out a JSON-serialized action line followed by the document itself, ready for Elasticsearch’s bulk API.
Run the program, and let’s see what it produces.
| | $ node rdf-to-bulk.js ../data/cache/epub/ | head |
If all went well, you should see 10 lines consisting of interleaved actions and documents—like the following, which has been truncated to fit on the page.
| | {"index":{"_id":"pg1"}} |
| | {"id":1,"title":"The Declaration of Independence of the United States of Ame... |
| | {"index":{"_id":"pg10"}} |
| | {"id":10,"title":"The King James Version of the Bible","authors":[],"subject... |
| | {"index":{"_id":"pg100"}} |
| | {"id":100,"title":"The Complete Works of William Shakespeare","authors":["Sh... |
| | {"index":{"_id":"pg1000"}} |
| | {"id":1000,"title":"La Divina Commedia di Dante: Complete","authors":["Dante... |
| | {"index":{"_id":"pg10000"}} |
| | {"id":10000,"title":"The Magna Carta","authors":["Anonymous"],"subjects":["M... |
Because the head command closes the pipe after echoing the first lines, Node.js may sometimes throw an exception, sending the following to the standard error stream:
| | events.js:160 |
| | throw er; // Unhandled 'error' event |
| | ^ |
| | |
| | Error: write EPIPE |
| | at exports._errnoException (util.js:1022:11) |
| | at WriteWrap.afterWrite [as oncomplete] (net.js:804:14) |
To mitigate this error, you can capture error events on the process.stdout stream. Try adding the following line to rdf-to-bulk.js and rerunning it.
| | process.stdout.on('error', err => process.exit()); |
Now, when head closes the pipe, the next attempt to use console.log will trigger the error event listener and the process will exit silently. If you’re worried about output errors other than EPIPE, you can check the err object’s code property and take action as appropriate.
| | process.stdout.on('error', err => { |
| |   if (err.code === 'EPIPE') { |
| |     process.exit(); |
| |   } |
| |   throw err; // Or take any other appropriate action. |
| | }); |
At this point we’re ready to let rdf-to-bulk.js run for real. Use the following command to capture this LDJ output in a new file called bulk_pg.ldj.
| | $ node rdf-to-bulk.js ../data/cache/epub/ > ../data/bulk_pg.ldj |
This will run for quite a while as rdf-to-bulk.js traverses the epub directory, parses each file, and writes out an Elasticsearch action line followed by each parsed document. When it’s finished, the bulk_pg.ldj file should be about 11 MB.
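If you’d like a quick sanity check, you can count the file’s lines with standard shell tools. Since each book contributes exactly two lines (one action and one document), the total should be an even number; the exact count depends on how much of the catalog you downloaded.
| | $ wc -l ../data/bulk_pg.ldj |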