Before we can start manipulating data with Node.js, we have to get it. The data we’ll be using comes from Project Gutenberg, which is dedicated to making public-domain works available as free ebooks.[34]
Project Gutenberg produces catalog download bundles that contain Resource Description Framework (RDF) files for each of its 53,000-plus books. (The RDF files in the catalog are serialized as XML.) The bz2-compressed version of the catalog file is about 40 MB. Fully extracted, it contains a little over 1 GB of RDF files.
To begin, create two sibling directories on your machine, called databases and data.
| | $ mkdir databases |
| | $ mkdir data |
The databases project directory will hold all of the programs and configuration files you’ll be developing in this chapter. Unless otherwise specified, run the commands in this chapter from a terminal in this directory.
The data directory will hold the raw data files that we’ll be working with. If you want to put this directory somewhere else for storage reasons, that’s fine, but the examples in this chapter will assume that it’s a sibling of your databases project directory, so modify any paths accordingly.
With that out of the way, open a terminal to your data directory and run the following commands:
| | $ cd data |
| | $ curl -O http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2 |
| | $ tar -xvjf rdf-files.tar.bz2 |
| | x cache/epub/0/pg0.rdf |
| | x cache/epub/1/pg1.rdf |
| | x cache/epub/10/pg10.rdf |
| | ... |
| | x cache/epub/9998/pg9998.rdf |
| | x cache/epub/9999/pg9999.rdf |
| | x cache/epub/999999/pg999999.rdf |
This will create a cache directory that contains all the RDF files. Each RDF file is named after its Project Gutenberg ID and contains the metadata about one book. For example, book number 132 is Lionel Giles’s 1910 translation of The Art of War, by Sunzi.
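The naming convention is regular enough to compute. As a minimal sketch, assuming the `cache/epub/<id>/pg<id>.rdf` layout shown in the tar output above, the following maps a book ID to its RDF file path and back. The helper names `rdfPathForId` and `idFromRdfPath` are hypothetical, introduced here only for illustration.

```javascript
'use strict';
const path = require('path');

// Hypothetical helper: build the relative path of a book's RDF file,
// assuming the cache/epub/<id>/pg<id>.rdf layout seen in the tar output.
function rdfPathForId(id) {
  if (!Number.isInteger(id) || id < 0) {
    throw new TypeError(`Expected a non-negative integer ID, got ${id}`);
  }
  // path.posix keeps forward slashes regardless of the host platform.
  return path.posix.join('cache', 'epub', String(id), `pg${id}.rdf`);
}

// Hypothetical helper: recover the numeric ID from an RDF file path,
// or null if the path does not end in pg<digits>.rdf.
function idFromRdfPath(rdfPath) {
  const match = /pg(\d+)\.rdf$/.exec(rdfPath);
  return match ? Number(match[1]) : null;
}

console.log(rdfPathForId(132));                         // cache/epub/132/pg132.rdf
console.log(idFromRdfPath('cache/epub/132/pg132.rdf')); // 132
```

The returned path is relative to the data directory, so it would need to be joined with wherever you chose to put that directory.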
Here’s a very stripped-down excerpt from cache/epub/132/pg132.rdf that shows only the fields that we care about and some surrounding detail:
| | <rdf:RDF> |
| |   <pgterms:ebook rdf:about="ebooks/132"> |
| |     <dcterms:title>The Art of War</dcterms:title> |
| |     <pgterms:agent rdf:about="2009/agents/4349"> |
| |       <pgterms:name>Sunzi, active 6th century B.C.</pgterms:name> |
| |     </pgterms:agent> |
| |     <pgterms:agent rdf:about="2009/agents/5101"> |
| |       <pgterms:name>Giles, Lionel</pgterms:name> |
| |     </pgterms:agent> |
| |     <dcterms:subject> |
| |       <rdf:Description rdf:nodeID="N26bb21da0c924e5abcd5809a47f231e7"> |
| |         <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/> |
| |         <rdf:value>Military art and science -- Early works to 1800</rdf:value> |
| |       </rdf:Description> |
| |     </dcterms:subject> |
| |     <dcterms:subject> |
| |       <rdf:Description rdf:nodeID="N269948d6ecf64b6caf1c15139afd375b"> |
| |         <rdf:value>War -- Early works to 1800</rdf:value> |
| |         <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/> |
| |       </rdf:Description> |
| |     </dcterms:subject> |
| |   </pgterms:ebook> |
| | </rdf:RDF> |
The important pieces of information that we’d like to extract are the book’s Gutenberg ID (132), its title, its list of authors (the agents), and its list of subjects.
Ideally, we’d like to have all of this information formatted as a JSON document suitable for passing in to a document database. For this particular book, our desired JSON would be this:
| | { |
| |   "id": 132, |
| |   "title": "The Art of War", |
| |   "authors": [ |
| |     "Sunzi, active 6th century B.C.", |
| |     "Giles, Lionel" |
| |   ], |
| |   "subjects": [ |
| |     "Military art and science -- Early works to 1800", |
| |     "War -- Early works to 1800" |
| |   ] |
| | } |
To get this nice JSON representation, we’ll have to parse the RDF file. Along the way, this gives us a great opportunity to explore the behavior-driven development (BDD) pattern.
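To make the target concrete, here is a rough sketch of the transformation, assuming only the simplified excerpt shown earlier. It uses regular expressions as a stand-in for real XML parsing, which is fragile and is offered only to illustrate the input-to-output mapping. The function names `allMatches` and `parseRdfSketch` are hypothetical.

```javascript
'use strict';

// The simplified excerpt from cache/epub/132/pg132.rdf, inlined for the sketch.
const rdf = `
<rdf:RDF>
  <pgterms:ebook rdf:about="ebooks/132">
    <dcterms:title>The Art of War</dcterms:title>
    <pgterms:agent rdf:about="2009/agents/4349">
      <pgterms:name>Sunzi, active 6th century B.C.</pgterms:name>
    </pgterms:agent>
    <pgterms:agent rdf:about="2009/agents/5101">
      <pgterms:name>Giles, Lionel</pgterms:name>
    </pgterms:agent>
    <dcterms:subject>
      <rdf:Description rdf:nodeID="N26bb21da0c924e5abcd5809a47f231e7">
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
        <rdf:value>Military art and science -- Early works to 1800</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description rdf:nodeID="N269948d6ecf64b6caf1c15139afd375b">
        <rdf:value>War -- Early works to 1800</rdf:value>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
      </rdf:Description>
    </dcterms:subject>
  </pgterms:ebook>
</rdf:RDF>`;

// Collect the first capture group of every match of a global regex.
function allMatches(regex, text) {
  return Array.from(text.matchAll(regex), (m) => m[1]);
}

// Hypothetical sketch: extract id, title, authors, and subjects.
// Assumption: every <rdf:value> in the input is a subject heading, which
// holds for this excerpt but may not hold for a complete RDF file.
function parseRdfSketch(text) {
  const idMatch = /<pgterms:ebook rdf:about="ebooks\/(\d+)"/.exec(text);
  const titleMatch = /<dcterms:title>([^<]*)<\/dcterms:title>/.exec(text);
  return {
    id: idMatch ? Number(idMatch[1]) : null,
    title: titleMatch ? titleMatch[1] : null,
    authors: allMatches(/<pgterms:name>([^<]*)<\/pgterms:name>/g, text),
    subjects: allMatches(/<rdf:value>([^<]*)<\/rdf:value>/g, text),
  };
}

// Prints the JSON document for book 132.
console.log(JSON.stringify(parseRdfSketch(rdf), null, 2));
```

A pattern-matching approach like this breaks on XML entities, attribute reordering, and nested elements, which is why a real XML parser is the better tool for the full files.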