Node.js 8 the Right Way by Jim Wilson. Published by Pragmatic Bookshelf, 2018.

Extracting Data from XML with Cheerio

At this point, you should have two successfully passing tests in test/parse-rdf-test.js that are powered by your module in lib/parse-rdf.js. In this section, we’ll expand the tests to cover all of our requirements for parsing Project Gutenberg RDF files, and implement the library code to make them pass.

To extract the data attributes we desire, we’ll need to parse the RDF (XML) file. As with everything in the Node.js ecosystem, there are multiple valid approaches to parsing, navigating, and querying XML files.

Let’s discuss some of the options, then move on to installing and using Cheerio.

Considering XML Data Extraction Options

In this chapter, we will be treating the RDF files like regular, undifferentiated XML files for parsing and for data extraction. The benefit to you (as opposed to addressing them specifically as RDF/XML) is that the skills and techniques you learn will transfer to parsing other kinds of XML and HTML documents.

For situations like this, where the documents are relatively small, I prefer to use Cheerio, a fast Node.js module that provides a jQuery-like API for working with HTML and XML documents.[36] Cheerio’s big advantage is that it offers a convenient way to use CSS selectors to dig into the document, without the overhead of setting up a browser-like environment.

Cheerio isn’t the only DOM-like XML parser for Node.js—far from it. Other popular options include xmldom and jsdom,[37] [38] both of which are based on the W3C’s DOM specification.

In your own projects, if the XML files that you’re working with are quite large, then you’re probably going to want a streaming SAX parser instead. SAX, which stands for Simple API for XML, treats XML as a stream of tokens that your program digests in sequence. Unlike a DOM parser, which considers the document as a whole, a SAX parser operates on only a small piece at a time.

Compared to DOM parsers, SAX parsers can be quite fast and memory-efficient. But the downside of using a SAX parser is that your program will have to keep track of the document structure in flight. I’ve had good experiences using the sax Node.js module for parsing large XML files.[39]
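To make "keeping track of the document structure in flight" concrete, here is a toy sketch in plain JavaScript. It does not use the sax module or its API; a simple regular expression stands in for a real streaming tokenizer, and the collectNames function and sample string are hypothetical names invented for this illustration. The point is only that, when tokens arrive one at a time, your own code has to remember which tags are currently open.

```javascript
'use strict';

// Toy illustration only: a regex stands in for a real streaming tokenizer.
// It copes with plain open tags, close tags, and text, and nothing else
// (no comments, no CDATA, no '>' inside attribute values).
function collectNames(xml) {
  const stack = [];  // Tags currently open, i.e. our position in the tree.
  const names = [];
  const tokens = xml.match(/<[^>]+>|[^<]+/g) || [];

  for (const token of tokens) {
    if (token.startsWith('</')) {
      stack.pop();  // Leaving an element.
    } else if (token.startsWith('<')) {
      if (!token.endsWith('/>')) {
        // Entering an element: remember its tag name, minus attributes.
        stack.push(token.slice(1, -1).split(/\s/)[0]);
      }
    } else if (token.trim()) {
      // A text token carries no context; only the stack says where we are.
      const path = stack.join('/');
      if (path.endsWith('pgterms:agent/pgterms:name')) {
        names.push(token.trim());
      }
    }
  }
  return names;
}

const sample = `
<pgterms:agent rdf:about="2009/agents/4349">
  <pgterms:name>Sunzi, active 6th century B.C.</pgterms:name>
</pgterms:agent>
<pgterms:agent rdf:about="2009/agents/5101">
  <pgterms:name>Giles, Lionel</pgterms:name>
</pgterms:agent>`;

console.log(collectNames(sample));  // Logs the two author names.
```

A real SAX parser hands you the same kinds of events (tag opened, text seen, tag closed), but handles the many XML details this sketch ignores.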

Speaking of RDF/XML in particular, it’s a rich data format for which custom tooling is available. If you find yourself working with linked data in the wild, you may find it more convenient to convert it to JSON for Linked Data (JSON-LD) and then perform additional operations from there.

JSON-LD is to JSON as RDF is to XML.[40] With JSON-LD, you can express relationships between entities, not just a hierarchical structure like JSON allows. The jsonld module would be a good place to start for this.[41]
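As a rough illustration of what that means, here is a hand-written JSON-LD document for a single book, expressed as a JavaScript object. The @context, @id, and @type keywords come from the JSON-LD format, and the two purl.org URLs are Dublin Core terms. The gutenberg.org identifiers are assumptions made up for this sketch; this is not a format that Project Gutenberg is known to publish.

```javascript
'use strict';

// Hypothetical JSON-LD rendering of one book; the identifiers are assumptions.
const book = {
  '@context': {
    title: 'http://purl.org/dc/terms/title',
    // '@type': '@id' marks each creator value as a link to another entity
    // rather than a plain string.
    creator: { '@id': 'http://purl.org/dc/terms/creator', '@type': '@id' },
  },
  '@id': 'http://www.gutenberg.org/ebooks/132',
  title: 'The Art of War',
  creator: [
    'http://www.gutenberg.org/2009/agents/4349',
    'http://www.gutenberg.org/2009/agents/5101',
  ],
};

// It is still ordinary JSON, so the usual tools apply.
console.log(JSON.stringify(book, null, 2));
```

The creator entries point at other entities by identifier, which is the kind of relationship a plain nested JSON structure cannot express on its own.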

Which of these approaches is best for you really comes down to your use case and personal taste. If your documents are large, then you’ll probably want a SAX parser. If you need to preserve the structured relationships in the data, then JSON-LD may be best. Do you need to fetch remote documents? Some modules have this capability built in (Cheerio does not).

Our task at hand is to extract a small amount of data from relatively small files that are readily available locally. I find Cheerio to be an excellent fit for this particular kind of task, and I hope you will too!

Getting Started with Cheerio

To get started with Cheerio, install it with npm and save the dependency.

 $ npm install --save --save-exact cheerio@0.22.0

Please be careful with the version number here. Cheerio has not historically followed the semantic versioning convention, introducing breaking changes in minor releases. If you install any version other than 0.22.0, the examples in this book may not work.

Before we start using Cheerio, let’s write some BDD tests that Cheerio will help us pass. If Mocha is not already running continuously, open a terminal to your databases project directory and run the following:

 $ npm run test:watch

It should clear the screen and report two passing tests:

 2 passing (44ms)

Great; now let’s require that the book object returned by parseRDF has the correct numeric ID for The Art of War. Open your parse-rdf-test.js file and expand the second test by adding a check that the book object has an id property containing the number 132.

 it('should parse RDF content', () => {
   const book = parseRDF(rdf);
   expect(book).to.be.an('object');
   expect(book).to.have.a.property('id', 132);
 });

This code takes advantage of Chai’s sentence-like BDD API, which we’ll use in increasing doses as we add more tests.

Since we have not yet implemented the code to include the id in the returned book object, as soon as you save the file, your Mocha terminal should report this:

 1 passing (4ms)
 1 failing
 
 1) parseRDF should parse RDF content:
  AssertionError: expected {} to have a property 'id'
  at Context.it (test/parse-rdf-test.js:32:28)

Good! The test is failing exactly as we expect it should.

Now it’s time to use Cheerio to pull out the four fields we want: the book’s ID, the title, the authors, and the subjects.

Reading Data from an Attribute

The first piece of information we hope to extract using Cheerio is the book’s ID. Recall that we’re trying to grab the number 132 out of this XML tag:

 <pgterms:ebook rdf:about="ebooks/132">

Open your lib/parse-rdf.js file and make it look like the following:

 'use strict';
 const cheerio = require('cheerio');
 
 module.exports = rdf => {
   const $ = cheerio.load(rdf);
 
   const book = {};
 
   book.id = +$('pgterms\\:ebook').attr('rdf:about').replace('ebooks/', '');
 
   return book;
 };

This code adds three things to the version listed in Enabling Continuous Testing with Mocha:

  • At the top, we now require Cheerio.

  • Inside the exported function, we use Cheerio’s load method to parse the rdf content. The $ function that’s returned is very much like jQuery’s $ function.

  • Using Cheerio’s APIs, we extract the book’s ID and, finally, format it.

The line where we set book.id is fairly dense, so let’s break it down. Here’s the same line, but split out and commented so we can dissect it:

 book.id =                    // Set the book's id.
   +                          // Unary plus casts the result as a number.
   $('pgterms\\:ebook')       // Query for the <pgterms:ebook> tag.
   .attr('rdf:about')         // Get the value of the rdf:about attribute.
   .replace('ebooks/', '');   // Strip off the leading 'ebooks/' substring.

In CSS, the colon character (:) has special meaning: it introduces pseudo selectors like :hover for links that are hovered over. In our case, we need a literal colon character for the <pgterms:ebook> tag name, so we have to escape it with a backslash. But since the backslash is itself a special character in JavaScript string literals, it needs to be escaped too. Thus, our query selector for finding the tag is written as 'pgterms\\:ebook' in the source code.

Once we have selected the pgterms:ebook tag, we pull out the rdf:about attribute value and strip off the leading ebooks/ substring, leaving only the string "132". The leading unary plus (+) at the start of the line ensures that this gets cast as a number.
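The two plain-JavaScript details in that line, the doubled backslash and the unary plus, can be checked in isolation without Cheerio. The following is a minimal sketch; the variable names are invented for illustration, and the about string is a stand-in for the value that attr('rdf:about') returns.

```javascript
'use strict';

// Two backslashes in the source produce one backslash in the string,
// which is what the CSS selector engine needs to see.
const selector = 'pgterms\\:ebook';
console.log(selector);         // pgterms\:ebook
console.log(selector.length);  // 14

// With a single backslash, JavaScript drops it and the colon is unescaped.
const unescaped = 'pgterms\:ebook';
console.log(unescaped === 'pgterms:ebook');  // true

// Stripping the prefix leaves a string; unary plus turns it into a number.
const about = 'ebooks/132';
const idString = about.replace('ebooks/', '');
const id = +idString;
console.log(typeof idString, typeof id);  // string number
console.log(id);                          // 132

// Unary plus yields NaN when the remaining text is not numeric.
console.log(Number.isNaN(+'ebooks/abc'.replace('ebooks/', '')));  // true
```

The NaN case is worth keeping in mind: if a file ever carried a non-numeric identifier, the id test would fail rather than silently pass.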

If all has gone well so far, your terminal running Mocha’s continuous testing should again read 2 passing.

Reading the Text of a Node

Next, let’s add a test for the title of the book. Insert the following code right after the test for the book’s ID.

 expect(book).to.have.a.property('title', 'The Art of War');

Your continuous testing terminal should read as follows:

 1 passing (3ms)
 1 failing
 
 1) parseRDF should parse RDF content:
  AssertionError: expected { id: 132 } to have a property 'title'
  at Context.it (test/parse-rdf-test.js:35:28)

Now let’s grab the title and add it to the returned book object. Recall that the XML containing the title looks like this:

 <dcterms:title>The Art of War</dcterms:title>

Getting this content is even easier than extracting the ID. Add the following to your parse-rdf.js file, after the line where we set book.id:

 book.title = $('dcterms\\:title').text();

Using Cheerio, we select the tag named dcterms:title and save its text content to the book.title property. Once you save this file, your tests should pass again.

Collecting an Array of Values

Moving on, let’s add tests for the array of book authors. Open your parse-rdf-test.js file and add these lines:

 expect(book).to.have.a.property('authors')
   .that.is.an('array').with.lengthOf(2)
   .and.contains('Sunzi, active 6th century B.C.')
   .and.contains('Giles, Lionel');

Here we really start to see the expressive power of Chai assertions. This statement reads almost like an English sentence:

Expect book to have a property called authors that is an array of length two and contains “Sunzi, active 6th century B.C.” and “Giles, Lionel”.

In Chai’s language-chaining model, words like and, that, and which are largely interchangeable. This lets you write clauses like .and.contains('X') or .that.contains('X'), depending on which version reads better in your test case.

Once you save this change, your continuous testing terminal should again report a test failure:

 1 passing (11ms)
 1 failing
 
 1) parseRDF should parse RDF content:
  AssertionError: expected { id: 132, title: 'The Art of War' } to have a
  property 'authors'
  at Context.it (test/parse-rdf-test.js:39:28)

To make the test pass, recall that we will need to pull out the content from these tags:

 <pgterms:agent rdf:about="2009/agents/4349">
   <pgterms:name>Sunzi, active 6th century B.C.</pgterms:name>
 </pgterms:agent>
 <pgterms:agent rdf:about="2009/agents/5101">
   <pgterms:name>Giles, Lionel</pgterms:name>
 </pgterms:agent>

We’re looking to extract the text of each <pgterms:name> tag that’s a child of a <pgterms:agent>. The CSS selector pgterms:agent pgterms:name finds the elements we need, so we can start with this:

 $('pgterms\\:agent pgterms\\:name')

You might be tempted to grab the text straight away like this:

 book.authors = $('pgterms\\:agent pgterms\\:name').text();

But unfortunately, this won’t give us what we want, because Cheerio’s text method returns a single string and we need an array of strings. Instead, add the following code to your parse-rdf.js file, after the book.title piece, to correctly extract the authors:

 book.authors = $('pgterms\\:agent pgterms\\:name')
   .toArray().map(elem => $(elem).text());

Calling Cheerio’s .toArray method converts the collection object into a true JavaScript Array. This allows us to use the native map method to create a new array by calling the provided function on each element and grabbing the returned value.

Unfortunately, the collection of objects that comes out of toArray doesn’t consist of Cheerio-wrapped objects, but rather document nodes. To extract the text using Cheerio’s text, we need to wrap each node with the $ function, then call text on it. The resulting mapping function is elem => $(elem).text().

Traversing the Document

Finally, we’re down to just one more piece of information we wanted to pull from the RDF file—the list of subjects.

 <dcterms:subject>
   <rdf:Description rdf:nodeID="N26bb21da0c924e5abcd5809a47f231e7">
     <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
     <rdf:value>Military art and science -- Early works to 1800</rdf:value>
   </rdf:Description>
 </dcterms:subject>
 <dcterms:subject>
   <rdf:Description rdf:nodeID="N269948d6ecf64b6caf1c15139afd375b">
     <rdf:value>War -- Early works to 1800</rdf:value>
     <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
   </rdf:Description>
 </dcterms:subject>

As with previous examples, let’s start by adding a test. Insert the following code into your parse-rdf-test.js after the other tests.

 expect(book).to.have.a.property('subjects')
   .that.is.an('array').with.lengthOf(2)
   .and.contains('Military art and science -- Early works to 1800')
   .and.contains('War -- Early works to 1800');

Unfortunately, these subjects are a little trickier to pull out than the authors were. It would be nice if we could use the tag structure to craft a simple CSS selector like this:

 $('dcterms\\:subject rdf\\:value')

However, this selector would match another tag in the document, which we don’t want.

 <dcterms:subject>
   <rdf:Description rdf:nodeID="Nfb797557d91f44c9b0cb80a0d207eaa5">
     <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCC"/>
     <rdf:value>U</rdf:value>
   </rdf:Description>
 </dcterms:subject>

To spot the difference, look at the <dcam:memberOf> tags’ rdf:resource URLs. The ones we want end in LCSH, which stands for Library of Congress Subject Headings.[42] These headings are a collection of rich indexing terms used in bibliographic records.

Contrast that with the tag we don’t want to match, which ends in LCC. This stands for Library of Congress Classification.[43] These are codes that divide all knowledge into 21 top-level classes (like U for Military Science) with many subclasses. These could be interesting in the future, but right now we only want the Subject Headings.

With your continuous test still failing, here’s the code you can add to your parse-rdf.js to make it pass:

 book.subjects = $('[rdf\\:resource$="/LCSH"]')
   .parent().find('rdf\\:value')
   .toArray().map(elem => $(elem).text());

Let’s break this down. First, we select the <dcam:memberOf> tags of interest with the CSS selector [rdf\:resource$="/LCSH"]. The brackets introduce a CSS attribute selector, and the $= indicates that we want elements whose rdf:resource attribute ends with /LCSH.

Next, we use Cheerio’s .parent method to traverse up to our currently selected elements’ parents. In this case, those are the <rdf:Description> tags. Then we traverse back down using .find to locate all of their <rdf:value> tags.

Lastly, just like with the book authors, we convert the Cheerio selection object into a true Array and use .map to get each element’s text. And that’s it! At this point your tests should be passing, meaning your parseRDF function is correctly extracting the data we want.
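The ends-with rule behind the $= attribute selector can be mimicked in plain JavaScript. This sketch applies String.prototype.endsWith to the three rdf:resource URLs from the XML snippets shown earlier. It illustrates the matching rule only; it is not how Cheerio evaluates selectors internally, and the resources and matches names are invented for the example.

```javascript
'use strict';

// rdf:resource values of the <dcam:memberOf> tags shown earlier.
const resources = [
  'http://purl.org/dc/terms/LCSH',
  'http://purl.org/dc/terms/LCSH',
  'http://purl.org/dc/terms/LCC',
];

// [rdf\:resource$="/LCSH"] keeps an element only when its attribute value
// ends with the given suffix, which is the same test endsWith performs.
const matches = resources.filter(url => url.endsWith('/LCSH'));

console.log(matches.length);  // 2
```

Including the leading slash in the suffix makes the match slightly stricter, since it requires LCSH to be a whole path segment.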

Anticipating Format Changes

One quick note before we move on—an older version of the Project Gutenberg RDF format had its subjects listed like this:

 <dcterms:subject>
   <rdf:Description>
     <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
     <rdf:value>Military art and science -- Early works to 1800</rdf:value>
     <rdf:value>War -- Early works to 1800</rdf:value>
   </rdf:Description>
 </dcterms:subject>

Instead of finding each subject’s <rdf:value> living in its own <dcterms:subject> tag, we find them bunched together under a single one. Now consider the traversal code we just wrote. By finding the /LCSH tag, going up to its parent <rdf:Description>, and then searching down for <rdf:value> tags, our code would work with both this earlier data format and the current one (at the time of this writing, anyway).

Whenever you work with third-party data, there’s a chance that it could change over time. When it does, your code may or may not continue to work as expected. There’s no hard and fast rule to tell you when to be more or less specific with your data-processing code, but I encourage you to stay vigilant to these kinds of issues in your work.

The beauty of testing in these scenarios is that when a data format changes, you can add more tests. This gives you confidence that you’re meeting the new demands of the updated data format while still honoring past data.

Recapping Data Extraction with Cheerio

After all of the incremental additions of the last several sections, here’s what your final parse-rdf-test.js should look like:

 'use strict';
 
 const fs = require('fs');
 const expect = require('chai').expect;
 const parseRDF = require('../lib/parse-rdf.js');
 
 const rdf = fs.readFileSync(`${__dirname}/pg132.rdf`);
 
 describe('parseRDF', () => {
   it('should be a function', () => {
     expect(parseRDF).to.be.a('function');
   });
 
   it('should parse RDF content', () => {
     const book = parseRDF(rdf);
 
     expect(book).to.be.an('object');
     expect(book).to.have.a.property('id', 132);
     expect(book).to.have.a.property('title', 'The Art of War');
 
     expect(book).to.have.a.property('authors')
       .that.is.an('array').with.lengthOf(2)
       .and.contains('Sunzi, active 6th century B.C.')
       .and.contains('Giles, Lionel');
 
     expect(book).to.have.a.property('subjects')
       .that.is.an('array').with.lengthOf(2)
       .and.contains('Military art and science -- Early works to 1800')
       .and.contains('War -- Early works to 1800');
   });
 });

And here’s the parse-rdf.js itself:

 'use strict';
 const cheerio = require('cheerio');
 
 module.exports = rdf => {
   const $ = cheerio.load(rdf);
 
   const book = {};
 
   book.id = +$('pgterms\\:ebook').attr('rdf:about').replace('ebooks/', '');
 
   book.title = $('dcterms\\:title').text();
 
   book.authors = $('pgterms\\:agent pgterms\\:name')
     .toArray().map(elem => $(elem).text());
 
   book.subjects = $('[rdf\\:resource$="/LCSH"]')
     .parent().find('rdf\\:value')
     .toArray().map(elem => $(elem).text());
 
   return book;
 };

Using this, we can now quickly put together a command-line program to explore some of the other RDF files. Open your editor and enter this:

 #!/usr/bin/env node
 const fs = require('fs');
 const parseRDF = require('./lib/parse-rdf.js');
 const rdf = fs.readFileSync(process.argv[2]);
 const book = parseRDF(rdf);
 console.log(JSON.stringify(book, null, ' '));

Save this file as rdf-to-json.js in your databases project directory. This program simply takes the name of an RDF file, reads its contents, parses them, and then prints the resulting JSON to standard output.

Previously when calling JSON.stringify, we passed only one argument, the object to be serialized. Here we’re passing three arguments to get a prettier output. The second argument (null) is an optional replacer that can be used for filtering (this is almost never used in practice). The last argument (' ') is the string used to indent nested objects, making the output more human-readable.
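To see what each argument contributes, here is a small standalone sketch. The book object is a hand-typed stand-in for what parseRDF returns, and the compact, pretty, and filtered names are just for illustration. Besides a function, JSON.stringify also accepts an array of property names as its replacer, which is the form shown here.

```javascript
'use strict';

// Hand-typed stand-in for a parsed book.
const book = {
  id: 11,
  title: "Alice's Adventures in Wonderland",
  authors: ['Carroll, Lewis'],
};

// One argument: compact output on a single line.
const compact = JSON.stringify(book);

// Three arguments: null means "no replacer", and the string is repeated
// once per nesting level to indent the output.
const pretty = JSON.stringify(book, null, ' ');

// An array replacer lists the property names to keep.
const filtered = JSON.stringify(book, ['id', 'title']);

console.log(compact);
console.log(pretty);
console.log(filtered);  // {"id":11,"title":"Alice's Adventures in Wonderland"}
```

Passing a number such as 2 as the third argument indents by that many spaces instead, which is a common alternative to a literal string.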

Let’s try it! Open a terminal to your databases project directory and run this:

 $ node rdf-to-json.js ../data/cache/epub/11/pg11.rdf
 {
  "id": 11,
  "title": "Alice's Adventures in Wonderland",
  "authors": [
   "Carroll, Lewis"
  ],
  "subjects": [
   "Fantasy"
  ]
 }

If you see this, great! It’s time to start performing these conversions in bulk.