Broadly speaking, there are two kinds of data: the kind that your own apps produce and the kind that comes from somewhere else. It would be nice if you only ever had to deal with data that you created. But the reality is that you’ll almost certainly have to work with outside data sources during your career, perhaps frequently!
Between this chapter and the next, you’ll use Node.js to take real data from the wild and put it into your own local datastore. This work can be neatly approached in two phases: transforming the raw data into an intermediate format, and importing that intermediate data into the datastore.

In this chapter, you’ll learn how to use Node.js to transform XML data into JSON, the lingua franca of modern data formats, and its close cousin, line-delimited JSON (LDJ). Then, in the following chapter, you’ll create a command-line tool to bring this LDJ content into Elasticsearch, a NoSQL database that indexes JSON objects.
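To make the difference between JSON and LDJ concrete, here is a minimal sketch. The sample records and the helper names toLdj and fromLdj are hypothetical, chosen only for illustration. The key idea is that an LDJ file holds one complete JSON object per line, so each line can be parsed on its own.

```javascript
'use strict';

// Hypothetical sample records; the field names are illustrative only.
const records = [
  { id: 1, title: 'First record' },
  { id: 2, title: 'Second record' },
];

// Regular JSON: the whole collection is serialized as a single document.
const json = JSON.stringify(records);

// LDJ: each object is serialized separately, one per line.
const toLdj = items =>
  items.map(item => JSON.stringify(item)).join('\n') + '\n';

// Reading LDJ back: split on newlines and parse each non-empty line.
const fromLdj = text =>
  text
    .split('\n')
    .filter(line => line.length > 0)
    .map(line => JSON.parse(line));

const ldj = toLdj(records);
console.log(json);
console.log(ldj);
```

Because every line stands alone, a program can process an LDJ file one record at a time without loading the whole file into memory.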
While writing, testing, and debugging tools to transform raw XML data into LDJ, we’ll investigate the following aspects of Node.js:
Using Chrome DevTools, it’s possible to inspect a running Node.js application. You’ll learn how to set breakpoints, step through your running Node.js code, and interrogate scoped variables.
Much of this chapter involves extracting data from XML files and transforming it into JSON for insertion into a document database. We’ll use Cheerio for this, a DOM-based XML parser with a jQuery-like API. To use it effectively, you’ll learn the basics of CSS selectors.
In the Node.js ecosystem, it’s fairly common to have modules that export a single stateless function rather than a collection of objects, classes, or methods. In this chapter, you’ll develop such a module iteratively using behavior-driven development (BDD) techniques.
We’re going to double down on npm in this chapter, adding scripts to launch Mocha tests in standalone mode, continuous testing mode, and debug mode. You’ll also learn to use Chai, an assertion library that pairs well with Mocha to write expressive, behavioral tests.
To kick off the chapter, we have to procure the data that we’re going to be working with. Then we’ll pick through it to get an understanding of the data format, as well as our desired output.
Unit tests are especially useful when developing data-processing code. For this reason, before writing that code, we’ll set up the infrastructure for continuously running Mocha tests. We’ll also approach the problem through BDD techniques by using Chai, a popular assertion library.
Getting into the nitty-gritty details of querying the raw XML data, we’ll use Cheerio, a module that lets you dive into HTML and XML data by using CSS selectors to find interesting elements. Don’t worry if you’re not yet familiar with writing CSS selectors—it’s a useful skill, and we’ll build up gradually.
In the final part of this chapter, you’ll use the parsing code to roll through the raw data and produce new data that’s ready for insertion into a database. You’ll learn how to walk down a directory tree sequentially and how to step through your code using Chrome DevTools.
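To give a feel for what walking a directory tree sequentially means, here is a minimal sketch that uses only Node.js's built-in fs and path modules. The function name walkSync is hypothetical, and this synchronous approach is a stand-in for illustration; it is not necessarily the technique used later in the chapter.

```javascript
'use strict';
const fs = require('fs');
const path = require('path');

// Visit every regular file under `dir`, one at a time, depth first.
// Entries are sorted by name so the visiting order is predictable.
const walkSync = (dir, visit) => {
  const entries = fs
    .readdirSync(dir, { withFileTypes: true })
    .sort((a, b) => a.name.localeCompare(b.name));

  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Finish the whole subdirectory before moving to the next entry.
      walkSync(fullPath, visit);
    } else if (entry.isFile()) {
      visit(fullPath);
    }
  }
};
```

Calling walkSync with a directory name and a callback such as file => console.log(file) would print each file path in turn. Handling one file at a time keeps resource usage flat, which matters when a tree contains many thousands of files.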
It’s a lot to cover, but you can do it. Let’s get started!