Chapter 3. Data Meant to Be Read by Machines

Data can be stored in many different formats and file types. Some formats store data in a way easily handled by machines, while others store data in a way meant to be easily readable by a human. Microsoft Word documents are an example of the latter, while CSV, JSON, and XML are examples of the former. In this chapter, we will cover how to read files easily handled by machines, and in Chapters 4 and Chapter 5 we will cover files made for human consumption.

Note

File formats that store data in a way easily understood by machines are commonly referred to as machine readable. Common machine-readable formats include the following:

Comma-Separated Values (CSV)
JavaScript Object Notation (JSON)
Extensible Markup Language (XML)

In spoken and written language, these data formats are typically referred to by their shorter names (e.g., CSV). We will be using these acronyms.

When looking for data or requesting data from an organization or agency, the formats described in this chapter are your best available resource. They are more easily used and ingested by your Python scripts than human-readable formats, and are usually easy to find on data websites.

Creating a Home for Your Code

To follow along with the examples and code in this chapter, you will need to save the files locally on your machine. If you haven’t already, you should create a folder where you are going to keep your Python code and data files. Call it something intuitive, like data_wrangling. Then, create a subfolder inside that folder for code associated with this book (e.g., code). It helps to keep your files organized and intuitively named.

If you follow this tip, you should end up with a folder like this: ~/Projects/data_wrangling/code.

On Unix-based systems (Linux and Mac), the ~ symbol represents your home directory for easy command-line access.

On Windows, the home directory is under Users, so your path will instead be C:\Users\<your_name>\Projects\data_wrangling.

Go ahead and download the code examples from the book’s data repository and move them into your project’s folder. As you follow along in the chapter, we will assume the data from that repository is stored in the same folder where you are writing your Python code. This way, we don’t have to worry about locating the files and can focus instead on importing data with Python.

CSV Data

The first machine-readable file type we will learn about is CSV. CSV files, or CSVs for short, are files that separate data columns with commas. The files themselves have a .csv extension.

Another type of data, called tab-separated values (TSV) data, sometimes gets classified with CSVs. TSVs differ only in that they separate data columns with tabs and not commas. The files themselves usually have a .tsv extension, but sometimes have a .csv extension. Essentially, .tsv and .csv files will act the same in Python.

Note

If the file has a .tsv file extension, then it’s likely TSV data. If the file has a .csv file extension, it’s probably CSV data, but it could be TSV data. Make sure to open and view your file so you know what you’re dealing with before you begin importing your data.

For our CSV sample in this chapter, we will look at data from the World Health Organization (WHO). The WHO has a lot of great datasets in different formats. The one selected for this example contains life expectancy rates worldwide by country. If you visit the web page for life expectany rates data, you will find a couple of different versions of this dataset. For this example, we are using CSV (text only).

If you open the file in your text editor,¹ you will see data holding rows containing the values in Table 3-1.

Table 3-1. Two sample data records^a
CSV headers	Sampler rec 1	Sampler rec 2
Indicator	Life expectancy at age 60 (years)	Life expectancy at birth (years)
PUBLISH STATES	Published	Published
Year	1990	1990
WHO region	Europe	Americas
World Bank income group	High-income	Lower-middle-income
Country	Czech Republic	Belize
Sex	Female	Both sexes
Display Value	19	71
Numeric	19.00000	71.00000
Low	no value	no value
High	no value	no value
Comments	no value	no value
^a Bold items are included in the sample data.

To make the data easier to read, a sample of the data with trimmed-down fields is shown here. You should see something similar to this when you open the CSV file in your text editor:

"Year", "Country","Sex","Display Value","Numeric"
"1990","Andorra","Both sexes","77","77.00000"
"2000","Andorra","Both sexes","80","80.00000"
"2012","Andorra","Female","28","28.00000"
"2000","Andorra","Both sexes","23","23.00000"
"2012","United Arab Emirates","Female","78","78.00000"
"2000","Antigua and Barbuda","Male","72","72.00000"
"1990","Antigua and Barbuda","Male","17","17.00000"
"2012","Antigua and Barbuda","Both sexes","22","22.00000"
"2012","Australia","Male","81","81.00000"

Another way to preview the file is to open it in a spreadsheet program such as Excel or Google Spreadsheets. Those programs will display each entry of data as a separate row.

How to Import CSV Data

Now that we have learned a little bit about the data, let’s open the file in Python and convert the data into a form Python can understand. This only takes a couple of lines of code:

import csv

csvfile = open('data-text.csv', 'rb')
reader = csv.reader(csvfile)

for row in reader:
    print row

Let’s go through each line. In the previous chapter, we did all of our coding in the Python interpreter, but as the code gets longer and more complicated, it is easier to write and run code from a file. After we walk through this bit of code, we will save the code to a .py file, which is a Python file, and run the file from the command line.

The first line of the script imports a library called csv:

import csv

A Python library is a package of code that provides functionality you can use in your Python programs. The csv library we are importing comes with your Python installation as part of the standard library (or stdlib). Once we import a library into our file, we can use it. If we did not have this library, this script would be a lot longer—the csv library gives us helper functions so we don’t have to write as much code to perform a more complex task.

The second line of code takes our data-text.csv file, which should be located in the same folder as the script, and passes it to the open function:

csvfile = open('data-text.csv', 'rb')

Note

A function is a piece of code that performs a task when it is invoked. It is very similar to the Python data type methods we learned about in Chapter 2. Sometimes functions will take an input (or inputs). These inputs are called arguments. The function performs an action based on the arguments. Sometimes functions also return an output, which can then be stored or used.

open is a built-in function in Python, meaning the act of opening a file is so common that the core Python contributors felt it should be added to every Python installation. When we use the open function, we pass a filename as the first argument (here we used 'data-text.csv') and then we optionally specify which mode this file should be opened in (we used 'rb'). If you visit the docs for the open function, you will find that the argument 'rb' means that we open the file as read-only and in binary mode. Opening the file in binary mode allows our code to run on both Windows and Unix-based operating systems. The other common mode is write ('w', or 'wb' for write in binary mode).

Tip

If you want to read, open in read mode. If you intend to write, open the file in write mode.

We store the output of this function in the variable csvfile. csvfile now holds an open file as its value.

In the next line, we pass csvfile to the reader function in the csv module. This function tells the csv module to read the open file as a CSV:

reader = csv.reader(csvfile)

The output of the function csv.reader(csvfile) is stored in the reader variable. The reader variable now holds a Python CSV reader with the opened file. This CSV reader allows us to easily view data from our file using simple Python commands. In the last piece of code, we have something called a for loop.

A for loop is a way of iterating over Python objects, commonly used with lists. A for loop tells Python code, “For each thing in this list of things, do something.” The first word used after for in a for loop is the variable that will hold each object in the list (or other iterable object). The code below the for loop uses that variable to perform more functions or calculations on the item. For that reason, it is best to use a word which conveys meaning, so you and others can easily read and understand the code.

Note

Remember our invalid token error from Chapter 2? In Python, for is another special token, and it can only be used for creating for loops. Tokens help translate what we type in our interpreters or scripts into something our computers can run and execute.

Try running the following example in your Python interpreter:

dogs = ['Joker', 'Simon', 'Ellie', 'Lishka', 'Fido']
for dog in dogs:
    print dog

With this for loop, we store each dog name in the for loop variable dog. For each of the iterations of our for loop, we print out the dog’s name (held in the variable dog). When the for loop has gone through each dog’s name (or item in the list), the code is finished running.

Closing Indented Code Blocks in IPython

When writing for loops or other indented blocks in an IPython terminal, make sure your prompt has returned from an indented block style ... to a new In prompt. The easiest way to do this is to simply hit Return after you are finished entering your final indented lines. You should see your prompt ask for another In before entering code outside of your loop:

In [1]: dogs = ['Joker', 'Simon', 'Ellie', 'Lishka', 'Fido']

In [2]: for dog in dogs:
   ...:     print dog   
   ...:                 
Joker
Simon
Ellie
Lishka
Fido

In [3]:

: IPython begins an auto-indented prompt (...: followed by four spaces).
: Press Return on the blank line to close the indented block and execute the code.
: A new prompt appears when IPython is done with the code.

In the case of the code we are using to read our CSV, our reader object is a Python container holding the data rows. In our reader for loop, we hold each row of data in the variable row. The next line states that for each row, we want Python to print the row:

for row in reader:
    print row

Now that we can import and loop through our data, we can really begin to explore it.

Saving the Code to a File; Running from Command Line

As you are working on code as a developer, you will want to save even partial bits of in-progress code to review and use later. If you are interrupted, being organized and saving your code means you can seamlessly pick up where you left off.

Let’s save the file that has our code up to this point, and run it. The code should look like this (if you haven’t already, open up your text editor, create a new file, and type this code into it):

import csv

csvfile = open('data-text.csv', 'rb')
reader = csv.reader(csvfile)

for row in reader:
    print row

Tip

Pay attention to capitalization, spacing, and new lines. Your code will not work if each line has different spacing or there are odd capitals. Write the code exactly as we have done, using four spaces to indent lines. This is important because Python is case-sensitive and uses indentation to indicate structure.

Save the code using your text editor as a .py (Python) file. The full filename should be something like import_csv_data.py.

Put the data file data-text.csv in the same folder where you just saved your Python file. If you want to put the file in another location, you will need to update the code appropriately for your new file location.

Opening Files from Different Locations

In the current code, we pass the path of the file to the open function like this:

open('data-text.csv', 'rb')

However, if our code was in a subfolder called code, we would need to modify the script to look there. That is, we would instead need to use:

open('data/data-text.csv', 'rb')

In the preceding example, we would have a file structure that looks like this:

data_wrangling/
`-- code/
    |-- import_csv_data.py
    `-- data/
         `-- data-text.csv

If you have trouble locating your file, open up your command line and use the following commands on your Mac or Linux machine to navigate through your folders:

ls returns a list of files.
pwd shows your current location.
cd ../ moves to the parent folder.
cd ../../ moves two levels up.
cd data moves into a folder called data that is inside the folder you are currently in (use ls to check!).

For more on navigating on the command line, including an entire section for Windows users, check out Appendix C.

After you save the file, you can run it using the command line. If you don’t already have it open, open your command line (Terminal or cmd), and navigate to where the file is located. Let’s assume that you put the file in ~/Projects/data_wrangling/code. To navigate there using the Mac command line, you would use the change directory or folder command (cd):

cd ~/Projects/data_wrangling/code

After you get to the right location, you can run the Python file. Up until this point, we were running our code in the Python interpreter. We saved the file as import_csv_data.py. To run a Python file from the command line, you simply type python, a space, and then the name of the file. Let’s try running our import file:

python import_csv_data.py

Your output should look like a bunch of lists—something like the data shown here, but with many more records:

['Healthy life expectancy (HALE) at birth (years)', 'Published', '2012',
 'Western Pacific', 'Lower-middle-income', 'Samoa', 'Female', '66',
 '66.00000', '', '', '']
['Healthy life expectancy (HALE) at birth (years)', 'Published', '2012',
 'Eastern Mediterranean', 'Low-income', 'Yemen', 'Both sexes', '54',
 '54.00000', '', '', '']
['Healthy life expectancy (HALE) at birth (years)', 'Published', '2000',
 'Africa', 'Upper-middle-income', 'South Africa', 'Male', '49', '49.00000',
 '', '', '']
['Healthy life expectancy (HALE) at birth (years)', 'Published', '2000',
 'Africa', 'Low-income', 'Zambia', 'Both sexes', '36', '36.00000', '', '', '']
['Healthy life expectancy (HALE) at birth (years)', 'Published', '2012',
 'Africa', 'Low-income', 'Zimbabwe', 'Female', '51', '51.00000', '', '', '']

Did you get this output? If not, stop for a minute to read the error you received. What does it tell you about where you might have gone wrong? Take time to search for the error and read a few ways people have fixed the same error. If you need extra help on troubleshooting how to get past the error, take a look at Appendix E.

Note

For a lot of our code from this point onward, we will do the work in a code editor, save the file, and run it from the command line. Your Python interpreter will still be a helpful tool to try out pieces of code, but as code gets longer and more complex it becomes harder to maintain in a code prompt.

With the current code we are writing, along with many other solutions we’ll write, there are often many ways to solve a problem. csv.reader() returns each new line of your file as a list of data and is an easy-to-understand solution when you begin. We are going to modify our script slightly to make our list rows into dictionary rows. This will make our data a little easier to read, compare, and understand as we explore our dataset.

In your text editor, take line 4, reader = csv.reader(csvfile), and update it to read reader = csv.DictReader(csvfile). Your code should now look like this:

import csv

csvfile = open('data-text.csv', 'rb')
reader = csv.DictReader(csvfile)

for row in reader:
    print row

When you run the file again after saving it, each record will be a dictionary. The keys in the dictionary come from the first row of the CSV file. All the subsequent rows are values. Here is an example row of output:

{
    'Indicator': 'Healthy life expectancy (HALE) at birth (years)',
    'Country': 'Zimbabwe',
    'Comments': '',
    'Display Value': '49',
    'World Bank income group': 'Low-income',
    'Numeric': '49.00000',
    'Sex': 'Female',
    'High': '',
    'Low': '',
    'Year': '2012',
    'WHO region': 'Africa',
    'PUBLISH STATES': 'Published'
}

At this point, we have successfully imported the CSV data into Python, meaning we were able to get the data from the file into a usable format Python can understand (dictionaries). Using a for loop helped us see the data so we could visually review it. We were able to use two different readers from the csv library to see the data in both a list and a dictionary form. We will be using this library again as we start exploring and analyzing datasets. For now, let’s move on to importing JSON.

JSON Data

JSON data is one of the most commonly used formats for data transfers. It is preferred, because it is clean, easy to read, and easy to parse. It is also one of the most popular data formats that websites use when transmitting data to the JavaScript on the page. Many sites have JSON-enabled APIs, which we will review in Chapter 13. For this section, we will continue using the worldwide life expectancy rates data. This data is not in available in JSON form from the WHO, but we have created a JSON version for the purposes of this book; it’s available in the code repository.

Note

If a file has a .json file extension, then it’s probably JSON data. If it has a .js file extension, it is probably JavaScript, but in some rare cases it might be a poorly named JSON file.

If you open the JSON file in your code editor, you will see that each data record looks a lot like a Python dictionary. There is a key and value for each row, separated by a :, with each entry separated by a ,. There are also opening and closing curly braces ({}). Here is a sample record from the JSON file:

[
  {
    "Indicator":"Life expectancy at birth (years)",
    "PUBLISH STATES":"Published",
    "Year":1990,
    "WHO region":"Europe",
    "World Bank income group":"High-income",
    "Country":"Andorra",
    "Sex":"Both sexes",
    "Display Value":77,
    "Numeric":77.00000,
    "Low":"",
    "High":"",
    "Comments":""
  },
]

Depending on the formatting, sometimes a JSON file can look exactly like a dictionary. In this sample, each entry is a Python dictionary (defined by the { and }) and those dictionaries are all held in a list, which is defined by the [ and ].

How to Import JSON Data

Importing a JSON file is even easier than importing a CSV file in Python. The following code will open, load, import, and print a JSON data file:

import json                                 

json_data = open('data-text.json').read()   

data = json.loads(json_data)                

for item in data:                           
    print item

: Imports the json Python library, which we will use to process the JSON.
: Uses the built-in Python function open to open the file. The filename is data-text.json (and this is the first argument for our open function). This line of code calls the open file’s read method, which will read the file and store it in the json_data variable.
: Uses json.loads() to load JSON data into Python. The data variable catches the output.
: Iterates over the data using a for loop and prints each item, which will be the output in this example.

If you run python import_json_data.py from the command line, the output will show a dictionary for each record in the JSON file. It should look almost exactly like the final CSV output. Remember to copy the data file into the folder or to change the file path to reflect it’s location.

At the end of the CSV section, we learned how to save a file and run it from the command line. With this example, let’s start with a blank file and build up to that.

For a quick overview:

Create a new file in your code editor.
Save it as import_json_data.py in the folder holding your code.
Move (or save) the data file to the folder where you are storing your code. (Make sure to rename the file so it matches the name you use in the code. The book uses data-text.json.)
Go back to your code editor, which should still have the import_json_data.py file open.

Let’s read through the code and compare it to the CSV import file. First, we import Python’s built-in json library:

import json

Then we use the open function we are already familiar with to open the file data-text.json and call the read method on the open file:

json_data = open('data-text.json').read()

In the CSV file, we did not call read. What is the difference? In the CSV example, we opened the file in read-only mode, but in the JSON example, we are reading the contents of the file into our json_data variable. In the CSV example, the open function returns an object that is a file, but in the JSON example, we first open and then read the file, so we have a str (string). This difference is based on the fact that the Python json and csv libraries handle input in different ways. If you tried to pass a string into the CSV reader, it would throw an error, and the JSON loads function would throw an error if you passed it a file.

The good news is that Python makes it easy to write a string to a file (e.g., if you needed to use the CSV reader but only had a string), and it’s also easy to read a file into a string. To Python, a closed file is just a filename string waiting to be opened and read. Getting data from a file, reading it into a string, and passing that string to a function requires just a few lines of Python code.

Tip

From the folder where you have the JSON file stored, you can type the following into your Python interpreter and see what kind of object type each version outputs:

filename = 'data-text.json'
type(open(filename, 'rb'))      # similar to csv code
type(open(filename).read())     # similar to json code

The Python json library’s loads function expects a string, not a file. The Python csv library’s reader function expects an open file. In the next line of our script, we will use the loads function, which loads a JSON string into Python. The output of this function is assigned to a variable called data:

data = json.loads(json_data)

To preview our data, we iterate over each item and print it out. This isn’t necessary for our code, but it helps us preview the data and make sure it’s in the proper form:

for item in data:
    print item

Once you are done writing your file you can save and run it. As you can see, opening and converting a JSON file to a list of dictionaries in Python is quite easy. In the next section, we will explore more customized file handling.

XML Data

XML is often formatted to be both human and machine readable. However, the CSV and JSON examples were a lot easier to preview and understand than the XML file for this dataset. Luckily for us, the data is the same, so we are familiar with it. Download and save the XML version of the life expectancy rates data in the folder where you are saving content associated with this chapter.

Note

If a file has an .xml file extension, then it’s XML data. If it has an .html or .xhtml file extension, it can sometimes be parsed using XML parsers.

As we do with all of our data, let’s open the file in a code editor to preview it. If you scroll through the file, you will see the familiar data we covered in the CSV example. The data looks different, though, because it’s presented in XML format, using things called tags.

Note

XML is a markup language, which means it has a document structure that contains formatted data. XML documents are essentially just specially formatted data files.

The following snippet is a sampling of the XML data we are working with. In this example, <Observation />, <Dim />, and <Display /> are all examples of tags. Tags (or nodes) store data in a hierarchical and structured way:

<GHO ...>
    <Data>
        <Observation FactID="4543040" Published="true"
        Dataset="CYCU" EffectiveDate="2014-03-27" EndDate="2900-12-31">
            <Dim Category="COUNTRY" Code="SOM"/>
            <Dim Category="REGION" Code="EMR"/>
            <Dim Category="WORLDBANKINCOMEGROUP" Code="WB_LI"/>
            <Dim Category="GHO" Code="WHOSIS_000002"/>
            <Dim Category="YEAR" Code="2012"/>
            <Dim Category="SEX" Code="FMLE"/>
            <Dim Category="PUBLISHSTATE" Code="PUBLISHED"/>
            <Value Numeric="46.00000">
                <Display>46</Display>
            </Value>
        </Observation>
        <Observation FactID="4209598" Published="true"
        Dataset="CYCU" EffectiveDate="2014-03-25" EndDate="2900-12-31">
            <Dim Category="WORLDBANKINCOMEGROUP" Code="WB_HI"/>
            <Dim Category="YEAR" Code="2000"/>
            <Dim Category="SEX" Code="BTSX"/>
            <Dim Category="COUNTRY" Code="AND"/>
            <Dim Category="REGION" Code="EUR"/>
            <Dim Category="GHO" Code="WHOSIS_000001"/>
            <Dim Category="PUBLISHSTATE" Code="PUBLISHED"/>
            <Value Numeric="80.00000">
                <Display>80</Display>
            </Value>
        </Observation>
    </Data>
</GHO>

Data values can be stored in two places in an XML file: either in between two tags, as in <Display>46</Display>, where the value for the <Display> tag is 46; or as an attribute of a tag, as in <Dim Category="COUNTRY" Code="SOM"/>, where the value of the Category attribute is "COUNTRY" and the value of the Code attribute is "SOM". XML attributes store extra information for a particular tag, nested inside a single tag.

Whereas in JSON you might store data in key/value pairs, in XML you can store data in pairs or groups of threes or fours. The XML tags and attributes hold data, similar to the JSON keys. So when we look again at the Display tag, the value for that tag is held within the opening and closing portions of the tag. When we view the Dim node, we see it has two different attributes with values (Category and Code). XML formatting allows us to store more than one attribute in each node. For those of you familiar with HTML, this should look familiar. That’s because HTML is closely related to XML; they both carry attributes inside nodes (or tags), and they are both Markup languages.

Warning

Although there are some well-known standards for forming XML tags and naming attributes, much of the structure is dictated by the person (or machine) who designs or creates the XML. You can never assume consistency if you’re using datasets from different sources. For more reading on XML best practices, IBM has provided some great talking points.

How to Import XML Data

Because we have an understanding of the data, let’s import the file into a usable form for Python. To get the data out of XML form and into Python, we will write the following code:

from xml.etree import ElementTree as ET

tree = ET.parse('data-text.xml')
root = tree.getroot()

data = root.find('Data')

all_data = []

for observation in data:
    record = {}
    for item in observation:

        lookup_key = item.attrib.keys()[0]

        if lookup_key == 'Numeric':
            rec_key = 'NUMERIC'
            rec_value = item.attrib['Numeric']
        else:
            rec_key = item.attrib[lookup_key]
            rec_value = item.attrib['Code']

        record[rec_key] = rec_value

    all_data.append(record)

print all_data

As you can see, this is a little more complicated than the CSV and JSON examples.

Let’s take a closer look. Create a new file in your code editor and save it to the folder where you have been putting your code. Name it import_xml_data.py. Also, if you downloaded the data directly from the WHO site and not the book’s repository, rename the saved XML file data-text.xml and put it in the same folder as your new code.

First, let’s import ElementTree, part of the built-in library we are going to use to parse the XML:

from xml.etree import ElementTree as ET

Tip

As mentioned earlier, there are often multiple solutions to a problem. While we use ElementTree in this example, there is another library you could use called lxml, and yet another called minidom.

Because all three of these solutions can be used to solve the same problem, if you find a good example using one of the libraries, we encourage you to explore the data using another library as well. As you learn Python, choose libraries that seem easiest for you to understand (it just so happens that in many cases these are the best choices).

This import statement has an extra component we did not have last time: as ET. We are importing ElementTree, but will refer to it as ET. Why? Because we are lazy and do not want to type out ElementTree every time we want to use the library. This is a common practice when importing classes or functions with long names, but it is not mandatory. The as tells Python we want to use ET to represent ElementTree.

Next, we call the parse method on the ET class, which will parse data from the filename we pass. Because we are parsing a file located in the same folder, the filename needs no file path:

tree = ET.parse('data-text.xml')

The parse method returns a Python object people normally store in a variable tree. When talking about XML, the tree is the whole XML object stored in a way Python can understand and parse.

To understand how to traverse the tree (and data contained therein), we begin at the root of the tree. The root is the first XML tag. To start at the root of the tree, we call the getroot function:

root = tree.getroot()

If you were to take a moment to print out root by adding print root after the previous statement, you would find it prints out the Python representation of the root element in the XML tree (it should look like <Element 'GHO' at 0x1079e79d0>²). From this representation, we can quickly tell ElementTree identified the root or outermost tag of the XML document as an XML node with a tag name GHO.

Now that we’ve identified the root, we need to figure out how to access the data we want. After analyzing the data in this chapter’s CSV and JSON sections, we know what data we’d like to review. We need to traverse the XML tree and extract that same data. In order to understand what we’re looking for, we need to understand the overall structure and format of the XML tree. Here, we’ve slimmed down the XML file we are using and removed the data, so we can view just the core structure:

<GHO>
    <Data>
        <Observation>
            <Dim />
            <Dim />
            <Dim />
            <Dim />
            <Dim />
            <Dim />
            <Value>
                <Display>
                </Display>
            </Value>
        </Observation>
        <Observation>
            <Dim />
            <Dim />
            <Dim />
            <Dim />
            <Dim />
            <Dim />
            <Value>
                <Display>
                </Display>
            </Value>
        </Observation>
    </Data>
</GHO>

In reviewing this structure, we can see each “row” of data is held in an Observation tag. The data for each of these rows is then held in the Dim, Value, and Display nodes within each Observation node.

So far we have three lines of code. To investigate how we can use Python to extract these nodes, let’s add print dir(root) to the end of our current code, then save the file and run it on the command line:

python import_xml_data.py

You will see all the methods and properties of the root variable. Your code should look like this:

from xml.etree import ElementTree as ET

tree = ET.parse('data-text.xml')
root = tree.getroot()

print dir(root)

When you run the file, you should see this output:

['__class__', '__delattr__', '__delitem__', '__dict__', '__doc__',
'__format__', '__getattribute__', '__getitem__', '__hash__', '__init__',
'__len__', '__module__', '__new__', '__nonzero__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__',
'__str__', '__subclasshook__', '__weakref__', '_children', 'append', 'attrib',
'clear', 'copy', 'extend', 'find', 'findall', 'findtext', 'get',
'getchildren', 'getiterator', 'insert', 'items', 'iter', 'iterfind',
'itertext', 'keys', 'makeelement', 'remove', 'set', 'tag', 'tail', 'text']

Let’s assume our file is too large for us to open, and that we don’t know the structure of the file. This is often the case with larger XML datasets. What can we do? Let’s start by calling dir(root) to review the root object’s methods. We notice the getchildren method as a potential solution for seeing children in the Observation nodes. After reading the latest documentation and a question on Stack Overflow, we find the getchildren method will return the subelements, but the method has been deprecated. When a method you want to use is or will soon become deprecated, you should attempt to use what the authors of the library suggest as a replacement.

Tip

When a method, class, or function is deprecated, it means this functionality will likely be removed from future releases of the library or module. For this reason, you should always avoid using deprecated methods or classes and make sure you read through the documentation, as the authors have likely recommended what to use going forward.

Based on what the documentation is recommending, if we want to view subelements for the root tree, we should use list(root). If we have a very large file, returning direct subelements will give us a view of the data and its structure, without overwhelming us with too many lines of output. Let’s try that.

Replace this line:

print dir(root)

with this line:

print list(root)

Run the file again from your command line. You should end up with the following output, which is a list of Element objects (for our purposes, elements are XML nodes):

[<Element 'QueryParameter' at 0x101bfd290>,
<Element 'QueryParameter' at 0x101bfd350>,
<Element 'QueryParameter' at 0x101bfd410>,
<Element 'QueryParameter' at 0x101bfd590>,
<Element 'QueryParameter' at 0x101bfd610>,
<Element 'QueryParameter' at 0x101bfd650>,
<Element 'Copyright' at 0x101bfd690>,
<Element 'Disclaimer' at 0x101bfd710>,
<Element 'Metadata' at 0x101bfd790>,
<Element 'Data' at 0x102540250>]

The list contains Element objects called QueryParameter, Copyright, Disclaimer, Metadata, and Data. We can traverse these elements and explore the contents so we better understand how to extract the data we are after.

When searching for data stored in an XML tree, the Data element is likely a good place to start. Now we have found the Data element, and we can focus on that subelement. There are a couple of ways to get the Data element, but we will use the find method. The root element’s find method allows us to search for a subelement using the tag name. Then, we will get the Element’s children and see what we should do next.

Replace this line:

print list(root)

with this:

data = root.find('Data')

print list(data)

Tip

There is also a findall method we could use. The difference between find and findall is that find will return the first matching element, while findall will return all of the matching elements. We know there is only one Data Element, so we can use find instead of findall. If there was more than one Element, we would want to use the findall method to get the whole list of matching elements and iterate over them.

When you rerun your file with the new lines of code, you will see a dizzying output of a list of Observation elements. These are our data points. Although there is a lot of information output, you can tell it is a list because the last character is a ], which symbolizes the end of a list.

Let’s iterate over the list of data. Each Observation represents a row of data, so they should have more information contained inside. We can iterate over those elements individually to see what subelements exist. With a Python Element objects, we can iterate over all of the subelements similar to how we would a list. Therefore, we can iterate over each Observation and iterate over each subelement in the Observation elements. This is our first time using a loop within a loop, and it should show us whether there are more subelements we can use to extract data.

Tip

Because XML stores data in nodes, subnodes, and attributes, you’ll often want to take the approach of exploring each node and subnode (or element and subelement) until you get a good idea of not only how the data is structured but also how Python sees the data.

Replace this line:

print list(data)

with these lines:

for observation in data:
    for item in observation:
        print item

and rerun the file.

The output is a bunch of Dim and Value objects. Let’s try a couple of different ways to explore what might be contained in these elements. There are a few ways you can view data within Python’s Element object. One of the attributes of every Element node is text, which displays the text contained inside the node.

Replace this line:

print item

with this line:

print item.text

and rerun the file.

What happened? You should have gotten a lot of None values back. This is because item.text does not exist for those elements because there is no text between the elements’ tags. Look at how <Dim /> is structured in the data samples. For example:

<Dim Category="YEAR" Code="2000"/>

In Python, item.text is only useful if your elements are structured with text in the nodes, like the following:

<Dim Category="YEAR">2000</Dim>

For the second example, item.text returns 2000.

XML data can be structured in many different ways. The information we need is located in the XML; it just wasn’t in the first place we looked. Let’s continue to explore.

Another place to look is in child elements. Child elements are subelements of a parent element. Let’s check if we have any child elements. Replace this line:

print item.text

with this line:

print list(item)

When you rerun the code with that change, the output shows that some (but not all) elements have children. That’s interesting! These are actually Value elements. Look at how these are structured in the data sample:

<Value>
    <Display>
    </Display>
</Value>

If we want to explore those child elements, we’ll need another loop similar to the loop we wrote to go through the items in each Observation.

There is another method we can call for each Element object in Python, called attrib, which returns the attributes for each node. As we know from reviewing XML structure, if nodes don’t have values between the tags, they usually have attributes within the tags.

To see what attributes we have in our nodes, replace:

print list(item)

with this line:

print item.attrib

When we rerun the code, we see the data contained in the attributes output as a bunch of dictionaries. Rather than storing each element and the attributes in separate dictionaries, we want each full row of data in a dictionary together. Here is one record from our attrib output:

{'Category': 'PUBLISHSTATE', 'Code': 'PUBLISHED'}
{'Category': 'COUNTRY', 'Code': 'ZWE'}
{'Category': 'WORLDBANKINCOMEGROUP', 'Code': 'WB_LI'}
{'Category': 'YEAR', 'Code': '2012'}
{'Category': 'SEX', 'Code': 'BTSX'}
{'Category': 'GHO', 'Code': 'WHOSIS_000002'}
{'Category': 'REGION', 'Code': 'AFR'}
{'Numeric': '49.00000'}

Because we ended up with a dictionary for every record in our CSV example, let’s try to put this output in a similar form. The keys in our XML data dictionary will be slightly different, because the WHO does not provide the same data in the XML dataset as in the CSV dataset. We will work to get our data in the following form, but ignore the difference in key names. Ultimately, this will not affect how we use the data.

Just to remind you, here is a sample record from the CSV reader:

{
    'Indicator': 'Healthy life expectancy (HALE) at birth (years)',
    'Country': 'Zimbabwe',
    'Comments': '',
    'Display Value': '51',
    'World Bank income group': 'Low-income',
    'Numeric': '51.00000',
    'Sex': 'Female',
    'High': '',
    'Low': '',
    'Year': '2012',
    'WHO region': 'Africa',
    'PUBLISH STATES': 'Published'
}

Here is the goal for a sample record using our XML data. We aim to have it in this format by the time we are done parsing our XML tree:

{
    'COUNTRY': 'ZWE',
    'GHO': 'WHOSIS_000002',
    'Numeric': '49.00000',
    'PUBLISHSTATE': 'PUBLISHED',
    'REGION': 'AFR',
    'SEX': 'BTSX',
    'WORLDBANKINCOMEGROUP': 'WB_LI',
    'YEAR': '2012'
 }

Notice the High and Low fields are missing. If they existed in our XML dataset, we would add them to the keys of our new dictionary. The Display Value is also missing. We decided not to include it, as it’s the same as the Numeric value.

Currently, your code should look like this:

from xml.etree import ElementTree as ET

tree = ET.parse('data-text.xml')
root = tree.getroot()

data = root.find('Data')

for observation in data:
    for item in observation:
        print item.attrib

To create the data structure, we need to first create an empty dictionary for each record. We will use this dictionary to add keys and values, then we will append each record to a list, so we have a final list of all of our records (similar to our CSV data).

Let’s add our empty dictionary and an empty list to hold our data in. Add all_data = [] on a new line above the outer for loop and record = {} as the first line in the for loop, like so:

all_data = []

for observation in data:
    record = {}
    for item in observation:
        print item.attrib

Now we need to figure out what our key and values are for each line and add them to our record’s dictionary. For each attrib call, we get a dictionary with one or more value and key combinations returned, like this:

{'Category': 'YEAR', 'Code': '2012'}

It looks like the value of the Category key (here, YEAR) should be the key for our dictionary, and the value of Code (here, 2012) should be set as the value for that key. As you’ll recall from Chapter 2, a dictionary key should be easily used for lookup (like YEAR) and dictionary value should contain the value associated with that key (like 2012). With that knowledge, the preceding line would become:

'YEAR': '2012'

Update print item.attrib to print item.attrib.keys() in your code, then rerun it:

for item in observation:
    print item.attrib.keys()

It will output the keys of each attribute dictionary. We want to check the keys so we can form the keys and values of the new item dictionary. We end up with two different outputs: ['Category', 'Code'] and ['Numeric']. Let’s tackle one at a time. Based on our initial investigation, for the elements with both Category and Code, we know we need to use the Category values as keys and the Code values as values.

To do this, add [0] to the end of item.attrib.keys():

for item in observation:
    lookup_key = item.attrib.keys()[0]
    print lookup_key

This is called indexing. It will return the first item in the list.

Working with List Indexes

Indexing a list or other iterable object in Python means you are taking the nth object from that list. In Python, indexing begins with a 0, meaning that the first item is numbered 0, the second item uses 1, and so on. Because our dictionary has either one or two items and we want to grab the first item, we add [0].

Take our dog example from the previous chapter:

dog_names = ['Joker', 'Simon', 'Ellie', 'Lishka', 'Fido']

If you want to pull out Ellie from the list, then you want the third item in the list. Because the index count starts at 0, you would pull this item out in the following way:

dog_names[2]

Try this in your Python interpreter, then try to pull out Simon. What happens if you put a negative number as the index? (Hint: a negative index counts backwards!)

If we rerun the code and look at the output, we now have the following:

Category
Category
Category
Category
Category
Category
Category
Numeric

Now that we have the key names, we can look up the values. We need the values of the Category key to use as the keys in our new dictionary. Update the inner for loop by creating a new variable, rec_key, which stores the value returned from item.attrib[lookup_key]:

for item in observation:
    lookup_key = item.attrib.keys()[0]
    rec_key = item.attrib[lookup_key]
    print rec_key

With these updates, rerun the code from your command line. For each record, we get the following values:

PUBLISHSTATE
COUNTRY
WORLDBANKINCOMEGROUP
YEAR
SEX
GHO
REGION
49.00000

These all look like great keys for our new dictionary, except for the last one. This is because the last element is a Numeric dictionary instead of the Category ones we have been working with. If we want to retain that data for our use, we need to set up a special case for those numeric elements using an if statement.

Python’s if Statement

In its most basic form, the if statement is a way to control the flow of your code. When you use an if statement, you are telling the code: if this condition is met, then do something particular.

Another way to use if is followed with an else. An if-else statement says: if the first condition is met, then do something, but if it is not, then do what is in the else statement.

Besides if and if-else, you will see == as a comparison operator. While = sets a variable equal to a value, == tests to see if two values are equal. In addition, != tests to see if they are not equal. Both of these operators return Boolean values: True or False.

Try the following examples in your Python interpreter:

x = 5

if x == 5:
    print 'x is equal to 5.'

What did you see? x == 5 will return True, and so the text will have printed. Now try:

x = 3

if x == 5:
    print 'x is equal to 5.'
else:
    print 'x is not equal to 5.'

Because x equals 3 and not 5 in this example, you should have received the print statement in the else block of code. You can use if and if-else statements in Python to help guide the logic and flow of your code.

We want to see when lookup_key is equal to Numeric, and use Numeric as the key instead of the value (like we did with the Category keys). Update your code with the following:

for item in observation:

    lookup_key = item.attrib.keys()[0]

    if lookup_key == 'Numeric':
        rec_key = 'NUMERIC'
    else:
        rec_key = item.attrib[lookup_key]

    print rec_key

If you run your updated code, all of your keys should now look like keys. Now, let’s pull out the values we want to store in our new dictionary and associate them with those keys. In the case of Numeric, it’s simple, because we just want the Numeric key’s value. Make the following changes to your code:

    if lookup_key == 'Numeric':
        rec_key = 'NUMERIC'
        rec_value = item.attrib['Numeric']
    else:
        rec_key = item.attrib[lookup_key]
        rec_value = None

    print rec_key, rec_value

If you run the updated code, you will see the rec_value for Numeric is properly matched. For example:

NUMERIC 49.00000

For all other values, we set the rec_value to None. In Python, None is used to represent a null value. Let’s populate those with real values. Remember each record has a Category and a Code key, like so: {'Category': 'YEAR', 'Code': '2012'}. For these elements, we want to store the Code value as the rec_value. Update the line rec_value = None, so your if-else statement looks like the one shown here:

    if lookup_key == 'Numeric':
        rec_key = 'NUMERIC'
        rec_value = item.attrib['Numeric']
    else:
        rec_key = item.attrib[lookup_key]
        rec_value = item.attrib['Code']

    print rec_key, rec_value

Rerun the code, and you should now see that we have values for our rec_key and our rec_value. Let’s build the dictionary:

    if lookup_key == 'Numeric':
        rec_key = 'NUMERIC'
        rec_value = item.attrib['Numeric']
    else:
        rec_key = item.attrib[lookup_key]
        rec_value = item.attrib['Code']

    record[rec_key] = rec_value

: Adds each key and value to the record dictionary.

We also need to add each record to our all_data list. As we saw in “List Methods: Things Lists Can Do”, we can use the list’s append method to add values to our list. We need to append each record at the end of the outer for loop, as that is when it will have all of the keys for each of the subelements. Finally, we will add a print at the end of the file, to show our data.

Your full code to transform the XML tree to a dictionary should look like this:

from xml.etree import ElementTree as ET

tree = ET.parse('data-text.xml')
root = tree.getroot()

data = root.find('Data')

all_data = []

for observation in data:
    record = {}
    for item in observation:

        lookup_key = item.attrib.keys()[0]

        if lookup_key == 'Numeric':
            rec_key = 'NUMERIC'
            rec_value = item.attrib['Numeric']
        else:
            rec_key = item.attrib[lookup_key]
            rec_value = item.attrib['Code']

        record[rec_key] = rec_value

    all_data.append(record)

print all_data

Once you run the code, you will see a long list with a dictionary for each record, just like in the CSV example:

{'COUNTRY': 'ZWE', 'REGION': 'AFR', 'WORLDBANKINCOMEGROUP': 'WB_LI',
'NUMERIC': '49.00000', 'SEX': 'BTSX', 'YEAR': '2012',
'PUBLISHSTATE': 'PUBLISHED', 'GHO': 'WHOSIS_000002'}

As you can see, extracting data from the XML was a little more complicated. Sometimes CSV and JSON files won’t be as easy to process as they were in this chapter, but they are usually more easily processed than XML files. However, looking at the XML data allowed you to explore and grow as a Python developer, giving you a chance to create empty lists and dictionaries and populate them with data. You also honed your debugging skills as you explored how to extract data from the XML tree structure. These are valuable lessons in your quest to become a better data wrangler.

Summary

Being able to handle machine-readable data formats with Python is one of the must-have skills for a data wrangler. In this chapter, we covered the CSV, JSON, and XML file types. Table 3-2 provides a reminder of the libraries we used to import and manipulate the different files containing the WHO data.

Table 3-2. File types and file extensions
File type	File extensions	Python library
CSV, TSV	.csv, .tsv	`csv`
JSON	.json, .js	`json`

We also covered a few new Python concepts. At this point, you should know how to run Python code from the Python interpreter and how to save the code to a new file and run it from the command line. We also learned about importing files using import, and how to read and open files with Python on your local filesystem.

Other new programming concepts we covered include using for loops to iterate over files, lists, or trees and using if-else statements to determine whether a certain condition has been met and to do something depending on that evaluation. Table 3-3 summarizes those new functions and code logic you learned about here.

Table 3-3. New Python programming concepts
Concept	Purpose
`import`	Imports a module into the Python space
`open`	Built-in function that opens a file in Python on your system
`for` loop	A piece of code that runs n times
if-else statement	Runs a piece of code if a certain condition is met
`==` (equal to operator)	Tests to see if one value is equal to another
Indexing a sequence	Pulls out the nth object in the sequence (string, list, etc.)

Lastly, in this chapter we started to create and save a lot of code files and data files. Assuming you did all the exercises in this chapter, you should have three code files and three data files. Earlier in the chapter, there was a recommendation for how to organize your code. If you have not done this already, do it now. Here is one example of how to organize your data so far:

data_wrangling/
    code/
        ch3_easy_data/
            import_csv_data.py
            import_xml_data.py
            import_json_data.py
            data-text.csv
            data-text.xml
            data-json.json
        ch4_hard_data/
            ...

Now, on to harder data formats!

¹ To complete the exercises in this chapter, you will need a good text editor. If you haven’t already installed one, follow the instructions in “Install a Code Editor”.

² When using Python objects with a longer numeric and hex string, this is Python’s way of showing memory address information. It’s not necessary for our data wrangling needs so please ignore it if your memory addresses don’t look like ours.

Previous Chapter

2. Python Basics

Next Chapter

4. Working with Excel Files

Table of Contents for Data Wrangling with Python

Chapter 3. Data Meant to Be Read by Machines

Note

CSV Data

Note

How to Import CSV Data

Note

Tip

Note

Saving the Code to a File; Running from Command Line

Tip

Note

JSON Data

Note

How to Import JSON Data

Tip

XML Data

Note

Note

Warning

How to Import XML Data

Tip

Tip

Tip

Tip

Summary

Table of Contents for
Data Wrangling with Python