Chapter 8. Data Cleanup: Standardizing and Scripting

You’ve learned how to match, parse, and find duplicates in your data, and you’ve started exploring the wonderful world of data cleanup. As you grow to understand your datasets and the questions you’d like to answer with them, you’ll want to think about standardizing your data as well as automating your cleanup.

In this chapter, we’ll explore how and when to standardize your data and when to test and script your data cleanup. If you are managing regular updates or additions to the dataset, you’ll want to make the cleanup process as efficient and clear as possible so you can spend more time analyzing and reporting. We’ll begin by standardizing and normalizing your dataset and determining what to do if your dataset is not normalized.

Normalizing and Standardizing Your Data

Depending on your data and the type of research you are conducting, standardizing and normalizing your dataset might mean calculating new values using the values you currently have, or it might mean applying standardizations or normalizations across a particular column or value.

Normalization, from a statistical view, often has to do with calculating new values from a dataset to standardize the data on a particular scale. For example, you might need to normalize test scores onto a common scale so you can accurately view their distribution. You might also need to normalize data so you can accurately see percentiles, or compare percentiles across different groups (or cohorts).

Say you want to see the distribution of a team’s scores across a given season. You might first categorize them as wins, losses, and ties. You might then categorize those as points over, points under, and so on. You might also categorize minutes played and scores per minute. You have access to all of these datasets, and now you’d like to compare them across teams. If you wanted to normalize them, you might normalize total scores on a 0–1 scale. This would mean the outliers (top scores) would be close to 1 and the low scores would be close to 0. You could then use the distribution of that new data to see how many scores were in the middle, and if there were a lot in the low or high range. You could also identify outliers (i.e., if most of the scores are now between .3 and .4, then you know ones that don’t fall into that range might be outliers).
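
To make the 0–1 idea concrete, here is a minimal sketch of min-max normalization using a short, made-up list of scores (the numbers are purely illustrative):

scores = [22, 35, 41, 28, 95, 31]  # hypothetical season scores; 95 is an outlier

low, high = min(scores), max(scores)

# min-max normalization: rescale every score onto a 0-1 range
normalized = [(score - low) / float(high - low) for score in scores]

print(normalized)  # most values cluster near the low end; the outlier lands at 1.0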

What if you want to use standardization on that same data? For example, you could standardize the data and calculate the average number of scores per minute. Then you could chart your averages and see the distribution. What teams scored more per minute? Are there outliers?

You could also calculate the distribution by looking at standard deviations. We’ll cover standardization more fully in Chapter 9, but you are basically asking, what is the normal range of the data and what is outside of that range? Does the data follow a pattern or not?
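
To see what that looks like in practice, here is a rough sketch that standardizes the same made-up scores by hand, computing the mean, the (population) standard deviation, and a z-score for each value:

import math

scores = [22, 35, 41, 28, 95, 31]  # the same hypothetical scores as above

mean = sum(scores) / float(len(scores))
std_dev = math.sqrt(sum((score - mean) ** 2 for score in scores) / float(len(scores)))

# z-score: how many standard deviations each score sits from the mean
z_scores = [(score - mean) / std_dev for score in scores]

print(z_scores)  # the outlier (95) has by far the largest z-score (roughly 2.2)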

As you can see, normalization and standardization are not the same thing. However, they do often allow researchers or investigators to determine the distribution of their data and what that means for future research or calculations.

Sometimes standardizing and normalizing your data also requires removing outliers so you can better “see” the patterns and distribution of the data. Continuing with the same sports analogy, if you remove the top-scoring players’ scores from the entire league, does it drastically change the way the teams perform? If one player is responsible for half of his or her team’s scores, then yes, this could cause a dramatic shift.

Likewise, if one team always won by a significant number of points, removing that team from the league data might dramatically shift the distribution and averages. Depending on what problem you are trying to solve, you can use normalization, standardization, and removal of outliers (trimming your data) to help find the answers to your questions.
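
If you do decide to trim, one simple (and blunt) approach is to drop values that fall more than a chosen number of standard deviations from the mean. The helper below is a rough sketch; the threshold of two standard deviations is an arbitrary choice for illustration:

import math


def trim_outliers(values, num_std_devs=2):
    """Return only the values within num_std_devs standard deviations of the mean."""
    mean = sum(values) / float(len(values))
    std_dev = math.sqrt(
        sum((v - mean) ** 2 for v in values) / float(len(values)))
    return [v for v in values if abs(v - mean) <= num_std_devs * std_dev]


print(trim_outliers([22, 35, 41, 28, 95, 31]))  # drops the 95

After trimming, you can recompute your averages and distributions to see how much the outlier was skewing them.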

Saving Your Data

We’ve explored a few ways to save your data, so let’s review them now that we have data we can use. If you are utilizing a database and already know how you’d like to format your tables and save the data you have already cleaned, you should continue to use the Python library modules we reviewed in Chapter 6 to connect and save your data. For many of these libraries, you can open a cursor and commit directly to the database.

Tip

We highly recommend adding failure messages and try/except catches to your database scripts, in case you experience a network or database failure. We also recommend committing frequently, to help keep network or latency issues from affecting your script.
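
What that might look like depends on your database library, but here is a minimal sketch in the spirit of the dataset examples from Chapter 6. The save_row helper and its error message are our own additions for illustration, not part of the chapter's script:

import dataset

db = dataset.connect('sqlite:///data_wrangling.db')
table = db['unicef_survey']


def save_row(row_dict):
    """Insert one row, returning True on success and False on failure."""
    try:
        table.insert(row_dict)
        return True
    except Exception as error:  # in real code, catch the specific exceptions you expect
        print('Could not save row: %s' % error)
        return False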

If you are using the SQLite example we reviewed in Chapter 6, you’ll want to save your new clean data into your database. Let’s take a look at how we can go about that:

import dataset

db = dataset.connect('sqlite:///data_wrangling.db') 1

table = db['unicef_survey'] 2

for row_num, data in enumerate(zipped_data): 3
    for question, answer in data: 4
        data_dict = { 5
            'question': question[1], 6
            'question_code': question[0],
            'answer': answer,
            'response_number': row_num, 7
            'survey': 'mn',
        }

        table.insert(data_dict) 8
1

Here, we access our local database. If you stored the file in a different directory, make sure to change the connection string to reflect the database file's location relative to your current file path (e.g., if it's stored in the parent directory: sqlite:///../data_wrangling.db).

2

This line creates a new table, unicef_survey. Because we know many UNICEF surveys follow this pattern, this is a safe, clear name for our table.

3

We want to keep track of what row we are on so we have a number per response. This code uses the enumerate function so each entry (of each row/response) is easily linked in the database (they share a row number).

4

We know our data is broken into tuples, with our headers as a list in the first entry in the tuple and the responses to those questions in the second part of the tuple. This code uses a for loop so we can parse and save the data contained therein.

5

Each question and answer has its own entry in our database, so we can join together all of the responses for each row (i.e., interview). This code creates a dictionary with the necessary data for each response for each interview.

6

The plainly written question is the second entry in the list of the headers. This code saves that data as question and the UNICEF short code as question_code.

7

In order to keep track of each row of responses, or interview, this code includes the row_num from enumerate.

8

Finally, we insert our newly assembled dictionary into our database using our new table’s insert method.

We want to make sure our cleaned data is preserved in our SQLite database. We created a new table and used the enumerate function so we could later join together each response (row). If we need to access our data, we can open our new table and use the functions described in Chapter 6 to see all of our records and retrieve them as needed.
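
If you want to double-check what landed in the table, the dataset library's retrieval methods from Chapter 6 work as you'd expect. Here is a quick sketch, assuming the database path used above:

import dataset

db = dataset.connect('sqlite:///data_wrangling.db')
table = db['unicef_survey']

print(table.count())  # total number of question/answer rows saved

# pull every question and answer from the first interview (response_number 0)
for row in table.find(response_number=0):
    print('%s: %s' % (row['question_code'], row['answer']))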

If you instead would like to export the cleaned data to a simple file, it should be easy to do that as well. Let’s take a look:

from csv import writer


def write_file(zipped_data, file_name):
    with open(file_name, 'wb') as new_csv_file: 1
        wrtr = writer(new_csv_file) 2
        titles = [row[0][1] for row in zipped_data[0]] 3
        wrtr.writerow(titles) 4
        for row in zipped_data:
            answers = [resp[1] for resp in row] 5
            wrtr.writerow(answers)


write_file(zipped_data, 'cleaned_unicef_data.csv') 6
1

Uses with...as to assign the output of the first expression to the variable named after as. This assigns the file object that open(file_name, 'wb') creates to the variable new_csv_file. 'wb' means write in binary mode.

2

Initializes our CSV writer object by passing it an open file, and assigns the writer object to the wrtr variable.

3

Creates a list of the titles for the header row, since the writer object needs a list of data to write row by row. The longer titles reside in the second element of the first part of our tuple, so the code uses row[0][1].

4

Uses the writer object’s writerow method, which takes an iterable and turns it into a comma-separated row. This line writes the header row.

5

Uses list comprehension to pull out the responses (the second value in the tuple).

6

Writes each of the lists of responses created using list comprehension to our comma-separated data file.

Here we’ve used some new and some old syntax. We have learned how to use with...as to take the return value of a simple function or expression and assign it to a variable name. Here, we want to take our open file and assign it to our new_csv_file variable. This type of syntax is often used with files and other I/O objects, because once Python is done executing the code in the with block we created, it will close the file for us—bellissimo!

Additionally in this code, we’ve used our CSV writer, which operates similarly to our CSV reader. writerow helps us write a list containing each column of data to our CSV.

Warning

The writerow method expects an iterable object, so make sure you always pass a list or a tuple. If you pass a string, don’t be surprised to see some interesting CSVs (“l,i,k,e, ,t,h,i,s”).
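
If you'd like to see that behavior for yourself, here is a tiny throwaway sketch (the filename is arbitrary); it follows the chapter's Python 2 file modes:

from csv import writer

with open('scratch.csv', 'wb') as scratch_file:  # use 'w' and newline='' on Python 3
    wrtr = writer(scratch_file)
    wrtr.writerow('oops')              # a string is iterated character by character: o,o,p,s
    wrtr.writerow(['much', 'better'])  # a list becomes one clean comma-separated row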

We’ve also used list comprehensions to help make lists of both titles and answers. Because we don’t expect a new object or modified object from this function, we simply don’t return anything. This function is a good review of some of the concepts we’ve learned so far.

If you’d rather save it in another way, refer to the tips we gave in Chapter 6 on how to save your data. Once your cleaned data is saved, you can move on to scripting the rest of your cleanup process and analyzing your data.

Determining What Data Cleanup Is Right for Your Project

Depending on the reliability of your data and how often you will be analyzing it, you might choose a different path for data cleanup. If the data you are working with is haphazard or has many different sources, you might not be able to accurately script your cleanup.

Tip

You need to analyze how much time and effort it takes to completely script your cleanup and whether automating the cleanup will actually save you time.

If your cleanup is particularly onerous and involves many steps, you might want to create a repository of helper scripts. This can give you many functions to reuse throughout your data wrangling and enables you to process new data faster, even if you can’t write a script that goes through all of the steps in order. For example, you can have some scripts to search a list or matrix for duplicates, and some functions to export or import data to/from CSVs or format strings and dates. With this type of solution, you can import those functions and use them as you wish with IPython or Jupyter (which we’ll learn about in Chapter 10), or in other files in the repository.
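
For example, a (hypothetical) cleanup_helpers.py module in your repository might hold a small deduplication helper like the sketch below, which you could then import from IPython, Jupyter, or any other script:

# cleanup_helpers.py -- a hypothetical module of small, reusable cleanup functions

def find_duplicates(rows):
    """Return the rows that appear more than once in a list of rows."""
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates.append(row)
        seen.add(key)
    return duplicates

# elsewhere (IPython, Jupyter, or another script run from the repository root):
#     from cleanup_helpers import find_duplicates
#     dupes = find_duplicates(my_rows)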

If your cleanup code instead matches a determinate pattern and is unlikely to change, you can probably script the entire cleanup process.

Scripting Your Cleanup

As your Python knowledge deepens and grows, so will the Python you write. You are now able to write functions, parse files, import and use various Python libraries, and even store your data. It’s about time to start really scripting your code—and by that, we mean deciding how to structure your code for future use, learning, and sharing.

Let’s take our UNICEF data as an example. We know UNICEF will release these datasets every few years, and many of the data points will remain the same. It’s unlikely the survey will change significantly—it’s built on years of experience. Given these facts, we can likely depend on a fairly high level of consistency. If we are using UNICEF data again, we can probably reuse at least some of the same code we wrote for this first script.

We currently don’t have a lot of structure and certainly are missing code documentation. On top of making our code hard to read, this makes it difficult to reuse. Although our functions make sense to us now, could we accurately read and understand them in a year? Could we pass them to a coworker and expect our notes to be understood? Until we can answer these questions affirmatively, we may as well not have written any code. If we can’t read our code in a year’s time, it will be of no use, and someone (probably us) will have to rewrite it again when the new reports are released.

The Zen of Python applies not only to how you write your code, but to how you organize your code; name your functions, variables, and classes; and so on. It’s a good idea to take some time to consider your naming choices and determine what will be both clear to you and clear to others. Comments and documentation can help; however, your code should be somewhat legible on its own.
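
If you have never read the Zen of Python, it ships with the language itself; running the following in any Python interpreter prints the full list of aphorisms:

import this  # prints the Zen of Python aphorisms to the console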

Note

Python is often heralded as one of the easiest to read languages, even for those who can’t read code! Keep your syntax readable and clean, and then your documentation will not have to go to great lengths to explain what your code is doing.

Familiarize yourself with the PEP-8 Python Style Guide and stick to those rules. There are plenty of PEP-8 linters which read through your code and point out issues.

In addition to style standards and uses, you can use linters for evaluating code complexity. There are several that analyze code according to McCabe’s theories and calculations on cyclomatic complexity. Although you may not be able to break your code into simple chunks every time, you should aim to break down complex tasks into smaller, simpler ones and make your code less complicated and more explicit.

As you work to make your code clear and obvious, it’s also useful to make reusable chunks more generic. Beware of becoming too generic (def foo will help no one), but if you build generic helper functions you will reuse often (like making a CSV from a list or making a set from a list of duplicates), your code will be more organized, clean, and simple.

Tip

If all of your reports use the same code to connect with your database or to open a data file, you can make this a function. As you write generic helper functions, your goal is to create simple code that is readable, usable, and doesn’t repeat itself.
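
A generic CSV-export helper of that sort might look something like this sketch (the function name is our own choice, and the file mode follows the chapter's Python 2 conventions):

from csv import writer


def export_to_csv(rows, file_name):
    """Write a list of lists (or tuples) to a CSV file, one row per entry."""
    with open(file_name, 'wb') as csv_file:  # use 'w' and newline='' on Python 3
        wrtr = writer(csv_file)
        for row in rows:
            wrtr.writerow(row)


export_to_csv([['name', 'score'], ['abc', 10]], 'example_export.csv')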

Table 8-1 summarizes some coding best practices to think about as you go forward. These best practices don’t cover everything you’ll end up doing with Python and coding, but they provide a good foundation for further scripting and learning.

Table 8-1. Python coding best practices
Practice Description

Documentation

Include comments, function descriptions, and script clarifications throughout the code, as well as README.md files or any other necessary description in the repository structure.

Clear naming

All functions, variables, and files should have clear names that make their contents or intended use obvious.

Proper syntax

Variables and functions should follow proper Python syntax (generally lowercase with underscores between words, or CamelCase for class names) and the code should follow PEP-8 standards.

Imports

Only import what you need and use, and follow PEP-8 guidelines for your import structure.

Helper functions

Create abstract helper functions to make your code clear and reusable (e.g., export_to_csv to take a list and write a CSV export).

Repository organization

Organize your repository into a logical and hierarchical structure, so code used together is organized together and follows normal logical patterns.

Version control

All code should be under version control, so you or your colleagues can create new branches, try out new features, and still have a working master version of the repository.

Fast but clear

Use the syntactic sugar of Python to write fast and efficient code, but err on the side of clarity if the two are opposed.

Use libraries

When you need to do something someone has already coded in Python, don’t reinvent the wheel. Use good libraries and contribute to them to help the open source community.

Test your code

When applicable and possible, test your code by using test example data and writing tests for your individual functions.

Be specific

Use proper exceptions in your try blocks, be specific in your documentation, and use specific variable names.

Documenting your code is an essential part of writing any script. As Eric Holscher, Pythonista and cofounder of Write the Docs, aptly summarizes, there are many great reasons to document your code, the first being that you will probably need to use it again—or others may need to read and use it, or you may want to post it on GitHub, or you may want to use it in a future job interview, or you may want to send it to your mom. Whatever the reason, having documentation with, in, and throughout your code will save you hours of pain later. If you are a member of a team, it will save your team hundreds of hours of pain. The payoff is worth the effort of sitting down to analyze what your code is attempting to do and why.

There is a lot of great advice and help available from organizations like Read the Docs and Write the Docs to make writing documentation easier. A good rule of thumb is to have a README.md in the root of your project directory with a brief rundown of what the code does, how to install and run it, what the requirements are, and where to find out more information.

Tip

Sometimes a small code sample or example in your README.md is also useful, depending on how much interaction the user (reader) is going to have with the core components.

In addition to your README.md file, you’ll want to add code comments. As we saw in Chapter 5, these can vary from quick notes to yourself to longer comments documenting how to use your scripts and functions.

Note

Thanks to PEP-350, the syntax and use of different types of commenting in Python are well documented. Following these standards will make your comments easy for everyone to understand.

Let’s take a stab at documenting what we’ve been doing in our cleanup chapters. To get our creative documentation juices flowing, we’ll start with a simple list of the tasks we set out to complete:

  • Import data from UNICEF data files.

  • Locate headers for data rows.

  • Properly match headers we can read with cryptic built-in headers.

  • Parse the data to see if we have dupes.

  • Parse the data to see if we have missing data.

  • Merge data with other rows based on household.

  • Save data.

These are more or less in chronological order, and listing them takes some of the agony out of figuring out how to organize and script our code and how to document our new script.

One of the first things we need to do is organize all of the chunks of code we have written in this chapter and the previous chapter into one script. Once we have them all together, we can start to follow our rules for writing good code. Let’s take a look at our script so far:

from csv import reader
import dataset

data_rdr = reader(open('../../../data/unicef/mn.csv', 'rb'))
header_rdr = reader(open('../../../data/unicef/mn_headers_updated.csv', 'rb'))

data_rows = [d for d in data_rdr]
header_rows = [h for h in header_rdr if h[0] in data_rows[0]]

all_short_headers = [h[0] for h in header_rows]

skip_index = []
final_header_rows = []

for header in data_rows[0]:
    if header not in all_short_headers:
        print header
        index = data_rows[0].index(header)
        if index not in skip_index:
            skip_index.append(index)
    else:
        for head in header_rows:
            if head[0] == header:
                final_header_rows.append(head)
                break

new_data = []

for row in data_rows[1:]:
    new_row = []
    for i, d in enumerate(row):
        if i not in skip_index:
            new_row.append(d)
    new_data.append(new_row)

zipped_data = []

for drow in new_data:
    zipped_data.append(zip(final_header_rows, drow))

# look for missing

for x in zipped_data[0]:
    if not x[1]:
        print x

# look for dupes

set_of_keys = set([
    '%s-%s-%s' % (x[0][1], x[1][1], x[2][1]) for x in zipped_data])

uniques = [x for x in zipped_data if not
           set_of_keys.remove('%s-%s-%s' %
                              (x[0][1], x[1][1], x[2][1]))]

print len(set_of_keys)

# save to db

db = dataset.connect('sqlite:///../../data_wrangling.db')

table = db['unicef_survey']

for row_num, data in enumerate(zipped_data):
    for question, answer in data:
        data_dict = {
            'question': question[1],
            'question_code': question[0],
            'answer': answer,
            'response_number': row_num,
            'survey': 'mn',
        }

        table.insert(data_dict)

We can see most of our code is flat, meaning we don’t have nested levels of importance. Much of the code and functions sit without indentation or documentation in the file. It’s not well abstracted, and the variable names are unclear. Let’s start working on parts of that, beginning at the top. The first few lines repeat the same pattern (opening a CSV file and reading its rows into a list), so let’s write a function to do that instead:

def get_rows(file_name):
    rdr = reader(open(file_name, 'rb'))
    return [row for row in rdr]

Now we can use this function to shorten our file. Let’s take a look at the next section of code and see if we can improve it as well.

We are spending time rewriting our header_rows to align with headers from data_rows; however, that bit of code is no longer needed. Because we create the final_header_rows from matches between the two, we don’t need to worry about header_rows with no data_rows to match. We can remove that line.

The next chunk of code deals with creating the final_header_rows and skip_index lists. We can summarize both as working to eliminate nonmatching elements so we can zip our final list. Let’s put them together in one function:

def eliminate_mismatches(header_rows, data_rows):
    all_short_headers = [h[0] for h in header_rows]
    skip_index = []
    final_header_rows = []

    for header in data_rows[0]:
        if header not in all_short_headers:
            index = data_rows[0].index(header)
            if index not in skip_index:
                skip_index.append(index)
        else:
            for head in header_rows:
                if head[0] == header:
                    final_header_rows.append(head)
                    break
    return skip_index, final_header_rows

We have now combined even more sections of our cleanup into functions. This helps us delineate what each function does and documents our code so if (or should we say, when) we need to update it, we know exactly where to look.

Let’s read on in our script and see if we have more contenders. It looks like the next section creates our zipped dataset. We could make this two functions: one to whittle each data row down to only the entries that match the headers, and another to zip those rows with the headers. We could also leave it as one function to create the zipped data. In the end, it’s up to you to determine what might fit best. Here, we will keep it as one function with a smaller helper function, in case we need it again:

def zip_data(headers, data):
    zipped_data = []
    for drow in data:
        zipped_data.append(zip(headers, drow))
    return zipped_data


def create_zipped_data(final_header_rows, data_rows, skip_index):
    new_data = []
    for row in data_rows[1:]:
        new_row = []
        for index, data in enumerate(row):
            if index not in skip_index:
                new_row.append(data)
        new_data.append(new_row)
    zipped_data = zip_data(final_header_rows, new_data)
    return zipped_data

With our new functions, we were able to preserve our code, clear up some variable names, and add a helper function to zip headers with rows of data and return the list of zipped data. The code is clearer and broken up more appropriately. We’re going to continue to apply the same logic to the rest of the file. Let’s take a look at the result:

from csv import reader
import dataset


def get_rows(file_name):
    rdr = reader(open(file_name, 'rb'))
    return [row for row in rdr]


def eliminate_mismatches(header_rows, data_rows):
    all_short_headers = [h[0] for h in header_rows]
    skip_index = []
    final_header_rows = []

    for header in data_rows[0]:
        if header not in all_short_headers:
            index = data_rows[0].index(header)
            if index not in skip_index:
                skip_index.append(index)
        else:
            for head in header_rows:
                if head[0] == header:
                    final_header_rows.append(head)
                    break
    return skip_index, final_header_rows


def zip_data(headers, data):
    zipped_data = []
    for drow in data:
        zipped_data.append(zip(headers, drow))
    return zipped_data


def create_zipped_data(final_header_rows, data_rows, skip_index):
    new_data = []
    for row in data_rows[1:]:
        new_row = []
        for index, data in enumerate(row):
            if index not in skip_index:
                new_row.append(data)
        new_data.append(new_row)
    zipped_data = zip_data(final_header_rows, new_data)
    return zipped_data


def find_missing_data(zipped_data):
    missing_count = 0
    for response in zipped_data:
        for question, answer in response:
            if not answer:
                missing_count += 1
    return missing_count


def find_duplicate_data(zipped_data):
    set_of_keys = set([
        '%s-%s-%s' % (row[0][1], row[1][1], row[2][1])
        for row in zipped_data])

    uniques = [row for row in zipped_data if not
               set_of_keys.remove('%s-%s-%s' %
                                  (row[0][1], row[1][1], row[2][1]))]

    return uniques, len(set_of_keys)


def save_to_sqlitedb(db_file, zipped_data, survey_type):
    db = dataset.connect(db_file)

    table = db['unicef_survey']
    all_rows = []

    for row_num, data in enumerate(zipped_data):
        for question, answer in data:
            data_dict = {
                'question': question[1],
                'question_code': question[0],
                'answer': answer,
                'response_number': row_num,
                'survey': survey_type,
            }
            all_rows.append(data_dict)

    table.insert_many(all_rows)

Now we have a bunch of nice functions, but we’ve gutted how the program runs. If we run this script right now, no lines of code execute. It’s just a set of written functions never called by anything.

We need to now work on re-creating how to use all these steps in a main function. The main function is often where Python developers will put code intended to run via the command line. Let’s add our main function with the code to clean our datasets:

""" This section goes at the bottom of the script we've already written. """

def main():
    data_rows = get_rows('data/unicef/mn.csv')
    header_rows = get_rows('data/unicef/mn_headers_updated.csv')
    skip_index, final_header_rows = eliminate_mismatches(header_rows,
                                                          data_rows)
    zipped_data = create_zipped_data(final_header_rows, data_rows, skip_index)
    num_missing = find_missing_data(zipped_data)
    uniques, num_dupes = find_duplicate_data(zipped_data)
    if num_missing == 0 and num_dupes == 0:
        save_to_sqlitedb('sqlite:///data/data_wrangling.db', zipped_data, 'mn')
    else:
        error_msg = ''
        if num_missing:
            error_msg += 'We are missing {} values. '.format(num_missing)
        if num_dupes:
            error_msg += 'We have {} duplicates. '.format(num_dupes)
        error_msg += 'Please have a look and fix!'
        print error_msg


if __name__ == '__main__':
    main()

Now we have an executable file we can run from the command line. What happens when you run this file? Do you get our newly created error message, or has your data been saved to your local SQLite database?

Tip

If you run into any errors, try checking to make sure your code looks exactly like this and you are using the proper file paths to the data from the repository and the local database you created in Chapter 6.

Now let’s put some work into documenting our code. We’re going to add some docstrings to our functions, some inline notes so we can easily read the more complex bits of our script, and a larger explanation at the top of the script that we might move to a README.md file:

"""
Usage: python our_cleanup_script.py

This script is used to intake the male survey data from UNICEF
and save it to a simple database file after it has been checked
for duplicates and missing data and after the headers have been properly
matched with the data. It expects there to be a 'mn.csv' file with the
data and the 'mn_headers_updated.csv' file in a subfolder called 'unicef' within
a data folder in this directory. It also expects there to be a SQLite
file called 'data_wrangling.db' in the root of this directory. Finally,
it expects to utilize the dataset library
(http://dataset.readthedocs.org/en/latest/).

If the script runs without finding any errors, it will save the
cleaned data to the 'unicef_survey' table in the SQLite database.
The saved data will have the following structure:
    - question: string
    - question_code: string
    - answer: string
    - response_number: integer
    - survey: string

The response number can later be used to join entire responses together
(i.e., all entries with response_number 3 come from the same interview, etc.).

If you have any questions, please feel free to contact me via ...
"""

from csv import reader
import dataset


def get_rows(file_name):
    """Return a list of rows from a given csv filename."""
    rdr = reader(open(file_name, 'rb'))
    return [row for row in rdr]


def eliminate_mismatches(header_rows, data_rows):
    """
    Return index numbers to skip in a list and final header rows in a list
    when given header rows and data rows from a UNICEF dataset. This
    function assumes the data_rows object has headers in the first element.
    It assumes those headers are the shortened UNICEF form. It also assumes
    the first element of each header row in the header data is the
    shortened UNICEF form. It will return the list of indexes to skip in the
    data rows (ones that don't match properly with headers) as the first element
    and will return the final cleaned header rows as the second element.
    """
    all_short_headers = [h[0] for h in header_rows]
    skip_index = []
    final_header_rows = []

    for header in data_rows[0]:
        if header not in all_short_headers:
            index = data_rows[0].index(header)
            if index not in skip_index:
                skip_index.append(index)
        else:
            for head in header_rows:
                if head[0] == header:
                    final_header_rows.append(head)
                    break
    return skip_index, final_header_rows


def zip_data(headers, data):
    """
    Return a list of zipped data when given a header list and data list. Assumes
    the length of data elements per row and the length of headers are the same.

    example output: [(['question code', 'question summary', 'question text'],
                      'resp'), ....]
    """
    zipped_data = []
    for drow in data:
        zipped_data.append(zip(headers, drow))
    return zipped_data


def create_zipped_data(final_header_rows, data_rows, skip_index):
    """
    Returns a list of zipped data rows (matching header and data) when given a
    list of final header rows, a list of data rows, and a list of indexes on
    those data rows to skip as they don't match properly. The function assumes
    the first row in the data rows contains the original data header values,
    and will remove those values from the final list.
    """
    new_data = []
    for row in data_rows[1:]:
        new_row = []
        for index, data in enumerate(row):
            if index not in skip_index:
                new_row.append(data)
        new_data.append(new_row)
    zipped_data = zip_data(final_header_rows, new_data)
    return zipped_data


def find_missing_data(zipped_data):
    """
    Returns a count of how many answers are missing in an entire set of zipped
    data. This function assumes all responses are stored as the second element.
    It also assumes every response is stored in a list of these matched question,
    answer groupings. It returns an integer.
    """
    missing_count = 0
    for response in zipped_data:
        for question, answer in response:
            if not answer:
                missing_count += 1
    return missing_count


def find_duplicate_data(zipped_data):
    """
    Returns a list of unique elements and a number of duplicates found when given
    a UNICEF zipped_data list. This function assumes that the first three rows of
    data are structured to have the house, cluster, and line number of the
    interview and uses these values to create a unique key that should not be
    repeated.
    """

    set_of_keys = set([
        '%s-%s-%s' % (row[0][1], row[1][1], row[2][1])
        for row in zipped_data])

    #TODO: this will throw an error if we have duplicates- we should find a way
    #around this
    uniques = [row for row in zipped_data if not
               set_of_keys.remove('%s-%s-%s' %
                                  (row[0][1], row[1][1], row[2][1]))]

    return uniques, len(set_of_keys)


def save_to_sqlitedb(db_file, zipped_data, survey_type):
    """
    When given a path to a SQLite file, the cleaned zipped_data, and the
    UNICEF survey type that was used, saves the data to SQLite in a
    table called 'unicef_survey' with the following attributes:
        question, question_code, answer, response_number, survey
    """
    db = dataset.connect(db_file)

    table = db['unicef_survey']
    all_rows = []

    for row_num, data in enumerate(zipped_data):
        for question, answer in data:
            data_dict = {
                'question': question[1],
                'question_code': question[0],
                'answer': answer,
                'response_number': row_num,
                'survey': survey_type,
            }
            all_rows.append(data_dict)

    table.insert_many(all_rows)


def main():
    """
    Import all data into rows, clean it, and then if
    no errors are found, save it to SQlite.
    If there are errors found, print out details so
    developers can begin work on fixing the script
    or seeing if there is an error in the data.
    """

    #TODO: we probably should abstract these files so that we can pass
    # them in as variables and use the main function with other surveys
    data_rows = get_rows('data/unicef/mn.csv')
    header_rows = get_rows('data/unicef/mn_headers_updated.csv')
    skip_index, final_header_rows = eliminate_mismatches(header_rows,
                                                         data_rows)
    zipped_data = create_zipped_data(final_header_rows, data_rows, skip_index)
    num_missing = find_missing_data(zipped_data)
    uniques, num_dupes = find_duplicate_data(zipped_data)
    if num_missing == 0 and num_dupes == 0:
        #TODO: we probably also want to abstract this
        # file away, or make sure it exists before continuing
        save_to_sqlitedb('sqlite:///data_wrangling.db', zipped_data, 'mn')
    else:
        #TODO: eventually we probably want to log this, and
        # maybe send an email if an error is thrown rather than print it
        error_msg = ''
        if num_missing:
            error_msg += 'We are missing {} values. '.format(num_missing)
        if num_dupes:
            error_msg += 'We have {} duplicates. '.format(num_dupes)
        error_msg += 'Please have a look and fix!'
        print error_msg


if __name__ == '__main__':
    main()

Our code is now better documented and organized, and it has a clear set of reusable functions. This is a great start for our first script. Hopefully, we can use this code to import many sets of UNICEF data!

Note

We’ve also put in some “TODO” notes for ourselves so we can improve the script over time. Which issues do you think are the most pressing? Why? Can you take a stab at fixing them?

We only have one file to run our code. However, as your code grows, your repository will as well. It’s important to think early on about what you might add to your repository over time. Repository structure and code structure are pretty similar in this respect. If you think this repository might be used for more than just UNICEF data parsing, you might want to organize it differently.

How so? For one thing, you might want to keep the data in a separate file. In fact, depending on how large your repository might grow, you might want the different data parsers and cleaners in separate folders.

Tip

Don’t worry too much about these decisions at the beginning. As you get better at Python and understanding your datasets, it will begin to be more obvious to you where to begin.

In terms of organizing your repository, it is fairly common to have a utils or common folder where you can store parts of the script shared among sections of code. Many developers store things like database connection scripts, commonly used API code, and communication or email scripts in such a folder, so they can be imported easily into any other script.
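
A databases.py file in that utils folder might be as small as the following sketch, which wraps the dataset connection string so every script in the repository connects the same way (the path and function name are assumptions):

# utils/databases.py -- one place for the repository's database connection logic
import dataset

DB_URL = 'sqlite:///data_wrangling.db'  # adjust to wherever your database file lives


def get_table(table_name, db_url=DB_URL):
    """Connect to the project database and return the requested table."""
    db = dataset.connect(db_url)
    return db[table_name]

# in any other script (assuming utils has an __init__.py so it imports as a package):
#     from utils.databases import get_table
#     table = get_table('unicef_survey')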

Depending on how the rest of your repository is managed, you might have several directories set up to contain different aspects of your project. One directory could be related only to UNICEF data. Another could contain web-scraping scripts or final reporting code. How you organize your repository is up to you. Always opt for being clear, obvious, and organized.

If you end up needing to reorganize your repository later, it will be far less awful if you took time to organize it in the first place. If, on the other hand, your repository is rife with 800-line files and no clear documentation, you’ll have quite a task at hand. The best rule of thumb is to outline a good starting point for organization, and do occasional housekeeping as your repository grows and changes.

Outside of file organization, naming your directories, files, functions, and classes in clear and obvious ways will also help. In the utils folder, you might have half a dozen files. If you name them utils1, utils2, and so on, you will always need to look and see what they contain. However, if you have named them emailer.py, databases.py, twitter_api.py, and so on, the filenames will give you more information.

Being explicit in every aspect of your code is a great start to a long and prosperous career in Python data wrangling. Let’s think about our repository and organize how we might expect to find our files:

data_wrangling_repo/
|-- README.md
|-- data_wrangling.db
|-- data/
|   `-- unicef/
|       |-- mn.csv
|       |-- mn_headers_updated.csv
|       |-- wm.csv
|       `-- wm_headers.csv
|-- scripts/
|   `-- unicef/
|       `-- unicef_cleanup.py (script from this chp)
`-- utils/
    |-- databases.py
    `-- emailer.py

We haven’t yet written a databases or emailer file, but we probably should. What else could we add to the structure? Why do you think we made two different unicef folders in our repository? Should developers separate their data files from the script files?

Note

While your project’s folder structure may look like this, bear in mind that data is not usually stored or housed in a repository. Keep the data files for your project on a shared file server or somewhere on the local network. If you are working alone, make sure you have a backup somewhere. Don’t check these large files into your repository. Not only will it slow down work if you need to check out the repository on a new device, but it’s also not a smart way to manage your data.

We would advise against checking your db files or any log or config files into your repository. Do your best to build the structure in a convenient way. You can always add the expected structure of the files in your README.md and provide details about where to get the data files.

Creating a logical repository structure and using README.md and .gitignore files allows you to have an organized project folder with modular code and avoid storing large data files or potentially sensitive data (database or login data) in your repository.

Testing with New Data

Now that we’ve documented, scripted, and organized our code, we should write some tests or try it out using test data. This helps ensure we’ve properly executed what we’d like to see and keeps our code well defined. Because one of the reasons we scripted the data cleanup was so we could reuse it, testing it with new data proves our time and effort standardizing the code were appropriately spent.

One way we can test the script we have just written is to see how easily we can apply it to similar data we find on the UNICEF website. Let’s take a look and see. You should have the wm.csv and wm_headers.csv files from the repository. These files are data for the women’s survey from the Zimbabwe UNICEF data.

Let’s try using those files in our script instead of the men’s survey data. To do so, we change just those two filenames from our cleanup script to point to the two women’s survey data files. We should also change the survey type to 'wm', so we can differentiate the data found in each set.

Note

The women’s dataset is significantly larger than the men’s. If you have unsaved data, we recommend saving it and closing other programs before proceeding. On that note, it’s probably a good idea to start considering how we can improve memory usage in our script.
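
One low-effort way to ease that memory pressure is to insert rows in batches rather than building one enormous list before saving. Here is a hedged variation on save_to_sqlitedb (the batch size of 10,000 is an arbitrary choice):

def save_to_sqlitedb_in_batches(db_file, zipped_data, survey_type, batch_size=10000):
    """Like save_to_sqlitedb, but flushes rows to the database in batches."""
    db = dataset.connect(db_file)
    table = db['unicef_survey']
    batch = []

    for row_num, data in enumerate(zipped_data):
        for question, answer in data:
            batch.append({
                'question': question[1],
                'question_code': question[0],
                'answer': answer,
                'response_number': row_num,
                'survey': survey_type,
            })
        if len(batch) >= batch_size:
            table.insert_many(batch)  # write out the current batch
            batch = []

    if batch:
        table.insert_many(batch)      # write out whatever is left over

The full zipped_data list still sits in memory, so a more thorough fix would stream rows straight from the CSV, but batching the inserts is an easy first step.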

Let’s take a look and see if it successfully imported our data:

import dataset

db = dataset.connect('sqlite:///data_wrangling.db')

wm_count = db.query("select count(*) from unicef_survey where survey='wm'") 1

count_result = wm_count.next() 2

print count_result
1

We use a direct query so we can quickly see the number of rows we have where survey='wm'. This should only include rows from the second run, where we set the type to 'wm'.

2

This reads the result of our query, using the query response’s next method to pull the first result. Because we used count, we should only have one response with the count.

So, we successfully imported more than 3 million questions and answers from our women’s dataset. We know our script works, and we can see the results!

Testing your script using similar data is one way to go about ensuring it works as intended. It can also show your script is built generally enough for reuse. However, there are many other ways to test your code. Python has quite a few good testing libraries to help you write test scripts and utilize test data (and even test API responses) so you can ensure your code functions properly.

There are several testing modules built into Python’s standard library. unittest provides unit testing for your Python code. It comes with some nice built-in classes and assertion methods to test whether your code is working. If we were to write unit tests for our code, we could write one to assert the get_rows function returns a list. We could also assert that the length of the list and the number of data lines in the file are the same. Each function can have these tests and assertions.
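
A unittest version of that get_rows check might look something like the sketch below. It assumes the cleanup script lives in a file named unicef_cleanup.py and that a small fixture file with ten lines sits in a test_data folder; both names are assumptions, not part of the chapter's repository:

import unittest

from unicef_cleanup import get_rows


class TestGetRows(unittest.TestCase):

    def test_returns_list_of_rows(self):
        rows = get_rows('test_data/small_sample.csv')  # a small fixture CSV
        self.assertIsInstance(rows, list)
        # the fixture is assumed to contain exactly ten lines
        self.assertEqual(len(rows), 10)


if __name__ == '__main__':
    unittest.main()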

Another popular Python testing framework is nose. nose is a very powerful testing framework with quite a lot of options in terms of extra plugins and configuration. It’s great if you have a large repository with different testing requirements and many developers working on the same code.

Can’t decide which one to start with? Then pytest might be right for you. It allows you to write tests in either style and switch if need be. It also has a fairly active community for talks and tutorials, so it’s a great place to start if you want to learn more and then start writing your tests.
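
For comparison, the same check in pytest's plainer style is just a function with assert statements, saved in a file whose name starts with test_ and run with the pytest command (the filenames carry over the same assumptions as above):

# test_unicef_cleanup.py
from unicef_cleanup import get_rows


def test_get_rows_returns_list():
    rows = get_rows('test_data/small_sample.csv')
    assert isinstance(rows, list)
    assert len(rows) == 10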

Note

Normally, your test suite would be organized with a test file in each module (i.e., given our current repository structure, we would put a test file in each directory we have except for our data and configuration folders). Some people write a test file for every Python file in each folder, so that it’s easy to see where the tests are for specific files. Others use a separate directory including the tests, with a structure that maps to the same Python files in the script part of the structure.

Whatever test style or organization you choose, make sure you are consistent and clear. That way you’ll always know where to look for the tests, and you (and others) will be able to run them as needed.

Summary

In this chapter, we covered some basics in terms of standardizing your data, and when it might be useful to normalize your data or remove outliers. You were able to export your clean data (from your work in Chapter 6) into your database or a local file, and you began writing more coherent functions for those repetitive processes.

Additionally, you’ve worked on organizing your Python repository with nested folders and properly named files, and started to document and analyze your code. Finally, you had a basic introduction to testing, and some of the tools you’ll have at your disposal when you start writing your tests.

Table 8-2 lists the Python concepts covered in this chapter.

Table 8-2. New Python and programming concepts and libraries
Concept/Library Purpose

Dataset insert method

Allows you to easily store your data into your SQLite database using an insert command.

CSV writer object

Lets you store your data in a CSV using the csv writer class.

Zen of Python (import this)

A philosophy for how to write and think like a Python programmer.

Python best practices

A basic outline of some best practices to follow as a new Python developer.

Python command-line execution (if __name__ == '__main__':)

Formatting a script with this block allows you to run your main function from the command line.

TODO notation

Allows you to easily see what needs to be done for your script via commenting.

Git

Version control system to help track changes in your code. Absolutely essential for code you want to deploy or use with others, but also incredibly useful on a local solo project. More on Git in Chapter 14.

In the next chapter, you’ll keep honing those cleanup and data analysis skills and use them to help prep a new dataset, as we move on to analyzing your data.