Chapter 7. Data Cleanup: Investigation, Matching, and Formatting

Cleaning up your data is not the most glamourous of tasks, but it’s an essential part of data wrangling. Becoming a data cleaning expert requires precision and a healthy knowledge of your area of research or study. Knowing how to properly clean and assemble your data will set you miles apart from others in your field.

Python is well designed for data cleanup; it helps you build functions around patterns, eliminating repetitive work. As we’ve already seen in our code so far, learning to fix repetitive problems with scripts and code can turn hours of manual work into a script you run once.

In this chapter, we will take a look at how Python can help you clean and format your data. We’ll also use Python to locate duplicates and errors in our datasets. We will continue learning about cleanup, especially automating our cleanup and saving our cleaned data, in the next chapter.

Why Clean Data?

Some data may come to you properly formatted and ready to use. If this is the case, consider yourself lucky! Most data, even if it is cleaned, has some formatting inconsistencies or readability issues (e.g., acronyms or mismatched description headers). This is especially true if you are using data from more than one dataset. It’s unlikely your data will properly join and be useful unless you spend time formatting and standardizing it.

Note

Cleaning your data makes for easier storage, search, and reuse. As we explored in Chapter 6, it’s much easier to store your data in proper models if it’s cleaned first. Imagine if you had columns or fields in your dataset which should be saved as a particular data type (such as dates or numbers or email addresses). If you can standardize what you expect to see and clean or remove records that don’t fit, then you ensure your data’s consistency and eliminate hard work later when you need to query assets in your dataset.

If you’d like to present your findings and publish your data, you’ll want to publish the cleaned version. This gives other data wranglers the opportunity to easily import and analyze the data. You can also publish the raw data alongside your finished dataset with notations on what steps you took to clean and normalize it.

As we work on cleaning our data, for our benefit and the benefit of others, we want to document the steps we have taken so we can accurately defend our dataset and its use in our studies. By documenting our process, we ensure we can reproduce it when new data comes out.

Tip

One powerful tool if you are using IPython to interact with your data is to use the IPython magic commands, such as %logstart or to start logging and %save to save your session for later use. This way you can begin building scripts, not just hacking in a Python terminal. As your Python knowledge increases, you can refine the scripts to share with others. For more reading on IPython, check out Appendix F.

Let’s start by investigating data cleaning basics, learning how to format our data and properly match datasets together.

Data Cleanup Basics

If you have been working through the code in the preceding chapters, you have already used some data cleanup concepts. In Chapter 4, we worked on importing data from Excel sheets and creating a dictionary to represent that data. Modifying and standardizing the data into a new data format is data cleanup.

Because we’ve already investigated some UNICEF datasets related to child labor (see “Child Labor”), let’s dive into the raw UNICEF data. The initial datasets most UNICEF reports accumulate are the Multiple Indicator Cluster Surveys (MICS). These surveys are household-level surveys performed by UNICEF workers and volunteers to help research the living conditions of women and children throughout the world. In looking through the latest surveys, we pulled some data from Zimbabwe’s latest MICS to analyze.

To begin our analysis, we downloaded the latest surveys after first requesting access for educational and research purposes from UNICEF. After getting access (which took approximately a day), we were able to download the raw datasets. Most MICS raw data is in SPSS format, or .sav files. SPSS is a program used by social scientists to store and analyze data. It’s a great tool for some social science statistics, but it’s not very useful for our Python needs.

In order to convert the SPSS files to something we can use, we used the open source project PSPP to view the data, and then a few simple R commands to convert the SPSS data into .csv files for easy use with Python. There are some good projects using Python to interact with SPSS files as well, but they required more work and setup than the R commands. You’ll find the updated CSV in this book’s repository.

Let’s get started with our data cleanup by diving into the files and taking a look at the data. Often your first cleanup steps come from a simple visual analysis. Let’s dig into our files and see what we find!

Identifying Values for Data Cleanup

We begin our data cleanup with a simple review of the fields we find and any visual inconsistencies we can see. If you start your data cleanup by making your data look cleaner, you will have a good idea of the initial problems you must conquer as you normalize your data.

Let’s take a look at our mn.csv file. The file contains raw data and uses codes (acronyms) as headers which likely contain some easily translatable meaning. Let’s take a look at the column headers in our mn.csv file:

"","HH1","HH2","LN","MWM1","MWM2", ...

Each of these represents a question or data in the survey, and we’d like the more human-readable versions. Searching via Google, we locate the human-readable values for those headings on the World Bank site for sharing MICS data.

Tip

Take time to first investigate whether data like the abbreviation listing on the World Bank site exists to help with your cleanup needs. You can also pick up the phone and give the organization a call to ask if they have an easy-to-use abbreviation list.

Using some web scraping skills you’ll become acquainted with in Chapter 11, we were able to get a CSV of these headers with their English variants and the questions used to calculate their values from the World Bank site for MICS data. We’ve included the new headers from our web scraper in the book’s repository (mn-headers.csv). We want to match up this data with our survey data so we have readable questions and answers. Let’s look at a few ways we can do that.

Replacing headers

The most straightforward and obvious way to make the headers more readable is to merely replace the short headers with longer English ones we can understand. How might we go about header substitution using Python? First, we’ll need to import both the mn.csv and mn-headers.csv files using the csv module we learned about in Chapter 3 (see the following code for importing). Throughout this chapter and the following chapters, feel free to write code in either scripts or in your terminal (such as IPython). This will allow you to interact with the data before saving it to a file:

from csv import DictReader

data_rdr = DictReader(open('data/unicef/mn.csv', 'rb'))
header_rdr = DictReader(open('data/unicef/mn_headers.csv', 'rb'))

data_rows = [d for d in data_rdr] 
header_rows = [h for h in header_rdr]

print data_rows[:5] 
print header_rows[:5]

: This code writes the iterable DictReader object into a new list so we can preserve the data and reuse it. We’re using the list generator format so we can do it in one simple line of code that’s readable and clear.
: This prints just a slice of the data, by using the Python list’s slice method to show the first five elements of our new lists and get an idea of the content.

In the fourth line of code, we used a list generator function for Python. Python list generators have the following format:

[func(x) for x in iter_x]

A list generator starts and ends with list brackets. Then it takes an iterable object (iter_x) and passes each row or value from iter_x into func(x) to create new values for the new list. Here, we are not doing anything with the function part of the list generator; we only want the row as it currently stands. In future chapters, we’ll use the ability to pass each row or value from the iterable into a function to clean or change the data before we put it into the list. List generators are a great example of the easy-to-read and easy-to-use syntax Python is well known for. You could achieve the same functionality using a for loop, but that would require more code:

new_list = []
for x in iter_x:
    new_list.append(func(x))

As you can see, using a list generator saves us a few lines of code, and offers superior performance and memory efficiency.

We want to replace the data_row dictionary headers with the readable headers from our file. As we can see from our output, the header_rows dictionaries hold both the short and longer values. The current short headers are contained under the Name field, and the longer, more readable headers are saved under the Label field. Let’s see how easily we can match them up using some Python string methods:

for data_dict in data_rows: 
    for dkey, dval in data_dict.items(): 
        for header_dict in header_rows: 
            for hkey, hval in header_dict.items():
                if dkey == hval: 
                    print 'match!'

: Iterates over each of our data records. We will try to use the keys in each of these dictionaries to match our headers.
: Iterates over each key and value in each data row so we can replace all of the keys with the more readable header labels (to view each key-value pair in the data dictionary, we use the Python dictionary’s items method).
: Iterates over all header rows of data, so we can get the readable labels. It’s not the fastest way to do this, but it makes sure we don’t miss any.
: Prints found matches between the data list keys (MWB3, MWB7, MWB4, MWB5…) and the header dictionary data.

By running the code, we see we have many matches. Let’s see if we can use similar logic to replace the titles with better ones. We know we can match them really easily. However, we only found the row we wanted to match. Let’s see if we can figure out how to match the keys from the data list with the values from the row in the header row we found:

new_rows = [] 

for data_dict in data_rows:
    new_row = {} 
    for dkey, dval in data_dict.items():
        for header_dict in header_rows:
            if dkey in header_dict.values(): 
                new_row[header_dict.get('Label')] = dval 
    new_rows.append(new_row)

: Creates a new list to populate with cleaned rows.
: Creates a new dictionary for each row.
: Here, we use the dictionary’s values method instead of iterating over every key and value of the header rows. This method returns a list of only the values in that dictionary. We are also using Python’s in method, which tests whether an object is a member of a list. For this line of code, the object is our key, or the abbreviated string, and the list is the values of the header dictionary (which contains the abbreviated headers). When this line is true, we know we have found the matching row.
: Adds to our new_row dictionary every time we find a match. This sets the dictionary key equal to the Label value in the header row, replacing those short Name values with the longer, more readable Label values, and keeps the values set to the data row values.
: Appends the new cleaned dictionary we created to our new array. This is indented to ensure we have all the matches before going to the next row.

You can see from a simple print of the first record of our new values that we have successfully made the data readable:

In [8]: new_rows[0]
Out[8]: {
    'AIDS virus from mother to child during delivery': 'Yes',
    'AIDS virus from mother to child during pregnancy': 'DK',
    'AIDS virus from mother to child through breastfeeding': 'DK',
    'Age': '25-29',
    'Age at first marriage/union': '29',...

Tip

One easy way to determine whether you have the proper indentation for your function is to look at other lines with the same indentation. Always ask yourself: What other code logically goes with this step? When should I move on to the next step in the process?

There isn’t always only one good solution to data cleanup problems, so let’s see if we can solve our unreadable header problem another way using a different technique.

Zipping questions and answers

Another way to fix the label problem is to use Python’s zip method:

from csv import reader 

data_rdr = reader(open('data/unicef/mn.csv', 'rb'))
header_rdr = reader(open('data/unicef/mn_headers.csv', 'rb'))

data_rows = [d for d in data_rdr]
header_rows = [h for h in header_rdr]

print len(data_rows[0]) 
print len(header_rows)

: This time, instead of using DictReader, we use the simple reader class. The simple reader creates a list for each row, rather than a dictionary. Because we want to use zip, we need lists instead of dictionaries, so we can zip the list of header values with the list of data values.
: These lines create lists for our header and data readers and print them to see if they are the same length.

Oops—our printed length output shows we have a mismatch in the length of our data and headers! Our data shows only 159 rows while our header list shows we have 210 possible headers. This likely means MICS uses more questions for other countries or provides more questions to choose from than we have in our Zimbabwe dataset.

We need to further investigate which headers are used in the dataset and which ones we can leave behind. Let’s take a closer look to find which ones don’t align properly:

In [22]: data_rows[0]
Out[22]: ['',
 'HH1',
 'HH2',
 'LN',
 'MWM1',
 'MWM2',
 'MWM4',
 'MWM5',
 'MWM6D',
 'MWM6M',
 'MWM6Y',
... ]

In [23]: header_rows[:2]
Out[23]: [
 ['Name', 'Label', 'Question'],
 ['HH1', 'Cluster number', '']]

OK, so we can clearly see here we need to match the data_rows second row with the first index of the header_rows. Once we identify which ones don’t match, we want to toss them out of the header_rows so we can zip the data properly:

bad_rows = []

for h in header_rows:
    if h[0] not in data_rows[0]: 
        bad_rows.append(h) 

for h in bad_rows:
    header_rows.remove(h) 

print len(header_rows)

: Tests if the first element of the header row (the shorthand version of the header) is in the first row of the data (all shortened headers).
: Appends the rows identified as having mismatched headers to our new list, bad_rows. We use this in the next step to identify rows to remove.
: Uses the list’s remove method to remove a particular row of data from a list. This method is often useful in situations where you can identify one specific row (or set of rows) you want to remove from a list.

Aha! So now we can see we are nearly matching. We have 159 values in our data rows and 150 values in our header list. Now let’s see if we can figure out why we don’t have those nine matching headers in our header list:

all_short_headers = [h[0] for h in header_rows] 

for header in data_rows[0]: 
    if header not in all_short_headers: 
        print 'mismatch!', header

: Uses Python list comprehension to make a list of all the short headers by collecting only the first element of each header row.
: Iterates over the headers in our dataset to see which ones don’t align with our cleaned header list.
: Singles out the headers that don’t match from our abbreviated list.
: Uses print to display the mismatches. If you need a quick way to print two strings on the same line, you can simply use a , in between them to concatenate the strings with a space.

When you run this code, your output should look something like:

mismatch!
mismatch! MDV1F
mismatch! MTA8E
mismatch! mwelevel
mismatch! mnweight
mismatch! wscoreu
mismatch! windex5u
mismatch! wscorer
mismatch! windex5r

From the output and our current knowledge of the data, we can see that only a few of the mismatched headers (those with capitals) are ones we might want to fix. The lowercase titles are used for UNICEF internal methodology and don’t line up with questions we have for our own investigation.

Because the MDV1F and MTA8E variables were not found with the web scraper we built to collect headers from the World Bank site, we need to investigate what they mean using our SPSS viewer. (The other option is to drop these rows and move on.)

Tip

When you are dealing with raw data, sometimes you’ll find that getting it into a usable form means dropping data you don’t need or data that’s difficult to clean. In the end, the determining factor should not be sloth, but instead whether the data is essential to your questions.

After opening the SPSS viewer, we can see MDV1F matches the label “If she commits infidelity: wife beating justified” and matches up with another longer set of questions regarding domestic abuse. We have other questions related to relationship abuse, so it’s probably a good idea to include this. Investigating the MTA8E header shows it matches up with a different series of questions, about which type of tobacco is smoked by the person. We have added both to a new file, mn_headers_updated.csv.

Now we can retry the original code, this time using our updated headers file:

Let’s look at it all together and make a few changes so we can try zipping our headers and data together. The following script requires a lot of memory, so if you have less than 4GB RAM, we recommend running it in an IPython terminal or notebook to help mitigate segmentation faults:

from csv import reader

data_rdr = reader(open('data/unicef/mn.csv', 'rb'))
header_rdr = reader(open('data/unicef/mn_headers_updated.csv', 'rb'))

data_rows = [d for d in data_rdr]
header_rows = [h for h in header_rdr if h[0] in data_rows[0]] 

print len(header_rows)

all_short_headers = [h[0] for h in header_rows]

skip_index = [] 

for header in data_rows[0]:
    if header not in all_short_headers:
        index = data_rows[0].index(header) 
        skip_index.append(index)

new_data = []

for row in data_rows[1:]: 
    new_row = []
    for i, d in enumerate(row): 
        if i not in skip_index: 
            new_row.append(d)
    new_data.append(new_row) 

zipped_data = []

for drow in new_data:
    zipped_data.append(zip(header_rows, drow))

: Uses list comprehension to quickly remove mismatched headers. As you can see, we can also use an if statement inside a list comprehension. Here, the code makes a list of rows from the header rows list, as long as the first header row element (abbreviated header) is in the headers from the data rows.
: Creates a list to hold the indexes of data rows whose data we aren’t interested in keeping.
: Utilizes the Python list’s index method to return the indexes we should skip because the headers aren’t in the abbreviated list. The next line will save the indexes of the data rows that don’t match our headers, so we can skip collecting that data.
: Slices the list holding the survey data to include only data rows (all rows except the first row) and then iterates through them.
: Uses the enumerate function to isolate the indexes of the data rows to skip. This function takes an iterable object (here, the data row list) and returns the numeric index and value for each item. It assigns the first value (index) to i and assigns the data value to d.
: Tests to make sure the index is not in the list we want to skip.
: After going through each item (or “column”) in the data row, adds the new entry to the new_data list.
: Zips each row (now exactly matched with header and data) and adds it to a new array, zipped_data.

We can now print out a row of our new dataset and see if we have what we expected:

In [40]: zipped_data[0]
Out[40]: [(['HH1', 'Cluster number', ''], '1'),
(['HH2', 'Household number', ''], '17'),
(['LN', 'Line number', ''], '1'),
(['MWM1', 'Cluster number', ''], '1'),
(['MWM2', 'Household number', ''], '17'),
(['MWM4', "Man's line number", ''], '1'),
(['MWM5', 'Interviewer number', ''], '14'),
(['MWM6D', 'Day of interview', ''], '7'),
(['MWM6M', 'Month of interview', ''], '4'),
(['MWM6Y', 'Year of interview', ''], '2014'),
(['MWM7', "Result of man's interview", ''], 'Completed'),
(['MWM8', 'Field editor', ''], '2'),
(['MWM9', 'Data entry clerk', ''], '20'),
(['MWM10H', 'Start of interview - Hour', ''], '17'),
....

We have all of the questions and answers together in tuples, and every row has all of the matching data with headers. To be certain we have everything correct, let’s take a look at the end of that row:

 (['TN11', 'Persons slept under mosquito net last night',
 'Did anyone sleep under this mosquito net last night?'], 'NA'),
 (['TN12_1', 'Person 1 who slept under net',
 'Who slept under this mosquito net last night?'], 'Currently married/in union'),
 (['TN12_2', 'Person 2 who slept under net',
 'Who slept under this mosquito net last night?'], '0'),

This looks strange. It seems like we might have some mismatches. Let’s do a reality check and use our newly learned zip method to see if our headers match up properly:

data_headers = []

for i, header in enumerate(data_rows[0]): 
    if i not in skip_index: 
        data_headers.append(header)

header_match = zip(data_headers, all_short_headers) 

print header_match

: Iterates over the headers in the data list.
: By using if...not in..., returns True only for indexes not included in skip_index.
: Zips the new lists of headers together so we can visually check for the mismatch.

Aha! Did you see the error?

....
 ('MHA26', 'MHA26'),
 ('MHA27', 'MHA27'),
 ('MMC1', 'MTA1'),
 ('MMC2', 'MTA2'),
....

Everything is matching until this point, when it appears our header file and data file diverge in regards to question order. Because the zip method expects everything to appear in the same order, we must reorder our headers to match the dataset before we can use it. Here’s our new try at matching our data:

from csv import reader

data_rdr = reader(open('data/unicef/mn.csv', 'rb'))
header_rdr = reader(open('data/unicef/mn_headers_updated.csv', 'rb'))

data_rows = [d for d in data_rdr]
header_rows = [h for h in header_rdr if h[0] in data_rows[0]]

all_short_headers = [h[0] for h in header_rows]

skip_index = []
final_header_rows = [] 

for header in data_rows[0]:
    if header not in all_short_headers:
        index = data_rows[0].index(header)
        skip_index.append(index)
    else: 
        for head in header_rows: 
            if head[0] == header: 
                final_header_rows.append(head)
                break 

new_data = []

for row in data_rows[1:]:
    new_row = []
    for i, d in enumerate(row):
        if i not in skip_index:
            new_row.append(d)
    new_data.append(new_row)

zipped_data = []

for drow in new_data:
    zipped_data.append(zip(final_header_rows, drow))

: Makes a new list to contain the final properly ordered header rows.
: Uses an else statement to include only columns where we have a match.
: Iterates over our header_rows until there’s a match.
: Tests the short header to see if the question lines up. We use == to test for a match.
: Uses break to exit the for head in header_rows loop once a match is found. This makes it faster and doesn’t hurt the outcome.
: Zips our new final_header_rows list with the header rows in the proper order.

After running our new code, we want to take a look at the end of our first entry:

(['TN12_3', 'Person 3 who slept under net',
'Who slept under this mosquito net last night?'], 'NA'),
(['TN12_4', 'Person 4 who slept under net',
'Who slept under this mosquito net last night?'], 'NA'),
(['HH6', 'Area', ''], 'Urban'),
(['HH7', 'Region', ''], 'Bulawayo'),
(['MWDOI', 'Date of interview women (CMC)', ''], '1372'),
(['MWDOB', 'Date of birth of woman (CMC)', ''], '1013'),
(['MWAGE', 'Age', ''], '25-29'),

This looks like a good match. We can likely improve the clarity of our code; however, we have found a good way to preserve most of the data and zip our data together, and it works relatively fast.

Tip

You will always need to evaluate how complete you need the data to be and what level of effort fits your project’s cleanup needs. If you are only using one part of the data, it’s likely you don’t need to retain it all. If the dataset is your primary research source, it’s worth more time and effort to keep it complete.

In this section, we’ve learned some new tools and methods to identify what’s wrong or needs cleaning and implemented fixes by combining the Python we know with our own problem-solving techniques. Our first data cleanup effort (replacing the header text) preserved fewer columns and didn’t show us we had some missing headers. However, as long as the resulting dataset had the columns we needed, this would be sufficient, and it was faster and required less code.

Think about these types of issues as you clean your data. Is it essential that you have all the data? If so, how many hours is it worth? Is there an easy way to preserve everything you need and still clean it up properly? Is there a repeatable way? These questions will help guide you in cleaning up your datasets.

Now that we have a good list of data to work with, we’ll move on to other types of cleanup.

Formatting Data

One of the most common forms of data cleanup is getting your unreadable or hard-to-read data and data types to fit a proper readable format. Especially if you need to create reports with the data or downloadable files, you’ll want to make sure it goes from being machine readable to human readable. And if your data needs to be used alongside APIs, you might need specially formatted data types.

Python gives us a ton of ways to format strings and numbers. We used %r, which shows the Python representation of the object in a string or Unicode, in Chapter 5 to debug and show our results. Python also has string formatters %s and %d, which represent strings and digits, respectively. We often use these in conjuction with the print command.

A more advanced way to turn objects into strings or Python representations is to utilize the format method. As clarified in the Python documentation, this method lets us define a string and pass the data as arguments or keyword arguments into the string. Let’s take a closer look at format:

for x in zipped_data[0]:
    print 'Question: {}\nAnswer: {}'.format( 
        x[0], x[1])

: format uses {} to represent where to put the data and the \n newline character to create breaks between the lines.
: Here, we pass the first and second values of the question and answer tuple.

You should see something like this:

Question: ['MMT9', 'Ever used Internet', 'Have you ever used the Internet?']
Answer: Yes
Question: ['MMT10', 'Internet usage in the last 12 months',
'In the last 12 months, have you used the Internet?']
Answer: Yes

This is fairly difficult to read. Let’s try cleaning it up a bit more. From our output, we can see the question tuple has an abbreviation as the 0-index value and a description of the question as the 1-index value. We’d like to just use the second part of the array, which gives us a nice title. Let’s try again:

for x in zipped_data[0]:
    print 'Question: {[1]}\nAnswer: {}'.format(
        x[0], x[1])

: This time we use the ability to single out the index in the format syntax 1, making the output more readable.

Let’s see what output we get:

Question: Frequency of reading newspaper or magazine
Answer: Almost every day
Question: Frequency of listening to the radio
Answer: At least once a week
Question: Frequency of watching TV
Answer: Less than once a week

Now our output is readable. Hooray! Let’s take a look at a few of the other options available with the format method. Our current dataset doesn’t have a large amount of numeric data, so we’ll just use some example numbers to show more formatting options for different numerical types:

example_dict = {
    'float_number': 1324.321325493,
    'very_large_integer': 43890923148390284,
    'percentage': .324,
}

string_to_print = "float: {float_number:.4f}\n" 
string_to_print += "integer: {very_large_integer:,}\n" 
string_to_print += "percentage: {percentage:.2%}" 

print string_to_print.format(**example_dict)

: Uses a dictionary and accesses the values of the dictionary using the keys. We use a : to separate the key name and the pattern. Passing .4f tells Python to make the number a float (f) and show the first four decimal digits (.4).
: Uses the same format (with the key name and colon) and inserts commas (,) to separate thousands.
: Uses the same format (with the key name and colon) but inserts a percentage (%) and shows the first two significant decimal digits (.2).
: Passes our data dictionary into the format method called on the long string and uses ** to unpack the dictionary. Unpacking a Python dictionary will send the key/value pairs in expanded form; here, the unpacked keys and values are sent to the format method.

Note

For more advanced formatting, such as removing unnecessary spaces, aligning the data by length, and performing math equations in the format method, read the Python formatting documentation and examples.

Aside from strings and numbers, Python allows us to easily format dates. Python’s datetime module has methods to format dates you already have (or generate) in Python as well as to read in any date formats and create Python date, datetime, and time objects.

Tip

The most commonly used methods to format dates in Python or make strings into dates are strformat and strpformat, and the formatting might be recognizable if you have used date formatting in other languages. For more information, read the “strftime and strptime Behavior” documentation.

The datetime module’s strptime method allows you to use strings or numbers to create a Python datetime object. This is great if you want to save the date and time to a database or you need to modify the time zone or add an hour. By turning it into a Python object, you can harness the power of Python’s date capabilities and easily turn it back into a human- or machine-readable string later.

Let’s take a look at our data holding interview start and end times from our zipped_data list. To refresh our memories, let’s print some of our first entry to make sure we know what data entries we need to use:

for x in enumerate(zipped_data[0][:20]): 
    print x

.....
(7, (['MWM6D', 'Day of interview', ''], '7'))
(8, (['MWM6M', 'Month of interview', ''], '4'))
(9, (['MWM6Y', 'Year of interview', ''], '2014'))
(10, (['MWM7', "Result of man's interview", ''], 'Completed'))
(11, (['MWM8', 'Field editor', ''], '2'))
(12, (['MWM9', 'Data entry clerk', ''], '20'))
(13, (['MWM10H', 'Start of interview - Hour', ''], '17'))
(14, (['MWM10M', 'Start of interview - Minutes', ''], '59'))
(15, (['MWM11H', 'End of interview - Hour', ''], '18'))
(16, (['MWM11M', 'End of interview - Minutes', ''], '7'))

: Utilizes Python’s enumerate function to see which lines from the data we will need to evaluate.

We now have all the data we need to figure out exactly when the interview started and ended. We could use data like this to determine things whether interviews in the evening or morning were more likely to be completed, and whether the length of the interview affected the number of responses. We can also determine which was the first interview and the last interview and calculate average duration.

Let’s try importing the data into Python datetime objects using strptime:

from datetime import datetime

start_string = '{}/{}/{} {}:{}'.format( 
    zipped_data[0][8][1], zipped_data[0][7][1], zipped_data[0][9][1], 
    zipped_data[0][13][1], zipped_data[0][14][1])

print start_string

start_time = datetime.strptime(start_string, '%m/%d/%Y %H:%M') 

print start_time

: Creates a base string to parse all of the data from the many entries. This code uses American-style date strings formatted with the month, day, year, and then hour and minute.
: Accesses the following format: zipped_data[first data entry][data number row (derived from enumerate)][just the data itself]. Using just the first entry to test, the row at index 8 is the month, the row at index 7 is the day, and the row at index 9 is the year. The second element ([1]) of each tuple is the data.
: Calls the strptime method with a date string and a pattern string using the syntax defined in the Python documentation. %m/%d/%Y is the month, day, year, and %H:%M is the hour and minute. The method returns a Python datetime object.

Note

If you are using IPython to run code, you need not use print to show every line you are interested in viewing. It’s common practice to instead just type the variable name and view the output in the interactive terminal. You can even use Tab to autocomplete.

With our code, we created a common date string and parsed it using datetime’s strptime method. Because each element of the time data is a separate item in our dataset, we could also natively create Python datetime objects without using strptime. Let’s take a look:

from datetime import datetime

end_time = datetime( 
    int(zipped_data[0][9][1]), int(zipped_data[0][8][1]), 
    int(zipped_data[0][7][1]), int(zipped_data[0][15][1]),
    int(zipped_data[0][16][1]))

print end_time

: Uses the datetime class in Python’s datetime module to pass integers directly to form a date object. We pass them as arguments, using commas to separate the elements.
: Because datetime expects integers, this code converts all of our data to integers. The order datetime expects data in is year, month, day, hour, minute, so we must order the data accordingly.

As you can see, with fewer lines of code (in this case) we were able to get the end time of the interview in a Python datetime object. We now have two datetime objects, so let’s do some math with them!

duration = end_time - start_time 

print duration 

print duration.days 

print duration.total_seconds() 

minutes = duration.total_seconds() / 60.0 

print minutes

: Calculates duration by subtracting the start time from the end time.
: Prints a new Python date type. This is a timedelta object. As described in the datetime documentation, timedeltas show differences between two time objects and are used to change or alter time objects.
: Uses timedelta’s built-in days attribute to see how many days the delta spans.
: Calls timedelta’s total_seconds method to calculate the time difference in seconds. This also counts microseconds.
: Calculates the minutes, as timedelta has no minutes attribute.

In running our code, we saw the first interview lasted 8 minutes—but do we know if that is on average how long the interviews last? This is something we can figure out by parsing through the entire dataset using our new datetime skills. We’ve done some simple datetime math and figured out how to create Python datetime objects from our dataset. Now let’s see if we can convert these new datetime objects back into formatted strings for use in a human-readable report:

print end_time.strftime('%m/%d/%Y %H:%M:%S') 

print start_time.ctime() 

print start_time.strftime('%Y-%m-%dT%H:%M:%S')

: strftime requires only one argument, the date pattern you would like to show. This line outputs the standard American time format.
: Python’s datetime objects have a ctime method that outputs the datetime object according to C’s ctime standard.
: Python’s datetime objects can output the string in any way you might wish. This code uses a format often used by PHP. If you need to interact with APIs requiring a special string format, datetime can help.

Python’s datetime objects are incredibly useful and very easy to manipulate, import, and export (via formatting). Depending on your dataset, you can use these new techniques to import and convert all of your string or Excel data into datetime objects, run statistics or averages on them, and then convert them back into strings for your reporting.

We’ve learned numerous formatting tips and tricks. Now let’s begin some more intensive cleanup. We’ll review how to easily find bad seeds in your data and what to do about them.

Finding Outliers and Bad Data

Identifying outliers and bad data in your dataset is probably one of the most difficult parts of data cleanup, and it takes time to get right. Even if you have a deep understanding of statistics and how outliers might affect your data, it’s always a topic to explore cautiously.

Warning

You want to clean your data, not manipulate or change it, so spend some extra time determining how to handle outliers or bad records when considering their removal. You should be very explicit in your final conclusions if you removed outliers to help normalize your data.

We’ll review more ways to find outliers in Chapter 9, but let’s chat about some easy ways to check if you have bad data in your dataset.

Your first clues about data validity come from your source. As we talked about in Chapter 6, you want to ensure your source is properly vetted and you can trust the data. You’ll want to have asked the source how the data was collected and if the data has already been cleaned or processed.

For the samples we are using here, we know that UNICEF surveys follow a standard format of questions. We know they perform these censuses at regular intervals. We also know they have a standard protocol for training their workers on how to properly conduct the interviews. These are all good signs that the data is a proper sample and not a pre-selected sample. If, instead, we found out that UNICEF only interviewed families in large cities and ignored the rural population, this might result in a selection bias or sampling error. Depending on your sources, you should determine what biases your dataset might have.

Tip

You can’t always get perfect data. However, you should be aware of what sampling biases your data might have and ensure you don’t make sweeping claims based on datasets that might not represent the entire story or population.

Moving on from source and data bias, you can find potential errors in your data by asking, “Do any of these data points not fit?” One easy way to tell if you have improper data values is to see if there are errors in the data values you have. For example, you can look through a dataset and see if an important value is missing. You can also look through a dataset and determine if the types (e.g., integer, date, string) don’t properly match up. Let’s take a look at some of these problems in our dataset by attempting to locate some missing data:

for answer in zipped_data[0]: 
    if not answer[1]: 
        print answer

: Iterates over all of the rows of our first entry.
: Tests whether a value “exists.” We know the values are the second entries in the tuples, and we know we can use an if not statement to test this.

From the output of our code, we can see we don’t have any obvious missing data in our first row. How can we test our entire dataset?

for row in zipped_data: 
    for answer in row: 
        if answer[1] is None: 
            print answer

: This time, we loop over every row in our dataset instead of just the first entry.
: We remove the [0] from our previous example, as we have each row as its own loop.
: For example’s sake, here we test if we see any None types. This will tell us if there are null data points, but won’t tell us if we have zeros or empty strings.

We can see we don’t have any obvious missing data in our entire dataset, but let’s take a cursory look at some of our data to see if there are more difficult-to-discern bits of missing data. From our earlier prints, you might remember the usage of NA representing Not Applicable.

Note

Although this is not missing data, we might want to know exactly how many NA answers we have, or if certain questions have an overrepresentation of these answers. If the sample is too small—i.e., if there is a preponderance of NA responses—we probably want to avoid making any larger conclusions based on the available data. If the majority of responses are NA, though, we might find that interesting (why was that question not applicable to the majority of the group?).

Let’s see if there is a preponderance of NA answers for any specific questions:

na_count = {} 

for row in zipped_data:
    for resp in row:
        question = resp[0][1] 
        answer = resp[1]
        if answer == 'NA': 
            if question in na_count.keys(): 
                na_count[question] += 1 
            else:
                na_count[question] = 1 

print na_count

: Defines a dictionary to keep track of questions with NA responses. Keeping the data in a hashed object (like a dictionary) allows Python to quickly and easily query the members. The questions will be the keys and the values will hold the count:
: Stores the second entry from the first part of the tuple (the description of the question) in question. The first entry ([0]) is the shorthand title and the last entry ([2]) is the question the surveyors asked, which is not always available.
: Uses Python’s equivalency test to find NA responses. If we cared about more than one way to write NA, we might use something like if answer in ["NA", "na", "n/a"]: to accept a variety of written responses with the same meaning.
: Tests if this question is already in the dictionary by testing if it is in the keys of the dictionary.
: If the question is already in the keys, this code adds 1 to the value using Python’s += method.
: If it is not a member of the dictionary yet, this code adds it to the dictionary and sets its count value to 1.

Wow! There are quite a few NA responses in our dataset. We have approximately 9,000 rows of data, and some of these questions have more than 8,000 NA responses. It’s possible these questions are not relevant to the demographic or age group surveyed or don’t resonate with the particular nation and culture. Regardless, there is little sense in using the NA questions to draw any sort of larger conclusions about the population surveys.

Finding things like the NA values in your dataset can be very useful in determining whether that dataset is appropriate for your research needs. If you find that the questions you need answered have an overwhelming amount of NA-style responses, you might have to keep looking for another source of data or rethink your questions.

We’ve covered missing data; now let’s see if we can find any type outliers. A type outlier might be present, for example, if a year entry holds a string like 'missing' or 'NA'. If we see just a few data types that don’t align, we might be dealing with outliers or a few instances of bad data. If we see that a large portion of them don’t align, we might want to rethink using that data or determine why they seem to match a “bad data” pattern.

If we can easily account for the inconsistencies (e.g., because this answer only applies to women and the survey sample is mixed-gender), then we can include the data. If there is no clear explanation, and the question is significant and important to our result, we will have to keep investigating our current dataset or begin to look for other datasets that might explain the misalignment.

We’ll talk more about finding outliers in Chapter 9, but for now let’s take a quick look at analyzing data types and see if we can spot any obvious inconsistencies in our current dataset. For example, we should verify that answers we always expect to be numbers (like year of birth) are the right data type.

Let’s take a quick look at the distribution of types in our responses. We’ll use some of the same code we used for counting NA responses but this time we’ll take a look at data types:

datatypes = {} 

start_dict = {'digit': 0, 'boolean': 0,
              'empty': 0, 'time_related': 0,
              'text': 0, 'unknown': 0
              } 

for row in zipped_data:
    for resp in row:
        question = resp[0][1]
        answer = resp[1]
        key = 'unknown' 
        if answer.isdigit(): 
            key = 'digit'
        elif answer in ['Yes', 'No', 'True', 'False']: 
            key = 'boolean'
        elif answer.isspace(): 
            key = 'empty'
        elif answer.find('/') > 0 or answer.find(':') > 0: 
            key = 'time_related'
        elif answer.isalpha(): 
            key = 'text'
        if question not in datatypes.keys(): 
            datatypes[question] = start_dict.copy() 
        datatypes[question][key] += 1 

print datatypes

: The first line initializes a dictionary, because it’s a fast, reliable way to store data on a question-by-question level.
: This sets up a start_dict to ensure the same data exists for each question in our dataset. The dictionary will contain all of our possible guesses as to data type so we can easily compare.
: Here, we set a variable key with the default value unknown. If the key variable is not updated in one of the following if or elif statements, it remains unknown.
: Python’s string class has many methods to help determine type. Here, we use the isdigit method: this line returns True if the string holds a digit.
: To determine if the data relates to Boolean logic, here we test if the answer is in the list of Boolean-based responses, including Yes/No and True/False. Although we could create a more comprehensive test, this is a good starting point.
: The Python string class’s isspace method returns True if the string contains only spaces.
: The string’s find method returns the index of the first match. If it finds no match in the string, it returns -1. This code tests for both / and :, which are commonly used in time strings. This is not a comprehensive check, but should give us an initial idea.
: The string’s isalpha method returns True if the string contains only alphabetic characters.
: As in the code for counting NA responses, here we test if the question is in the keys of the datatypes dictionary.
: If the question is not in the datatypes dictionary, this code adds it and saves a copy of the start_dict as the value. The dictionary’s copy method creates a separate dictionary object for each entry. If we assigned start_dict to each question, we would end up counting the lump sum in one dictionary, rather than starting with a new dictionary for every question.
: This adds 1 to the value of the key we found. So, for each question and response, we have a “guess” about the type.

In our results, we can already see some variance! Some of the question and answer sets have significant representation in one “type”, while others have a variety of type guesses. We can use these as a starting point, as these are just rough guesses.

One way we can start to use this new information is to find questions with a vast majority of digit-type responses and see what the values of the non-digit responses are. We would likely expect those to be either NA or improperly inserted values. We can move to normalize those values if they relate to questions we care about. One way to do that is to substitute NA values or erroneous values with None or null values. This can be useful if you are going to run statistical methods on the columns in question.

Note

As you continue to work with your dataset, you will find anomalies in data types or NA responses. How best to handle these inconsistencies depends on your knowledge of the topic and dataset as well as what questions you are trying to answer. If you are combining datasets, you can sometimes throw out these outliers and bad data patterns; however, be wary of overlooking minor trends.

Now that we’ve started recognizing outliers and outlier patterns in our data, let’s work on eliminating bad data we may even have created ourselves—duplicates.

Finding Duplicates

If you are using more than one dataset with the same survey data or if you have used raw data that may have duplicate entries, removing duplicate data will be an important step in ensuring your data can be accurately used. If you have a dataset with unique identifiers, you can use those IDs to ensure you haven’t accidentally inserted or acquired duplicate data. If you do not have an indexed dataset, you might need to figure out a good way to identify each unique entry (such as creating an indexable key).

Python has some great ways to identify uniques in the built-in library. Let’s introduce some of the concepts here:

list_with_dupes = [1, 5, 6, 2, 5, 6, 8, 3, 8, 3, 3, 7, 9]

set_without_dupes = set(list_with_dupes)

print set_without_dupes

Your output should look something like this:

{1, 2, 3, 5, 6, 7, 8, 9}

What’s happening here? Set and frozenset are Python built-in types which allow us to take an iterable object (like a list, or a string, or a tuple) and create a set containing only the unique values.

Warning

In order to use set and frozenset, the values need to be hashable. With hashable types, we can apply a hash method and the resulting value will always be the same. This means, for example, that we can trust a 3 is the same as every other 3 we see in code.

Most Python objects are hashable—only lists and dictionaries are not. We can create sets using set with any collection of hashable types (integers, floats, decimals, strings, tuples, etc.). The other neat thing about sets and frozensets is they have some fast comparison properties. Let’s take a look at some examples:

first_set = set([1, 5, 6, 2, 6, 3, 6, 7, 3, 7, 9, 10, 321, 54, 654, 432])

second_set = set([4, 6, 7, 432, 6, 7, 4, 9, 0])

print first_set.intersection(second_set) 

print first_set.union(second_set) 

print first_set.difference(second_set) 

print second_set - first_set 

print 6 in second_set 

print 0 in first_set

: The intersection method of a set returns the intersection between two sets (i.e., the elements held in commom). A built-in Venn diagram!
: The union method of a set combines the values of the first set and the second set.
: The difference method shows the difference between the first set and the second set. Order of operations matters, as you’ll see in the next line.
: Subtracting one set from another shows the difference between them. Changing the order of the difference sets changes the order of the result (just like in math).
: in tests set membership (with very fast performance).

Your output should look like this:

set([432, 9, 6, 7])
set([0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 321, 432, 654, 54])
set([1, 2, 3, 5, 321, 10, 654, 54])
set([0, 4])
True
False

Sets have quite a lot of useful features for defining unique datasets and comparing sets. There are many times in data wrangling where we need to know the minimum and maximum of a series of values, or we need a union of unique keys. Sets can help us with those tasks.

Aside from sets, Python has several other libraries with easy ways to test uniqueness. One library you can use for uniqueness is numpy, a powerful mathmatics library for Python with scientific and statistical methods and classes. numpy has superior array, numerical, and mathematical capabilities compared to the core Python libraries. It also has a great method called unique used with a numpy array. You can install numpy like so:

pip install numpy

Let’s take a look at how numpy’s unique works:

import numpy as np

list_with_dupes = [1, 5, 6, 2, 5, 6, 8, 3, 8, 3, 3, 7, 9]

print np.unique(list_with_dupes, return_index=True) 

array_with_dupes = np.array([[1, 5, 7, 3, 9, 11, 23], [2, 4, 6, 8, 2, 8, 4]]) 

print np.unique(array_with_dupes)

: Numpy’s unique method keeps track of the indexes. Passing return_index=True results in a tuple of arrays: the first is an array of the unique values, and the second is a flattened array of the indexes—only the first occurrence of every number will be present.
: To show more numpy, this line creates a numpy matrix. This is an array of arrays (equally sized).
: unique creates a unique set out of a matrix.

Your output will look like this:

(array([1, 2, 3, 5, 6, 7, 8, 9]), array([ 0,  3,  7,  1,  2, 11,  6, 12]))
[ 1  2  3  4  5  6  7  8  9 11 23]

If you don’t have unique keys, you can write a function to create a unique set. It can be as simple as using list comprehension. Let’s try one using Python’s sets for our dataset. First, we determine a unique number by taking a look at which data in our dataset is unique:

for x in enumerate(zipped_data[0]):
    print x

.....

(0, (['HH1', 'Cluster number', ''], '1'))
(1, (['HH2', 'Household number', ''], '17'))
(2, (['LN', 'Line number', ''], '1'))
(3, (['MWM1', 'Cluster number', ''], '1'))
(4, (['MWM2', 'Household number', ''], '17'))
(5, (['MWM4', "Man's line number", ''], '1'))

We see the first five elements of each row have some presumably unique identifiers. Assuming we are understanding the data properly, the cluster, household, and man’s line numbers should create a unique combination. It’s possible the line numbers are unique as well. Let’s see if we are correct:

set_of_lines = set([x[2][1] for x in zipped_data]) 

uniques = [x for x in zipped_data if not set_of_lines.remove(x[2][1])] 

print set_of_lines

: First, we make a set containing the line numbers of the survey. The line number is the third element in each response and the value is the second element of that row (x[2][1]). We use list comprehension to speed up our code.
: set_of_lines now holds the unique keys. We can use the set object’s remove method to see if we have more than one of each of those keys in the dataset. If the line number is unique, it will remove each key only once. If we have duplicates, remove will throw a KeyError to let us know that key is no longer in our set.

Hmm. We did see an error when we ran the code, so we were wrong in our assumption the line numbers were unique. If we take a closer look at the set we made, it looks like line numbers go from 1–16 and then repeat.

Note

You’ll often have to work with messy datasets, or datasets similar to this one with no obvious unique key. Our suggestion in times like these is to determine a good way to find a unique key and then use that as a comparison.

We have numerous options for creating a unique key. We could use the start time of the interview. However, we aren’t sure whether UNICEF deploys many survey teams at once; and if so, we could remove items marked as duplicates which are actually not duplicates. We could use the birth date of the man combined with the time of the interview, as that’s unlikely to have matches, but if we had any missing fields that could be problematic.

One nice solution is to see if a combination of the cluster, household, and line numbers creates a unique key. If so, then we could use this method across datasets—even ones without a start and end time. Let’s give it a try!

set_of_keys = set([
    '%s-%s-%s' % (x[0][1], x[1][1], x[2][1]) for x in zipped_data]) 

uniques = [x for x in zipped_data if not set_of_keys.remove(
    '%s-%s-%s' % (x[0][1], x[1][1], x[2][1]))] 

print len(set_of_keys)

: Makes a string out of the three parts we think are unique: the cluster number, the household number, and the line number. We are separating each with a - so we can differentiate between the three values.
: Re-creates the unique key we used, and uses the remove feature. This will remove those entries one by one, and the uniques list will hold every unique row. If there is a duplicate entry, our code will again throw an error.
: Evaluates the length our list of unique keys. This can show us how many unique entries we have in our dataset.

Super! This time we have no errors. We can see by the length of our list that each row is a unique entry. This is what we would expect from a processed dataset, as UNICEF does some data cleanup before publishing and ensures there are no duplicates. If we were combining this data with other UNICEF data, we might add M in our key because it’s the men’s group survey. We could then cross reference households that carry the same numbers.

Depending on your data, the unique key might not be obvious. Birth dates and addresses might be a good combination. The chances there are two 24-year-old women at the same address with the exact same date of birth are slim, although not out of the question if they are twins who live together!

Moving on from duplicates, we’ll now take a look at fuzzy matching, a great way to find duplicates in particularly noisy datasets.

Fuzzy Matching

If you are using more than one dataset or unclean, unstandardized data, you might use fuzzy matching to find and combine duplicates. Fuzzy matching allows you to determine if two items (usually strings) are “the same.” While not as in-depth as using natural language processing or machine learning to determine a match with big datasets on language, fuzzy matching can help us relate “My dog & I” and “me and my dog” as having similar meaning.

There are many ways to go about fuzzy matching. One Python library, developed by SeatGeek, uses some pretty cool methods internally to match tickets being sold online for different events. You can install it by using:

pip install fuzzywuzzy

So let’s say you’re dealing with some unclean data. Maybe it was input sloppily or is user-entered and contains misspellings and small syntactic errors or deviations. How might you account for that?

from fuzzywuzzy import fuzz

my_records = [{'favorite_book': 'Grapes of Wrath',
               'favorite_movie': 'Free Willie',
               'favorite_show': 'Two Broke Girls',
              },
              {'favorite_book': 'The Grapes of Wrath',
               'favorite_movie': 'Free Willy',
               'favorite_show': '2 Broke Girls',
              }]


print fuzz.ratio(my_records[0].get('favorite_book'),
                 my_records[1].get('favorite_book')) 

print fuzz.ratio(my_records[0].get('favorite_movie'),
                 my_records[1].get('favorite_movie'))

print fuzz.ratio(my_records[0].get('favorite_show'),
                 my_records[1].get('favorite_show'))

: Here we use the fuzz module’s ratio function, which expects two strings to compare. It returns the similarity of the sequencing of the strings (a value between 1 and 100).

We can see from our own understanding of popular culture and English that these two entries have the same favorites; however, they have spelled them differently. FuzzyWuzzy helps us counter these unintentional mistakes. We can see our matches using ratio scored pretty high. This gives us some level of confidence the strings are similar.

Let’s try another FuzzyWuzzy method and see our results. We’ll use the same data for the sake of simplicity and comparison:

print fuzz.partial_ratio(my_records[0].get('favorite_book'),
                         my_records[1].get('favorite_book')) 

print fuzz.partial_ratio(my_records[0].get('favorite_movie'),
                         my_records[1].get('favorite_movie'))

print fuzz.partial_ratio(my_records[0].get('favorite_show'),
                         my_records[1].get('favorite_show'))

: Here we call the fuzz module’s partial_ratio function, which expects two strings to compare. It returns the similarity of the sequencing of the closest matching substrings (a value between 1 and 100).

Wow, we can see we are getting much higher numbers! The partial_ratio function allows us to compare substrings, which means we don’t need to worry if someone has forgotten a word (like in our book example) or used different punctuation. This means a closer match for all of our strings.

Warning

If your data has some simple inconsistencies, these are some great functions to help find the mismatches. But if your data has some large differences in meaning with a few characters’ difference, you might want to test similarity and difference. For example, “does” and “doesn’t” are quite different in meaning but not very different in spelling. In the first ratio example, these two strings wouldn’t score highly, but in the substring, we would have a match. Knowledge of your data and the complexities therein is a must!

FuzzyWuzzy also has some other cool options. Let’s explore some of them, as they might pertain to your data cleanup needs:

from fuzzywuzzy import fuzz

my_records = [{'favorite_food': 'cheeseburgers with bacon',
               'favorite_drink': 'wine, beer, and tequila',
               'favorite_dessert': 'cheese or cake',
              },
              {'favorite_food': 'burgers with cheese and bacon',
               'favorite_drink': 'beer, wine, and tequila',
               'favorite_dessert': 'cheese cake',
               }]

print fuzz.token_sort_ratio(my_records[0].get('favorite_food'), 
                            my_records[1].get('favorite_food'))

print fuzz.token_sort_ratio(my_records[0].get('favorite_drink'),
                            my_records[1].get('favorite_drink'))

print fuzz.token_sort_ratio(my_records[0].get('favorite_dessert'),
                            my_records[1].get('favorite_dessert'))

: Here we call the fuzz module’s token_sort_ratio function, which allows us to match strings despite word order. This is great for free-form survey data, where “I like dogs and cats” and “I like cats and dogs” mean the same thing. Each string is first sorted and then compared, so if they contain the same words in a different order, they will match.

From our output, we can see using tokens (here, words) gives us a pretty good chance at matching word order differences. Here we see the favorite drink options are the same, just in different orders. We can use this when the order of tokens doesn’t change the meaning. For SeatGeek, “Pittsburgh Steelers vs. New England Patriots” is the same as “New England Patriots vs. Pittsburgh Steelers” (with the exception of home field advantage).

Let’s take a look at another token-oriented function from FuzzyWuzzy, using our same data:

print fuzz.token_set_ratio(my_records[0].get('favorite_food'), 
                           my_records[1].get('favorite_food'))

print fuzz.token_set_ratio(my_records[0].get('favorite_drink'),
                           my_records[1].get('favorite_drink'))

print fuzz.token_set_ratio(my_records[0].get('favorite_dessert'),
                           my_records[1].get('favorite_dessert'))

: Here we use the fuzz module’s token_set_ratio function, which uses the same token approach but compares sets of the tokens to see intersection and difference. The function attempts to find the best possible match of sorted tokens and returns the ratio of similarity for those tokens.

Here we can see an unintended side effect if we are not aware of similarities and differences in our dataset. One answer was improperly spelled. We know cheesecake and cheese are not the same thing, but using the token set approach, these resulted in a false positive. And sadly, we could not properly match our cheeseburger answer, even though it’s the same. Can you do so using another method we’ve already learned?

One final matching method FuzzyWuzzy provides is the process module. This is great if you have a limited amount of choices or options and messy data. Let’s say you know the answer has to be yes, no, maybe, or decline to comment. Let’s take a look at how we can match these up:

from fuzzywuzzy import process

choices = ['Yes', 'No', 'Maybe', 'N/A']

process.extract('ya', choices, limit=2) 

process.extractOne('ya', choices) 

process.extract('nope', choices, limit=2)

process.extractOne('nope', choices)

: Uses FuzzyWuzzy’s extract method to compare strings to the list of possible matches. The function returns two possible matches from the list of choices we have declared in our choices variable.
: Uses FuzzyWuzzy’s extractOne method to return only the best match between our string and the list of available choices.

Aha! Given a variety of words we know “mean” the same thing, process can extract the best guess—and in these cases, the correct guess. With extract we get tuples with the ratios returned, and our code can parse through the string responses and compare how similar or different the matches are. The extractOne function just finds the best match and returns it along with its ratio as a tuple. Depending on your needs, you might opt for extractOne to help you simply find the best match and move along.

You’ve learned all about matching strings now, so let’s talk a little about how to write some similar string matching functions on your own.

RegEx Matching

Fuzzy matching may not always fit your needs. What if you just need to identify a part of a string? What if you are only looking for a phone number or an email address? These are problems you’ll encounter if you are scraping your data (as we will learn about in Chapter 11) or compiling raw data from numerous sources. For a lot of these problems, regular expressions can help.

Regular expressions allow computers to match, find, or eliminate patterns in strings or data defined in the code. Regular expressions, or regex, are often feared by developers since they can become complex and can be difficult to read and understand. However, they can be quite useful, and a basic introduction to them can help you read, write, and understand when regex can help solve your problem.

Despite their fearsome reputation, the basic regex syntax is fairly straightforward and easy to learn. Table 7-1 covers the fundamentals.

Table 7-1. Regex basics
Character/Pattern	Legend	Example match
\w	Any alphanumeric character, including underscores	`a` or `0` or `_`
\d	Any digit	`1` or `2` or `4`
\s	Any whitespace character	`' '`
+	One or more (greedy) of the pattern or character	`\d`+ matches `476373`
\.	The `.` character	.
*	Zero or more (greedy) of the character or pattern (think of this almost as an `if`)	`\d*` matches `03289` and `''`
\|	Either the first pattern, or the next, or the next (like `OR`)	`\d`\|`\w` matches `0` or `a`
[] or ()	Character classes (defining what you expect to see in one character space) and character groupings (defining what you expect to see in a group)	`A` matches `[A-C]` or (A\|B\|C)
-	Binds character groups	`[0-9]`+ matches `\d+`

For more examples, we recommend bookmarking a good regex cheat sheet.

Note

There’s no need, especially as a Python developer, to memorize regex syntax; however, well-written regex can help you in some great ways. With Python’s built-in regex module, re, you can easily search for basic matches and groupings.

Let’s take a look at some of the possibilities regex gives us:

import re

word = '\w+' 
sentence = 'Here is my sentence.'

re.findall(word, sentence) 

search_result = re.search(word, sentence) 

search_result.group() 

match_result = re.match(word, sentence) 

match_result.group()

: Defines a basic pattern of a normal string. This pattern accounts for strings that contain letters and numbers but not spaces or punctuation. It will match until it doesn’t match (i.e., the + makes it greedy! nomnom!).
: The re module’s findall method locates all the pattern matches in a string. Every word we included in our sentence is found, but the period is missing. In this case, we used the pattern \w, so punctuation and spaces are not included.
: The search method allows us to search for a match throughout the string. If a match is found, a match object is returned.
: The match object’s group method returns the matched string.
: The match method searches only from the beginning of the string. This operates differently from search.

We can easily match words in a sentence, and depending on our needs, we can vary how we find them. In this example, we saw findall return a list of all the matches. Let’s say you only want to extract websites from a long text. You can use a regex pattern to find links and then use findall to extract all the links from the text. Or you can find telephone numbers, or dates. The findall method is your go-to tool if you can define what you are looking for in a simple pattern and apply it easily to your string data.

We also used search and match, which in this case, returned the same thing—they both matched the first word in the sentence. We returned a match object, and we were able to access the data using the group method. The group method will also take parameters. Try using .group(0) with one of your matches. What happened? What do you think the 0 means? (Hint: think of lists!)

search and match are actually quite different. Let’s use them in a few more examples so we can see the differences:

import re

number = '\d+' 
capitalized_word = '[A-Z]\w+' 

sentence = 'I have 2 pets: Bear and Bunny.'

search_number = re.search(number, sentence)

search_number.group() 

match_number = re.match(number, sentence)

match_number.group() 

search_capital = re.search(capitalized_word, sentence)

search_capital.group()

match_capital = re.match(capitalized_word, sentence)

match_capital.group()

: Defines a numerical pattern. The plus sign makes it greedy, so it will gobble up all the digits it can until it reaches a non-digit character.
: Defines a capitalized word match. This pattern uses the square brackets to define a part of a longer pattern. These brackets tell our pattern the first letter we want is a capital letter. Following that, we are just looking for a continued word.
: What happens when we call group here? We see the match object returned from our search method.
: What result would you expect to see here? Likely the number, but instead we get an error. Our match returns None, not a match object.

Now we can see the differences between search and match more clearly. We were unable to find a good match with match, despite the fact that we had matches for each of the searches we tried. How come? As mentioned previously, match starts from the very beginning of the string, and if it doesn’t find a match there, it returns None. In contrast, search will keep going until it finds a match. Only if it reaches the very end of the string without finding any matches will it return None. So, if you need to assert that a string matches or begins with an exact pattern, match is your friend. If you are only looking for the first occurrence or any match in the string, then search is the best option.

There was also a quick lesson on regex syntax here—did you catch it? What capitalized word did you expect to find first? Was it “I” or “Bear”? How come we didn’t catch “I”? What pattern would have matched both of them? (Hint: refer to the table and take a look at what wildcard variables you can pass!)

Now that we have a better understanding of regex syntax and how to use it with match, search, and findall, let’s see if we can create some patterns where we need to reference more than one group. In the previous examples, we only had one pattern group, so when we called the group method on a match, we only had one value. With regex, however, you can find more than one pattern, and you can give your found matched groups variable names so it’s easier to read your code and you can be sure you have matched the proper group.

Let’s try it out!

import re

name_regex = '([A-Z]\w+) ([A-Z]\w+)' 

names = "Barack Obama, Ronald Reagan, Nancy Drew"

name_match = re.match(name_regex, names) 

name_match.group()

name_match.groups() 

name_regex = '(?P<first_name>[A-Z]\w+) (?P<last_name>[A-Z]\w+)' 

for name in re.finditer(name_regex, names): 
    print 'Meet {}!'.format(name.group('first_name'))

: Here we use the same capital word syntax twice, putting it in parentheses. Parentheses are used to define groups.
: Here we use the pattern with more than one regex group in our match method. This will now return more than one group if it finds a match.
: The groups method on our match shows a list of all the matches of groups we found.
: Naming our groups helps our code be clear and explicit. In this pattern, the first group is first_name and the second group is last_name.
: finditer is similar to findall, but it returns an iterator. We can view the matches in our string one by one using this iterator.
: Using our knowledge of string formatting, we print out our data. Here we only pull the first name out of each match.

Naming pattern groups using ?P<variable_name> creates code that’s easy to understand. As our example shows, it’s also quite easy to create groups to capture two (or more) particular patterns and their matching data. These techniques help take the guesswork out of reading someone else’s (or your six-month-old) regex. Can you write another example to match middle initials, if there are any?

The power of regex lets you quickly discern what’s in your strings and parse data from your strings easily. They’re invaluable when it comes to parsing really messy datasets, like ones you get from web scraping. For more reading on regex, we recommend trying out the interactive regex parser at RegExr as well as walking through the free Regular-Expressions.info tutorial.

Now that you have many methods to match things, you can easily find duplicates. Let’s review our choices when it comes to duplicates we find in our datasets.

What to Do with Duplicate Records

Depending on the state of your data, you may want to combine your duplicate records. If your dataset simply has duplicate rows, there is no need to worry about preserving the data; it is already a part of the finished dataset and you can merely remove or drop these rows from your cleaned data. If, however, you are combining different sets of data and you wish to preserve pieces of the duplicate entries, you will want to figure out how to best do so using Python.

We will review some comprehensive ways to join data in Chapter 9, using some new libraries. However, you can easily combine rows of data in the original way you parsed them. Let’s walk through an example of how to do so if you are using DictReader to ingest your data. We’ll combine some rows of our male dataset. This time, we want to combine the data based on household, so we can look at the surveys on a house-by-house basis rather than a man-by-man basis:

from csv import DictReader

mn_data_rdr = DictReader(open('data/unicef/mn.csv', 'rb')) 

mn_data = [d for d in mn_data_rdr]

def combine_data_dict(data_rows): 
    data_dict = {} 
    for row in data_rows:
        key = '%s-%s' % (row.get('HH1'), row.get('HH2')) 
        if key in data_dict.keys():
            data_dict[key].append(row) 
        else:
            data_dict[key] = [row] 
    return data_dict 

mn_dict = combine_data_dict(mn_data) 

print len(mn_dict)

: We use the DictReader module so we can easily parse all of the fields we want.
: We define a function so we can reuse it with other UNICEF datasets. We are going to call it combine_data_dict because the function will take data_rows, combine them, and return a dictionary.
: This defines our new data dictionary to return.
: Like we did in our earlier example where we created a unique key from the cluster, household, and line numbers, this code sets a unique key. “HH1” represents the cluster number and “HH2” represents the household number. This code uses these to map unique households.
: If the household has already been added, this code extends the list representing the data by adding the current row to the list.
: If the household has not yet been added, this line adds a list with the current row of data.
: At the end of our function, this code returns the new data dictionary.
: Now we run the function by passing our rows of data and assigning the new dictionary to a variable we can use. This code sets the final dictionary to mn_dict, which we can now use to see how many unique households we have and how many surveys we have per household.

Warning

If you forget the return at the end of a function, your function will return None. Be on the lookout for return errors as you begin to write your own functions.

We found approximately 7,000 unique households, meaning a little over 2,000 men who were interviewed shared a household. The average number of men per household for this interview was 1.3. Simple aggregations like this can give us some larger insights into our data and help us conceptualize what it means and what questions we can answer with the data we have available.

Summary

In this chapter, you learned the basics of data cleanup and why it’s an essential step in your data wrangling process. You’ve seen some raw MICS data and interacted with it firsthand. You’re now able to look at data and identify where you might have data cleanup issues. You can now also find and remove erroneous data and duplicates.

The new concepts and libraries introduced in this chapter are detailed in Table 7-2.

Table 7-2. New Python and programming concepts and libraries
Concept/Library	Purpose
List generators	Enable quick and easy list assembly using an iterator, a function, and/or an if statement to further clean and process your data.
Dictionary `values` method	Returns a list of the dictionary’s values. Great for using to test membership.
`in` and `not in` statements	Test membership. Usually used with strings or lists. Return a Boolean value.
List `remove` method	Removes the first matching item passed from the list. Useful when you know exactly what you want out of an already-created list
`enumerate` method	Takes any iterable and returns a counter of what element you are on along with the value of that element as a tuple.
List `index` method	Returns the first matching index of the passed item in the list. If no match, returns `None`.
String `format` method	Enables you to easily make a readable string from a series of data. Uses `{}` as data placeholders and expects a matching number of data points to be passed. Can also be used with a dictionary using key names and can be used with a variety of string formatters.
String formatting (`.4f`, `.2%`, `,`)	Flags used to format numbers into easily readable strings.
`datetime` `strptime` and `strftime` methods	Enables you to easily format Python date objects into strings and create date objects out of strings.
`datetime` timedelta objects	Represents the difference between two Python date objects or modifies a date object (e.g., add or subtract time).
`if not` statements	Test whether the following statement is not `True`. Opposite Boolean logic from `if` statements.
`is` statements	Test if the first object is the same as the other object. Great for type testing (e.g., `is None`, `is list`). For more reading on `is`, check out Appendix E.
String `isdigit` and `isalpha` methods	Test if the string object contains only digits or only letters. Returns a Boolean.
String `find` method	Returns the index location of the passed substring in the string object. Will return `-1` if it can’t find a match.
Python set objects	A collection class of only unique elements. Behaves much like a list but with no duplicate values. Has numerous methods for comparison (`union`, `intersection`, `difference`).
`numpy` package	An essential mathematical Python library, used as part of the SciPy stack.
FuzzyWuzzy library	A library used for fuzzy matching of strings.
Regular expressions and the Python `re` library	Enable you to write patterns and find matches in strings.

As you move into the next chapter, you’ll keep honing those cleanup and data analysis skills and use them to better organize and replicate your cleanup tasks. We’ll review normalizing and standardizing data and how we can script and test our data cleanup.

Previous Chapter

6. Acquiring and Storing Data

Next Chapter

8. Data Cleanup: Standardizing and Scripting

Table of Contents for Data Wrangling with Python

Chapter 7. Data Cleanup: Investigation, Matching, and Formatting

Why Clean Data?

Note

Tip

Data Cleanup Basics

Identifying Values for Data Cleanup

Tip

Replacing headers

Tip

Zipping questions and answers

Tip

Tip

Formatting Data

Note

Tip

Note

Finding Outliers and Bad Data

Warning

Tip

Note

Note

Finding Duplicates

Warning

Note

Fuzzy Matching

Warning

RegEx Matching

Note

What to Do with Duplicate Records

Warning

Summary

Table of Contents for
Data Wrangling with Python