Now that you’ve spent time acquiring and cleaning your data, you are ready to start analyzing! It’s important to approach your data exploration with very few expectations for outcomes. Your question could be too broad for a singular answer, or it might not have a conclusive answer. Recall learning about hypotheses and conclusions in your first science course? It’s best to approach your data exploration with those same methods in mind—and with an understanding that you may not find a clear conclusion.
That said, just exploring the data and finding there are no trends or the trends don’t match your expectations is part of the fun. If everything was how we expected it to be, data wrangling would be a bit boring. We’ve learned to expect little and explore a lot.
As you begin to analyze and explore your data, you might realize you need more data or different data. That’s all part of the process and should be embraced as you further define the questions you aim to answer and examine what the data is telling you.
Now is also a great time to revisit the initial questions you had when you found your dataset(s). What do you want to know? Are there other related questions to aid your exploration? Those questions might point you in a direction where you find a story. If not, they might lead to other interesting questions. Even if you can’t answer your initial question, you can reach a greater understanding of the topic and discover new questions to explore.
In this chapter, we will learn about some new Python libraries for data exploration and analysis, and continue to apply what we’ve learned about cleaning our data from the previous two chapters. We’ll take a look at how to join datasets, explore the data, and come to statistical conclusions about relationships in our datasets.
You learned how to parse and clean your data in the preceding chapters, so you should be familiar with interacting with your data in Python. Now we will take a more in-depth look at exploring data with Python.
To begin with, we will install a helper library, agate, that will allow us to start looking at some basic features of our data. It’s a data-analysis library built by Christopher Groskopf, an accomplished data journalist and Python developer, and it can help us get to know our data. To install the library, use pip:
pip install agate
This chapter is compatible with agate 1.2.0. Because agate is a relatively young library, it’s possible some of this functionality will change as the library matures. To ensure you install a particular version of a library, you can set that as you use pip. For this book, you can install agate using: pip install agate==1.2.0. We recommend you also test the latest and feel free to keep track of code changes on the book’s repository.
We want to investigate some of the features of the agate library. To do that, we’ll be using the data we acquired from UNICEF’s annual report on child labor.
To begin, we’ll take a look at our first dataset—UNICEF’s child labor summary data. The data we downloaded was an Excel file containing listings of different percentages of child labor around the world. We can use what we learned about Excel files and data cleaning from Chapters 4 and 7 to get the data into a format accepted by the agate library.
As we work through the Excel sheet, we recommend having the sheet open in your preferred Excel viewer. This makes it easier to compare what Python “sees” with what you see in the sheet, facilitating navigation and extraction.
First, we want to import the libraries we anticipate needing and load the Excel file into an xlrd workbook object:
import xlrd
import agate

workbook = xlrd.open_workbook('unicef_oct_2014.xls')
workbook.nsheets
workbook.sheet_names()
We now have our Excel data in a variable called workbook. This workbook contains one sheet, named Child labour.
If you’re running this in your IPython terminal (recommended, as you’ll see more output), you should see the following:
In [6]: workbook.nsheets
Out[6]: 1

In [7]: workbook.sheet_names()
Out[7]: [u'Child labour ']
Try selecting the sheet so you can import it into the agate library. According to the agate library documentation, it can import data with a list of titles, a list of types for the columns of data, and a data reader (or an iterable list of the data). So, we’ll need the data types so we can properly import the data from the sheet into the agate library:
sheet = workbook.sheets()[0]
sheet.nrows
sheet.row_values(0)

for r in range(sheet.nrows):
    print r, sheet.row(r)

nrows identifies how many rows are in our sheet.

row_values allows us to select a single row and display its values. In this case, it shows the title, as it is on the first line of the Excel file.

By using range and a for loop to iterate over every row, we can see each line as Python sees it. The sheet’s row method will return some information on the data and data type for each row.
We know from Chapter 3 that the csv library takes an open file and turns it into an iterator. An iterator is an object we can iterate, or loop, over, returning each of its values one at a time. In code, an iterator is a more efficient way of unpacking data than a list, because it yields values as they are needed rather than holding everything in memory at once.
Because we’re working with a relatively small dataset, we can create a list and pass it in place of an iterator. Most libraries that require an iterator work fine with any iterable object (like a list). This way, we’re still complying with what our xlrd and agate libraries expect.
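As a minimal, hypothetical illustration (the filename here is made up and is not part of our dataset), both the iterator returned by csv.reader and a plain list built from it can be looped over the same way:

import csv

# csv.reader returns an iterator that yields one row at a time
reader = csv.reader(open('some_file.csv', 'rb'))

# building a list from the iterator pulls every row into memory at once
rows = list(reader)

# either object can be looped over in the same way
for row in rows:
    print row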
First, let's get the titles of our columns. From our previous output, we can see the titles are spread across rows 4 and 5. We will use zip to grab both title rows:
title_rows = zip(sheet.row_values(4), sheet.row_values(5))
title_rows
Now we can see the value of our title_rows variable is:
[('', u'Countries and areas'),
 (u'Total (%)', ''),
 ('', ''),
 (u'Sex (%)', u'Male'),
 ('', u'Female'),
 (u'Place of residence (%)', u'Urban'),
 ('', u'Rural'),
 (u'Household wealth quintile (%)', u'Poorest'),
 ('', u'Second'),
 ('', u'Middle'),
 ('', u'Fourth'),
 ('', u'Richest'),
 (u'Reference Year', ''),
 (u'Data Source', '')]
Using both rows retains extra information we would lose if we chose only one. It's not a perfect match, and we could spend some extra time improving it, but for an initial exploration of the data it's a good first start. The title data is currently in a list of tuples, and the agate library expects the column titles to be a list of strings, so we need to turn our title tuples into strings:
titles = [t[0] + ' ' + t[1] for t in title_rows]
titles
titles = [t.strip() for t in titles]
In this code, we use two list comprehensions. In the first one, we pass our title_rows list, which is a list of tuples holding the strings from the Excel file's title rows.
The first list comprehension takes both parts of each tuple (using indexing) and joins them into one string, adding a space ' ' between them for readability. Now our titles list is just a list of strings—the tuples are gone! We've made the titles a bit messier though, as not every tuple has two values. By adding the space, we created some titles with leading spaces, like ' Female'.
To remove the leading spaces, in the second list comprehension we use the strip string method, which removes leading and trailing whitespace. Now our titles variable holds a clean list of strings for use with the agate library.
Our titles are sorted out, so now we need to choose which lines to use from our Excel file. Our sheet has both country and continent data, but let’s focus on the country data. We want to avoid accidentally mixing in the totals with our data. We know from our previous code output that lines 6–114 have the data we want to use. We will use the row_values method to return the values of the rows from the xlrd sheet object:
country_rows = [sheet.row_values(r) for r in range(6, 114)]
Now we have our titles and our data list, so we only need to define the types to import it into the agate library. According to the documentation on defining columns, we have text, Boolean, number, and date columns, and the library's authors advise us to use text if we are unsure of the type. There is also a built-in TypeTester we can use to guess the types; a short sketch of that alternative appears after the type-mapping walkthrough below. To begin, we are going to use some of the xlrd built-ins to help define our columns:
from xlrd.sheet import ctype_text
import agate

text_type = agate.Text()
number_type = agate.Number()
boolean_type = agate.Boolean()
date_type = agate.Date()

example_row = sheet.row(6)
example_row
example_row[0].ctype
example_row[0].value
ctype_text

By performing a visual check on this row, we see we have good data. Other than one empty column, xlrd is identifying all of the data.

In these lines, we call the ctype and value attributes to get the type and the value of each cell.
You can easily discover new methods and attributes in IPython: type the name of an object you're curious about, add a period at the end, and press Tab. This will populate a list of attributes and methods you can explore further.

Using the ctype_text object from the xlrd library, we can map the integers returned by the ctype attribute to something readable. This is a great alternative to mapping types by hand.
This code gives us a better idea of the tools we have available to define types. The ctype attribute and the ctype_text object can be used to look up and display the data types of the cells in our example row.
Although it can seem like a lot of work to build type lists this way, the code is reusable. Reusing code snippets will save you countless tedious hours later and is a fun aspect of writing your own code.
Now we know what functions we can use to investigate Excel column types, so we need to try to make a list of types for our agate library. We will need to iterate over our example row and use ctype to match the column types:
types = []

for v in example_row:
    value_type = ctype_text[v.ctype]
    if value_type == 'text':
        types.append(text_type)
    elif value_type == 'number':
        types.append(number_type)
    elif value_type == 'xldate':
        types.append(date_type)
    else:
        types.append(text_type)

Maps the integers we found when we explored the ctype attribute of each cell, using the ctype_text dictionary, to make them readable. Now value_type holds the column type string (i.e., text, number, etc.).

Uses if and elif statements with the == operator to match value_type with the agate column types. Then, the code appends the proper type to the list and moves on to the next column.

As advised by the library’s documentation, if there is no type match, we append the text column type.
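If you'd rather not map the types by hand, here is a quick sketch of the TypeTester alternative mentioned earlier. It follows the current agate documentation (the exact interface may differ in agate 1.2.0), the forced column is only an illustration, and it may still trip over the messy '-' values we clean up shortly:

import agate

# TypeTester guesses each column's type from the data; force lets us
# pin a column to a specific type if the guess would be wrong.
tester = agate.TypeTester(force={
    'Countries and areas': agate.Text(),
})

# In recent agate versions, the tester can be passed in place of the
# explicit list of column types.
table = agate.Table(country_rows, titles, tester)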
Now we've put together a loop that takes an empty list, iterates over the columns, and builds a full list of the column types for our dataset. After running the code we have our types, our titles, and a list of our data. We can pass all three to the agate Table constructor by running the following line of code:
table = agate.Table(country_rows, titles, types)
When you run the code you should see a CastError, with the message Can not convert value "-" to Decimal for NumberColumn.
As we covered in Chapters 7 and 8, learning how to clean your data is an essential part of data wrangling. Writing well-documented code allows you to save time in the future.
By reading this error message, we realize we have some bad data lying around in one of our number columns. Somewhere in our sheet, a '-' appears where there should be a '', which would be properly processed as null. We can write a function to handle this problem:
def remove_bad_chars(val):
    if val == '-':
        return None
    return val

cleaned_rows = []

for row in country_rows:
    cleaned_row = [remove_bad_chars(rv) for rv in row]
    cleaned_rows.append(cleaned_row)

Defines a function to remove bad characters (like '-' in an integer column)

Tests whether the value is equal to '-', the bad character we found

If the value is '-', returns None

Iterates through country_rows to create newly cleaned rows with the proper data

Creates a cleaned_rows list holding the clean data (using the append method)
When we write functions to modify values, keeping a default return outside of the main logic (like in this example) ensures we always return a value.
By using this function, we can make sure our integer columns have None types instead of '-'. None tells Python that it’s null data, and to ignore it when analyzing it in comparison with other numbers.
Because it seems like this type of cleaning and changing might be something we’ll want to reuse, let’s take some of the code we have already written and make it into a more abstract and generic helper function. When we created our last cleaning function, we made a new list, iterated over all of the rows, and then iterated over each individual row to clean the data and return a new list for our agate table. Let’s see if we can take those concepts and abstract them:
def get_new_array(old_array, function_to_clean):
    new_arr = []
    for row in old_array:
        cleaned_row = [function_to_clean(rv) for rv in row]
        new_arr.append(cleaned_row)
    return new_arr

cleaned_rows = get_new_array(country_rows, remove_bad_chars)

Defines our function with two arguments: the old data array, and the function to clean the data.

Reuses our code with more abstract names. At the end of the function, returns the new clean array.

Calls the function with the remove_bad_chars function and saves it in cleaned_rows.
Now let’s retry our code to create the table:
In [10]: table = agate.Table(cleaned_rows, titles, types)

In [11]: table
Out[11]: <agate.table.Table at 0x7f9adc489990>
Hooray! We have a table variable holding a Table object. We can now look at our data using the agate library functions. If you’re curious what the table looks like, have a quick look by using the print_table method like so:
table.print_table(max_columns=7)
If you’ve been following along using IPython and you’d like to ensure you have these variables in your next session, use %store. If we want to save our table we can simply type %store table. In our next IPython session we can restore table by typing %store -r. This will be useful to “save” your work as you analyze your data.
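For example, a quick illustrative IPython exchange (the echoed output may differ slightly depending on your version):

In [12]: %store table
Stored 'table' (Table)

# ...later, in a new IPython session...
In [1]: %store -r

In [2]: table.print_table(max_columns=7)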
Next, we’ll take a deeper look at our table using some built-in research tools.
The agate library gives us many functions to investigate our data. First, we'll try out the sorting methods. Let's order our table by the total percentage column so the most egregious nations appear first, and use the limit method to see the top 10 offenders:
table.column_names

most_egregious = table.order_by('Total (%)', reverse=True).limit(10)

for r in most_egregious.rows:
    print r

Checks the column names so we know what column to use.

Chains the order_by and limit methods to create a new table. Because order_by will order from least to greatest, we are using the reverse argument to say we’d like to see the largest numbers first.

Using the new table’s rows attribute, iterates through the top 10 worst countries for child labor.
Running the code returns a list of the 10 countries with the highest incidence of child labor. In terms of percentages of children working, there are a high number of African countries at the top of the list. It’s our first interesting find! Let’s keep exploring. To investigate which countries have the most girls working, we can use the order_by and limit functions again. This time, we need to apply them to the Female percentage column:
most_females = table.order_by('Female', reverse=True).limit(10)

for r in most_females.rows:
    print '{}: {}%'.format(r['Countries and areas'], r['Female'])
When first exploring your data, use the Python format function to make your output easier to read rather than simply printing out each row. This means you can remain focused on the data instead of struggling to read it.
We see we have some None percentages. That isn’t what we expected! We can remove those using the agate table’s where method, as shown in the following code. This method is similar to a SQL WHERE statement or a Python if statement. where creates another table including only the fitting rows:
female_data = table.where(lambda r: r['Female'] is not None)
most_females = female_data.order_by('Female', reverse=True).limit(10)

for r in most_females.rows:
    print '{}: {}%'.format(r['Countries and areas'], r['Female'])
First, we created the female_data table, which uses the Python lambda function to ensure each row has a value for the Female column. The where function takes the Boolean value from the lambda function and separates out only the rows where it returns True. Once we’ve separated out the rows with only female child labor values, we use the same sorting, limiting, and formatting technique to see the countries with the highest incidence of female child labor.
In reviewing the data, we see many of the same countries we saw in our overall percentages. We’ve reviewed a bit of filtering and sorting, so let’s take a look at some of the built-in statistical functions in the agate library. Say we wanted to find the average percentage of child labor in cities. To do so, we would pull the mean out of the column Place of residence (%) Urban:
table.aggregate(agate.Mean('Place of residence (%) Urban'))
In this code snippet, we call the table’s aggregate method, using the agate.Mean() statistical method and the column name to return the numerical mean of that column. You can see other statistical aggregates you can use on columns in the agate documentation.
When you run the code, you should receive a NullComputationWarning. As you can probably guess from the name and our previous experience, this means we likely have some null rows in the Place of residence (%) Urban column. We can again use the where method to focus on the urban averages:
has_por = table.where(lambda r: r['Place of residence (%) Urban'] is not None)

has_por.aggregate(agate.Mean('Place of residence (%) Urban'))
You'll notice you get the same value. This is because agate does the same thing behind the scenes: it skips the null values and computes the average of what's left. Let's take a look at some of the other math we can do with the place of residence table. We can see the minimum (Min), maximum (Max), and average (Mean) of the place of residence columns.
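For instance, here is a small sketch of those aggregations run against our null-filtered has_por table (we stick to the urban column here, since the Rural column may also contain nulls):

has_por.aggregate(agate.Min('Place of residence (%) Urban'))
has_por.aggregate(agate.Max('Place of residence (%) Urban'))
has_por.aggregate(agate.Mean('Place of residence (%) Urban'))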
Say we want to find one of the rows with more than 50% of rural child labor. The agate library has a find method that uses a conditional statement to find the first match. Let’s try writing out our question in code:
first_match = has_por.find(lambda x: x['Rural'] > 50)

first_match['Countries and areas']
The row returned is the first match, and we can see the name as we would in a normal dictionary. One nice final step we’d like to do in our first exploration of the agate library is to use the compute method alongside the agate.Rank() statistical method to add a ranked column based on the values from another column.
Ranking your data based on one column is a great way to do a good “gut check” when you’re comparing datasets.
To see the ranks of the worst offenders in terms of child labor percentages, we can use the Total (%) column and rank the data accordingly. Before we join this data with other datasets, we'll want an easy-to-see rank column to help us compare the joined data. Because we want the countries with the highest percentages to appear at the top of the list, we need to rank in descending order by using the reverse=True argument:
ranked = table.compute([('Total Child Labor Rank',
                         agate.Rank('Total (%)', reverse=True)), ])

for row in ranked.order_by('Total (%)', reverse=True).limit(20).rows:
    print row['Total (%)'], row['Total Child Labor Rank']
If we wanted to calculate the rank in another way, we could create a column with the inverse percentages. Instead of having the total percentage of children involved in child labor in each country, we could have the percentage of children not involved in child labor. This would then allow us to use the agate.Rank() method without reverse:
def reverse_percent(row):
    return 100 - row['Total (%)']

ranked = table.compute([('Children not working (%)',
                         agate.Formula(number_type, reverse_percent)), ])

ranked = ranked.compute([('Total Child Labor Rank',
                          agate.Rank('Children not working (%)')), ])

for row in ranked.order_by('Total (%)', reverse=True).limit(20).rows:
    print row['Total (%)'], row['Total Child Labor Rank']

Creates a new function to calculate and return the inverse percentage if given a row.

Uses the agate library’s compute method, which adds new columns when passed a list. Each list item should be a tuple whose first item contains the column name and whose second item computes the new column. Here, we are using the Formula class, which also requires an agate type, alongside the function to create that column value.

Creates the Total Child Labor Rank column with a proper ranking using our Children not working (%) column.
As you can see, compute is a great tool to calculate a new column based on another column (or a few other columns). Now that we have our ranking, let’s see if we can join some new data to our child labor dataset.
When investigating datasets to join with our child labor data, we hit a lot of dead ends. We tried to compare agricultural versus service economies using WorldBank data, but didn’t find any good links. We did more reading and found some people correlated child labor with HIV rates. We took a look at those datasets but did not find a compelling overall trend. Along those same lines, we wondered if homicide rates had an effect on child labor rates—but again, we found no clear link.1
After many such dead ends, an errant thought occurred while perusing the data and reading some more articles. Would government corruption (or perceived government corruption) affect child labor rates? When reading stories about child labor, there are often links with antigovernment militias, schools, and industries. If a populace doesn't trust the government and must create non-state-sanctioned spaces, this could be a reason to enlist all those willing to work and help (even children).
We located Transparency International’s Corruption Perceptions Index and decided to compare this dataset with our UNICEF child labor data. First, we needed to import the data into Python. Here is how to do that:
cpi_workbook = xlrd.open_workbook('corruption_perception_index.xls')
cpi_sheet = cpi_workbook.sheets()[0]

for r in range(cpi_sheet.nrows):
    print r, cpi_sheet.row_values(r)

cpi_title_rows = zip(cpi_sheet.row_values(1), cpi_sheet.row_values(2))
cpi_titles = [t[0] + ' ' + t[1] for t in cpi_title_rows]
cpi_titles = [t.strip() for t in cpi_titles]
cpi_rows = [cpi_sheet.row_values(r) for r in range(3, cpi_sheet.nrows)]
cpi_types = get_types(cpi_sheet.row(3))
We are again using xlrd to import the Excel data and reusing the code we’ve written to parse our titles and get the data ready to import in our agate library. But before you can run the last item, which calls a new function, get_types, we need to write some code to help define types and create a table:
def get_types(example_row):
    types = []
    for v in example_row:
        value_type = ctype_text[v.ctype]
        if value_type == 'text':
            types.append(text_type)
        elif value_type == 'number':
            types.append(number_type)
        elif value_type == 'xldate':
            types.append(date_type)
        else:
            types.append(text_type)
    return types


def get_table(new_arr, types, titles):
    try:
        table = agate.Table(new_arr, titles, types)
        return table
    except Exception as e:
        print e
We are using the same code we wrote earlier to create the function get_types, which takes an example row and outputs a list of the types for our agate library. We’ve also built a get_table function, which uses Python’s built-in exception handling.
You may be asking: isn't catching a broad except Exception in our get_table function bad practice? This is a great question! We always want to be specific in our code; however, when you are first experimenting with a library or a dataset, you might not know what errors to anticipate.
To write specific exceptions, you need to predict what types of exceptions your code might throw. There are built-in Python exception types, and libraries also define their own exceptions that may be unfamiliar to you. If you are using an API library, for example, the authors might write a RateExceededException to indicate you are sending too many requests. When a library is new to us, using an except Exception block with print or logging helps us learn more about these errors.
When you write an except block, you can store the exception in a variable e by adding as e at the end of your exception line (before the colon). Because we are printing the e variable holding the exception, we can learn more about the raised exceptions. Eventually we will rewrite the except Exception block with more specific exceptions, or a series of exception blocks, so our code runs smoothly and predictably.
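For instance, once we learn that agate raises a CastError for bad numeric values, we could narrow the block. This is a sketch rather than the book's final code, and it assumes the exception is importable from agate.exceptions in your version of the library:

from agate.exceptions import CastError


def get_table(new_arr, types, titles):
    try:
        table = agate.Table(new_arr, titles, types)
        return table
    except CastError as e:
        # handle (or log) only the error we know how to deal with
        print e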
Now we have a get_table function to track our agate library exceptions and anticipate ways to improve our code. We can use our new functions to get the perceived corruption index data into our Python code. Try running this:
cpi_types = get_types(cpi_sheet.row(3))
cpi_table = get_table(cpi_rows, cpi_types, cpi_titles)
Payoff! When you run the code, instead of the function breaking completely, our new get_table function allows you to see the thrown errors. Duplicate titles probably mean we have some bad titles in our title list. Check it out by running this:
cpi_titles
We can see the problem: we have two of the Country Rank columns. By looking at the Excel data in the spreadsheet, we see we do indeed have duplicate columns. For expediency, we’re not going to worry about removing the duplicate data, but we do need to handle the duplicate column names. We should add Duplicate to one of them. Here’s how to do that:
cpi_titles[0] = cpi_titles[0] + ' Duplicate'
cpi_table = get_table(cpi_rows, cpi_types, cpi_titles)
We are replacing the first title with Country Rank Duplicate and trying again to make our new cpi_table:
cpi_rows = get_new_array(cpi_rows, float_to_str)
cpi_table = get_table(cpi_rows, cpi_types, cpi_titles)
Now we have our cpi_table without any errors. We can work on joining it with our child labor data and see what connections we can make between the two datasets. In the agate library, we have an easy-to-use method for joining tables: the join method. This join method emulates SQL by joining two tables together based on one shared key. Table 9-1 summarizes the different joins and their functionality.
| Join type | Function |
|---|---|
| Left outer join | Preserves all rows from the left table (or first table in the join statement), filling in values from the right table where rows match and null values where they don't. |
| Right outer join | Uses the right table (second table in the join statement) as the base, preserving all of its rows and filling in null values where the left table has no match. |
| Inner join | Returns only the rows matching both tables using the shared key(s). |
| Full outer join | Preserves all rows from both tables, still combining rows on the shared key(s) when they align properly. |
If your data doesn’t exactly match up or have a one-for-one relationship and you are using an outer join, you will have rows with null values. When the tables don’t match up, an outer join keeps the data from the tables intact and replaces missing data with null values. This is great if you’d like to keep the mismatched data because it’s essential for your reporting.
If we wanted to join table_a and table_b but make sure we didn’t lose any table_a data, we would write something like this:
joined_table = table_a.join(table_b, 'table_a_column_name', 'table_b_column_name')
In the resulting joined_table, we will have all of the table_a values that match with the table_b values based on the column names we passed. If there are values in table_a that don’t match table_b, we will keep those rows, but they will have null values for the table_b columns. If there are values in table_b not matched in table_a, they will be excluded from our new table. Choosing which table to place first and specifying what type of join to use is important.
What we want, however, is not to have null values. Our questions revolve around how the values correlate, so for that, we want to use an inner join. The agate library’s join method allows us to pass inner=True, which will make an inner join retaining only matching rows, with no null rows from the join.
We’ll try a join with our child labor data and our newly formed cpi_table. When looking at our two tables, we can likely match them up on the names of the countries/territories. In our cpi_table we have the column Country / Territory, and in the child labor data we have the Countries and areas column. To join the two tables, run the following line of code:
cpi_and_cl = cpi_table.join(ranked, 'Country / Territory',
                            'Countries and areas', inner=True)
Our new table, cpi_and_cl, has our matching rows. We can see this by printing out a few of the values and investigating the new joined columns, like so:
cpi_and_cl.column_names

for r in cpi_and_cl.order_by('CPI 2013 Score').limit(10).rows:
    print '{}: {} - {}%'.format(r['Country / Territory'],
                                r['CPI 2013 Score'], r['Total (%)'])
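To check the size of the joined table yourself, one quick sketch is to count its rows:

len(cpi_and_cl.rows)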
When you look at the column names, you can see we now have all of the columns from both tables. A simple count of the data returns 93 rows. We don't need all of the data points (cpi_table has 177 rows, ranked has 108), especially since we really want to see the data correlated together. Did you notice anything else when you printed out the new joined table after sorting by CPI score? We only took the top 10 rows, but some interesting information is becoming clear:
Afghanistan: 8.0 - 10.3%
Somalia: 8.0 - 49.0%
Iraq: 16.0 - 4.7%
Yemen: 18.0 - 22.7%
Chad: 19.0 - 26.1%
Equatorial Guinea: 19.0 - 27.8%
Guinea-Bissau: 19.0 - 38.0%
Haiti: 19.0 - 24.4%
Cambodia: 20.0 - 18.3%
Burundi: 21.0 - 26.3%
With the exception of Iraq and Afghanistan, there are some pretty high child labor rates among the countries with very low CPI scores (i.e., high perception of corruption). Using some of the agate library’s built-in methods, we can investigate such correlations in our datasets.
The agate library has some great tools for simple statistical analysis of your datasets. These are a good first toolset—you can often start with the agate library tools and then move on to more advanced statistical libraries, including pandas, numpy, and scipy, as needed.
We want to determine whether perceived government corruption and child labor rates are related. The first tool we'll use is a simple Pearson's correlation. At the time of writing, this correlation was being built into the agate-stats library; until then, we can compute it with numpy. A correlation coefficient (like Pearson's) measures how strongly two variables are related, that is, how much one tends to change as the other changes, though it cannot tell us whether one causes the other.
If you haven’t already installed numpy, you can do so by running pip install numpy. Then, calculate the correlation between child labor rates and perceived government corruption by running the following line of code:
import numpy

numpy.corrcoef(cpi_and_cl.columns['Total (%)'].values(),
               cpi_and_cl.columns['CPI 2013 Score'].values())[0, 1]
We first get an error that looks similar to the CastError we saw before. Because numpy expects floats, not decimals, we need to convert the numbers back to floats. We can use list comprehensions for this:
numpy.corrcoef(
    [float(t) for t in cpi_and_cl.columns['Total (%)'].values()],
    [float(s) for s in cpi_and_cl.columns['CPI 2013 Score'].values()])[0, 1]
Our output shows a slight negative correlation:
-0.36024907120356736
A negative correlation means as one variable increases, the other variable decreases. A positive correlation means the numbers increase or decrease together. Pearson’s correlation values range from –1 to 1, with 0 meaning no correlation and –1 and 1 meaning very strong correlations.
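As a toy illustration (not from our dataset), a perfectly correlated pair and a perfectly inversely correlated pair look like this:

import numpy

numpy.corrcoef([1, 2, 3], [2, 4, 6])[0, 1]   # 1.0: the values rise together
numpy.corrcoef([1, 2, 3], [6, 4, 2])[0, 1]   # -1.0: one falls as the other rises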
Our value of –.36 indicates a weak correlation, but a correlation nevertheless. We can use this knowledge to dive deeper into these datasets and what they mean.
As your data analysis progresses, you will want to use some other statistical methods to interpret your data. One starting point is to identify outliers.
Outliers occur when particular rows of data significantly differ from other parts of the dataset. Outliers usually tell us part of the story. Sometimes removing them will show a significant trend. Other times they tell a story in and of themselves.
With the agate library, finding outliers is easy. There are two different methods: one uses standard deviations, and the other uses median absolute deviations. If you have studied some statistics and would like to rely on one or the other, feel free to do so! If not, analyzing both measures of variance and deviation in your datasets may unveil different revelations.2
If you already know the distribution of your data, you can apply the right ways to determine variance; but when you first explore your data, try looking at more than one method to determine the distribution and to learn more about your data's composition.
We're going to use the agate table's standard deviation outlier method. This method returns a table of values at least three deviations above or below the mean. Here is how you can see the standard deviation outliers using your agate table.
If you are working with data in IPython and need to install a new library, you can use IPython's %autoreload magic to reload your Python environment after installing the library in a different terminal. Try %load_ext autoreload and then %autoreload. Voilà! You have the new library without losing your progress.
First, you will need to install the agate-stats library by running pip install agate-stats. Then run the following code:
import agatestats

agatestats.patch()

std_dev_outliers = cpi_and_cl.stdev_outliers(
    'Total (%)', deviations=3, reject=False)

len(std_dev_outliers.rows)

std_dev_outliers = cpi_and_cl.stdev_outliers(
    'Total (%)', deviations=5, reject=False)

len(std_dev_outliers.rows)

Uses our child labor Total (%) column and the agate-stats stdev_outliers method to see if our child labor data has easy-to-find standard deviation outliers. We assign the output of this method to a new table, std_dev_outliers. We use the argument reject=False to specify we want to see the outliers. If we set reject to True, we would get just the values that are not outliers.

Checks how many rows of outliers were found. (The table has 94 rows total.)

Increases the number of deviations to find fewer outliers. (deviations=5).
We can see from the output that we don’t have a good grip on the distribution of the data. When we used the Total (%) column to try to identify outliers using three standard deviations, we got a table matching our current table. This is not what we want. When we used five deviations, we did not see a change in the result. This is telling us our data is not very regularly distributed. In order to figure out the actual variance in our data, we are going to have to investigate further and determine if we need to refine our data to a subset of the countries we are investigating.
We can test the variance of the Total (%) column using the median absolute deviation:
mad = cpi_and_cl.mad_outliers('Total (%)')

for r in mad.rows:
    print r['Country / Territory'], r['Total (%)']
Interesting! We did indeed identify a much smaller table of outliers, but we got a strange list of results:
Mongolia 10.4
India 11.8
Philippines 11.1
When we look at the list, we don’t see any of the top or bottom of our sample. This means that our dataset likely doesn’t play by the normal statistical rules for identifying outliers.
Depending on your dataset and the distribution of data, these two methods often do a great job of showing your data’s story in a meaningful way. If they don’t, as in the case of our dataset, move on and figure out what relationships and trends your data can tell you about.
Once you’ve explored the distribution of your data and the trends that distribution reveals, you’ll want to explore grouped relationships in your data. The following section explains how to group your data.
To further explore our dataset, we are going to create groupings and investigate their relationships. The agate library provides several tools to create groupings, as well as methods to aggregate those groupings and determine connections between them. Our child labor data is organized by country rather than by continent, so let's try grouping it geographically by continent and see if this reveals any connections with our perceived corruption data or allows us to draw any conclusions.
First, we need to figure out how to get the continent data. In this book’s repository, we have provided a .json file listing every country by continent. Using this data, we can add a column showing each country’s continent allowing us to group by continent. Here’s how we do that:
import json

country_json = json.loads(open('earth.json', 'rb').read())

country_dict = {}
for dct in country_json:
    country_dict[dct['name']] = dct['parent']


def get_country(country_row):
    return country_dict.get(country_row['Country / Territory'].lower())

cpi_and_cl = cpi_and_cl.compute([('continent',
                                  agate.Formula(text_type, get_country)), ])

Uses the json library to load the .json file. If you take a look at the file, you’ll see it’s a list of dictionaries.

Loops through the country_json data and adds each country name as a key in country_dict, with the continent as the value.

Creates a function that, when given a country row, returns the continent. It uses the Python string’s lower method, which replaces capital letters with lowercase ones. The .json file has all lowercase country names.

Creates a new column, continent, using the get_country function. We keep the same table name.
Now we have continents with our country data. We should do a quick check to make sure we didn’t miss anything. To do so, run this code:
for r in cpi_and_cl.rows:
    print r['Country / Territory'], r['continent']
Hmm, it looks like we have some missing data because we can see None types for some of our countries:
Democratic Republic of the Congo None
...
Equatorial Guinea None
Guinea-Bissau None
We’d rather not lose this data, so let’s take a look at why these rows aren’t matching. We want to only print out the lines that have no match. We can use agate to help us find them by running this code:
no_continent = cpi_and_cl.where(lambda x: x['continent'] is None)

for r in no_continent.rows:
    print r['Country / Territory']
Your output should be:
Saint Lucia
Bosnia and Herzegovina
Sao Tome and Principe
Trinidad and Tobago
Philippines
Timor-Leste
Democratic Republic of the Congo
Equatorial Guinea
Guinea-Bissau
There’s only a short list of countries with no continent data. We recommend just cleaning up the earth.json data file, as this will make it easier to use the same data file for joining this same data at a future time. If you instead use code to find the exceptions and match them, it will be hard to repeat with new data and you’ll need to change it every time.
In order to fix our matching in the .json file, we should figure out why the countries were not found. Open up the earth.json file and find a few of the countries from our no_continent table. For example:
{"name": "equatorial Guinea", "parent": "africa"},
...
{"name": "trinidad & tobago", "parent": "north america"},
...
{"name": "democratic republic of congo", "parent": "africa"},
As we can see from looking at our .json file, there are some small differences preventing us from properly finding the continents for these countries. This book's repository also contains a file called earth-cleaned.json, which is the earth.json file with the necessary changes made, such as adding "the" to the DRC entry and changing "&" to "and" for several countries. We can now rerun our code from the beginning of this section with the new file as our country_json data. You will need to start by rejoining the table so you don't have duplicate columns (using the same code we used earlier to join the two tables). After you've rerun those two pieces of code you should have no unmatched countries.
Let's try to group our now-complete continent data by continent and see what we find. The following code does just that:
grp_by_cont = cpi_and_cl.group_by('continent')

grp_by_cont

for cont, table in grp_by_cont.items():
    print cont, len(table.rows)

Uses the agate library’s group_by method, which returns a dictionary where the keys are the continent names and the values are new tables containing rows for that continent.

Iterates over the dictionary items to see how many rows are in each table. We are assigning the key/value pairs from items to the cont and table variables, so cont represents the key or continent name and table represents the value or table of matching rows.

Prints our data to review our groupings. We are using Python’s len function to count the number of rows we have for each table.
When we run that code, we get the following (note you may have a different order):
north america 12
europe 12
south america 10
africa 41
asia 19
We can see a numerical concentration in Africa and Asia compared to the other continents. This interests us, but group_by doesn’t easily give us access to aggregate data. If we want to start aggregating our data and creating summed columns, we should take a look at the aggregation methods in the agate library.
We notice the agate library's aggregate method, which, when called on a grouped table, takes a series of aggregate operations (like a sum) and calculates new columns based on the grouping.
After looking at the aggregate documentation, we are most interested in how the continents compare across perceived corruption and child labor. We want to use some statistical methods to take a look at the group as a whole (using Median and Mean) but also to identify the most egregious (Min for the CPI score and Max for the total child labor percentage). This should give us some nice comparisons:
agg = grp_by_cont.aggregate([('cl_mean', agate.Mean('Total (%)')),
                             ('cl_max', agate.Max('Total (%)')),
                             ('cpi_median', agate.Median('CPI 2013 Score')),
                             ('cpi_min', agate.Min('CPI 2013 Score'))])

agg.print_table()

Calls the aggregate method on our grouped table and passes a list containing tuples of new aggregate column names and agate aggregation methods (which utilize column names to compute the values for the new columns). We want the mean and max of the child labor percentage column and the median and min of the corruption perception score. You can use different aggregate methods depending on your questions and data.

Prints the new table so we can visually compare our data.
When you run that code, you should see this result:
|----------------+-----------------------------+--------+------------+----------|
|  continent     |                     cl_mean | cl_max | cpi_median |  cpi_min |
|----------------+-----------------------------+--------+------------+----------|
|  south america | 12,710000000000000000000000 |   33,5 |       36,0 |       24 |
|  north america | 10,333333333333333333333333 |   25,8 |       34,5 |       19 |
|  africa        | 22,348780487804878048780487 |   49,0 |       30,0 |        8 |
|  asia          |  9,589473684210526315789473 |   33,9 |       30,0 |        8 |
|  europe        |  5,625000000000000000000000 |   18,4 |       42,0 |       25 |
|----------------+-----------------------------+--------+------------+----------|
If we wanted to take a closer look at some other charts surrounding our data, we could use the agate table's print_bars method, which takes a label column (here, continent) and a data column (here, cl_max) to chart the child labor maximum in our IPython session. Its output is as follows:
In [23]: agg.print_bars('continent', 'cl_max')
continent      cl_max
south america    33,5 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
north america    25,8 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
africa           49,0 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
asia             33,9 ▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
europe           18,4 ▓░░░░░░░░░░░░░░░░░░░░
                      +----------+-----------------+--------------+-------------+
                      0,0       12,5              25,0           37,5         50,0
Now we have several easy-to-compare outputs of our continent data, and the picture is showing some trends. We notice Africa has the highest mean of the total child labor percentage column. It also has the highest maximum value, followed by Asia and South America. The comparatively low means for Asia and South America suggest that there are one or more outliers in these regions.
We see a fairly even median across our perceived corruption data, with Europe scoring the highest (i.e., least amount of corruption perceived). However, when we look at the minimums (or worst perceived corruption scores), we can see that Africa and Asia again represent the “worst” scores.
This shows there are several stories we can investigate further in these datasets. We were able to see a link (albeit weak) between perceived corruption and child labor. We were also able to investigate which countries and which continents are the worst offenders for child labor and perceived corruption. We can see Africa has a high rate of child labor and fairly high perceived corruption. We know in Asia and South America one or two countries might stand out in terms of child labor compared to their neighbors.
Our aggregation and exploration have only taken us so far. We can continue to use the tables we’ve created to tell more stories and investigate further.
There are some other powerful features in the agate library, and some other interesting statistical libraries that you can use to experiment with your own datasets.
Depending on your data and what your questions are, you may find some of these features and libraries more useful than others, but we strongly encourage you to find ways to experiment with a variety of tools. It will deepen your understanding of Python and data analysis libraries as well as your data analysis itself.
The agate-stats library has some interesting statistical methods we haven’t yet investigated. You can keep track of the new releases and functionality on GitHub.
In addition, we recommend continuing to play around using numpy. You can use numpy to calculate percentiles. You can also expand into using the scipy library and play around with the z score statistical methods for determining outliers.
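For example, here is a small sketch of a percentile calculation on our joined table's child labor column (the choice of the 90th percentile is arbitrary and only for illustration):

import numpy

# 90th percentile of the total child labor percentages
numpy.percentile(
    [float(t) for t in cpi_and_cl.columns['Total (%)'].values()], 90)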
If you have time-sensitive data, numpy has the ability to calculate column-by-column changes between data to investigate changes over time. agate can also compute change columns with time-sensitive data. Don’t forget to use the date type when forming your date columns, as this will enable you to do some interesting date analysis (such as percent change over time or time series mapping).
If you want to investigate with even more statistics, install the latimes-calculate library. This library has many statistical methods as well as some interesting geospatial analysis tools. If you have access to geospatial data, this library can provide you with some valuable tools to better understand, map, and analyze your data.
If you’d like to take your statistics one step further, we highly recommend Wes McKinney’s book Python for Data Analysis (O’Reilly). It introduces you to some of the more robust Python data analysis libraries, including pandas, numpy, and the scipy stack.
Take time to play around and explore your data with some of the methods and lessons we’ve already reviewed. We will now move on to analyzing our data further and determining some ways we can draw conclusions and share our knowledge.
Once you’ve played with a few more examples from the agate library’s Cookbook (an assortment of different methods and tools to use for investigation), you probably have enough familiarity with your data to begin your analysis.
What is the difference between data exploration and analysis? When we analyze data, we ask questions and attempt to answer them using the data at our disposal. We might combine and group datasets to create a statistically valid sampling. With exploration, we simply want to investigate trends and attributes of our datasets without trying to answer specific questions or come to conclusions.
With some basic analysis, we can attempt to determine the answers to the questions we uncovered in our exploration:
Why do there seem to be higher frequencies of child labor in Africa?
What child labor outliers exist in Asia and South America?
How do perceived corruption and child labor tie together?
For your dataset, you will have different questions, but try to follow our examples and find trends you’d like to investigate. Any statistical outliers or aggregation tendencies can point you to interesting questions to research.
To us, the most interesting question for our particular dataset concerns the connection between perceived corruption and child labor in Africa. Does government corruption, or the perception of government corruption, affect how communities are able to prohibit child labor?
Depending on the datasets you are using and what your data exploration has shown, you may have a variety of questions you are interested in pursuing. Try to focus on a specific question and answer it using your analysis. Repeat this for as many specific questions as you’d like. Focusing will help you determine good answers and keep your analysis clear.
Answering this question will require more investigation and more datasets. We might want to read more articles to see what’s been written on the topic. We might also want to call and interview experts in the field. Finally, we might want to focus and choose a particular region in Africa or series of countries to better evaluate the story of child labor. The next section explains how to do that.
For further analysis, we first need to separate out our data for African nations and investigate this subset of data more fully. We already know a lot of ways to filter with our agate library, so let’s start there. Here’s how to separate out the African data from the other data:
africa_cpi_cl = cpi_and_cl.where(lambda x: x['continent'] == 'africa')

for r in africa_cpi_cl.order_by('Total (%)', reverse=True).rows:
    print "{}: {}% - {}".format(r['Country / Territory'], r['Total (%)'],
                                r['CPI 2013 Score'])

import numpy

numpy.corrcoef(
    [float(t) for t in africa_cpi_cl.columns['Total (%)'].values()],
    [float(c) for c in africa_cpi_cl.columns['CPI 2013 Score'].values()])[0, 1]

africa_cpi_cl = africa_cpi_cl.compute([('Africa Child Labor Rank',
                                        agate.Rank('Total (%)', reverse=True)), ])

africa_cpi_cl = africa_cpi_cl.compute([('Africa CPI Rank',
                                        agate.Rank('CPI 2013 Score')), ])

Uses the where table method to filter only rows where the continent is africa.

Prints the rows with some formatting so we can view our data for a “gut check.” We want to make sure we have only African countries and we can see our total child labor percentages and CPI scores.

Shows whether the Pearson’s correlation has changed after separating out the most interesting data.

Adds a new ranking to show how the countries within our subset of data rank up just against one another.
With this subset of the data, we calculated a new Pearson’s correlation:
-0.404145695171
Our Pearson's correlation moved further from zero (from roughly -0.36 to -0.40), showing a slightly stronger relationship between child labor and perceived corruption in our African data than in the global data.
Now let’s see if we can identify good stories and find data points we’d like to investigate. We are going to find the mean values for perceived corruption and child labor percentages and show the countries with the highest child labor and worst perceived corruption (i.e., where the values are worse than the mean). Here’s how to do that:
cl_mean = africa_cpi_cl.aggregate(agate.Mean('Total (%)'))
cpi_mean = africa_cpi_cl.aggregate(agate.Mean('CPI 2013 Score'))


def highest_rates(row):
    if row['Total (%)'] > cl_mean and row['CPI 2013 Score'] < cpi_mean:
        return True
    return False

highest_cpi_cl = africa_cpi_cl.where(lambda x: highest_rates(x))

for r in highest_cpi_cl.rows:
    print "{}: {}% - {}".format(r['Country / Territory'], r['Total (%)'],
                                r['CPI 2013 Score'])

Pulls out the column averages we are most interested in: corruption score and child labor percentages.

Creates a function to identify countries with high child labor rates and low CPI scores (i.e., high corruption).

Passes each row to our highest_rates function via a lambda, which returns True or False. The function asks whether the country has higher than average child labor rates and worse than average perceived corruption, and where keeps only the rows for which this is True.
When we run the code we see some interesting output. Of particular interest are these rows:
Chad: 26.1% - 19.0
Equatorial Guinea: 27.8% - 19.0
Guinea-Bissau: 38.0% - 19.0
Somalia: 49.0% - 8.0
Our output shows some data in the “middle” that is not too far off the mean, but then these worst offenders with lower corruption scores and higher child labor percentages. Because we are interested in why there are high child labor rates and how corruption affects child labor, these would be our best case studies.
As we continue our research, we want to identify what is happening in these specific countries. Are there films or documentaries related to young people or child labor in these countries? Are there articles or books written on the topic? Are there experts or researchers we can contact?
When we look more deeply into these countries, we see some stark realities: child trafficking, sexual exploitation, maliciously acting religious groups, street and domestic labor. Are these realities connected to disenfranchisement? To a public who cannot trust the government? Can we trace commonalities among these countries and their neighbors? Can we determine elements or actors helping ease the problems?
It would be interesting to look into the effects of political and generational changes over time. We could peruse the backlog of UNICEF data or focus in one country and utilize UNICEF’s Multiple Indicator Cluster Survey data to understand changes through the decades.
For your own dataset, you need to determine what possibilities you have for future exploration. Can you find more data for your investigation? Are there people you can interview or trends you can identify over a long period of time? Are there books, movies, or articles on the topic that can shed more light? Your analysis is the beginning of future research.
Now that we’ve explored and analyzed our data, we can begin figuring out what the data is telling us. As we experienced when we first started looking at our child labor data, sometimes your data has no connections, it tells no story, and it’s not correlated. That’s OK to find out!
Sometimes finding no correlation tells you to keep researching to find actual existing connections. Sometimes not finding a connection is a revelation in and of itself.
In data analysis, you search for trends and patterns. Most of the time, as we saw with our child labor data, analysis is a starting point for further research. As much as your numbers tell a story, adding a human voice or another angle is a great way to expand on the connections and questions revealed by your analysis.
If you find some connections, even weak ones, you can dig deeper. Those connections lead to better questions and more focused research. As we saw with our child labor data, the more focused we became in our research, the easier it was to see connections. It’s great to start broad, but important to finish with a more refined view.
Once you’ve analyzed your data and understand the connections, you can start determining what you can conclude. It’s essential you have a real understanding of your datasets and the topic, so you can have a firm backing for your ideas. With your data analysis, interviews, and research completed, your conclusions are formed, and you simply need to determine how to share them with the world.
If you have trouble finding a definitive conclusion, it’s OK to include open questions in your findings. Some of the biggest stories start with just a few simple questions.
If you can cast light on the topic and point out the need for more documentation, research, and action in order to draw complete conclusions, that is an important message itself. As we found with our investigation, it’s hard to say if government corruption causes high child labor rates, but we can say there is a weak correlation between the two, and we’d like to research and analyze the way they are linked—particularly in certain African nations.
Once you’ve found some conclusions and more questions you’d like to research, you should begin documenting your work. As part of your documentation and final presentation, you should be clear about what sources you used and how many data points you analyzed. In our subset, we investigated only ~90 data points, but they represented the segment we wanted to study.
You may find the dataset you focus on is smaller than anticipated. As long as you are clear about your methods and the reasons for the smaller subset, you will not lead your audience or reporting astray. In the next chapter, we’ll dive deeper into reporting our findings, documenting our thoughts and processes as we share our conclusions with the world.
In this chapter, we explored and analyzed our datasets using some new Python libraries and techniques. You were able to import datasets, join them, group them, and create new datasets based on the findings.
You can now utilize statistical methods to find outliers and measure correlation. You can determine solid, answerable questions to investigate by separating out interesting groupings and diving deeper into your exploration. If you’ve been using IPython and %store to save your variables, we will be interacting more with them in the next chapter.
You should now feel comfortable:
Evaluating your data using the agate library
Determining what, if anything, is significant in your data
Finding holes in your data or parts of the data you’d need to further investigate to come to conclusions
Challenging your assumptions by analyzing and exploring your data
The new concepts and libraries we’ve covered are summarized in Table 9-2.
| Concept/Library | Purpose |
|---|---|
| agate library | Data analysis made easy with the ability to easily read in data from a CSV, make tables for analysis, run basic statistical functions, and apply filters to gain insight into your dataset. |
| xlrd ctype and ctype_text | Allow you to easily see what cell type your data is in when using xlrd to parse Excel files. |
| isinstance | Tests the type of a Python object. Returns a Boolean value if the types match. |
| lambda functions | One-line functions in Python, great for simple filtering or parsing of your dataset. Be careful not to write a lambda that's too complex to read and understand at a glance. |
| Joins (inner, outer, left, right) | Allow you to join two different datasets on one or more matching fields. Depending on how you join your data (inner/outer and left/right), you will get different datasets. Take time to think about what join fits your needs. |
| Exception handling (try/except) | Enables you to anticipate and manage Python exceptions with code. It's always better to be specific and explicit, so you don't disguise bugs with overly general exception catches. |
| Correlations (e.g., Pearson's) | Use statistical models like Pearson's correlation to determine whether two parts of a dataset are related. |
| Outliers and deviations | Use statistical models and tools like standard deviations or median absolute deviations to determine whether your dataset has specific outliers or data that "doesn't fit." |
| Grouping and aggregation | Group your dataset on a particular attribute and run aggregation analysis to see if there are notable differences (or similarities) across groupings. |
In the next chapter, you will learn how to use visualizations and storytelling tools to share your conclusions on the Web and in numerous other formats.
1 To take a look at some of those explorations, see the book’s repository.
2 For more reading on median absolute deviations and standard deviations, check out a great writeup by Matthew Martin on why we still use standard deviations and Stephen Gorard's academic paper on why and when to use mean deviations.