Chapter 9. Where to Go from Here

Well, we’ve come to the final chapter in the book. We’ve covered a lot of material up to this point. We’ve covered Python basics and how to parse any number of text files, CSV files, Excel files, and data in databases. We’ve learned how to select specific rows and columns from these data sources, how to aggregate and calculate basic statistics using the data, and how to write the results to output files. We’ve tackled three common business analysis applications that require us to use the skills and techniques we’ve learned in creative and useful ways. We’ve also learned how to create some of the most common statistical plots with several add-in packages and how to estimate regression and classification models with the StatsModels package. Finally, we’ve learned how to schedule our scripts to run automatically on a regular basis so we have time to work on other interesting analytical problems. If you’ve followed along with and carried out all of the examples in this book, then I hope you feel like you’ve transitioned from non-programmer to competent hacker.

At this point, you might be wondering where you go from here. That is, what else is there to learn about using Python to scale and automate data analysis? In this chapter, I’ll mention some additional capabilities of the standard Python distribution that are interesting and useful but weren’t necessary for you to learn at the very beginning. Having gone through the preceding chapters in this book, hopefully you will find these additional capabilities easier to understand and handy extensions to the techniques you’ve learned so far.

I’ll also discuss the NumPy, SciPy, and Scikit-Learn add-in packages, because they provide foundational data containers and vectorized operations, scientific and statistical distributions and tests, and statistical modeling and machine learning functions that other packages such as pandas rely on and which go beyond those in the StatsModels package. For example, Scikit-Learn provides helpful functions for preprocessing data; reducing the dimensionality of the data; estimating regression, classification, and clustering models; comparing and selecting among competing models; and performing cross-validation. These methods help you create, test, and select models that will be robust to new data so that any predictions based on the models and new data are likely to be accurate.

Lastly, I am also going to discuss some additional data structures that are helpful to learn about as you become more proficient with Python. This book focused on list, tuple, and dictionary data structures because they are powerful, fundamental data containers that will meet your needs as a beginning programmer (and may be sufficient for your entire experience with Python). However, there are other data structures, like stacks, queues, heaps, trees, graphs, and others, that you will likely prefer to use for specific purposes.

Additional Standard Library Modules and Built-in Functions

We have explored many of Python’s built-in and standard library modules and functions that facilitate reading, writing, and analyzing data in text files, CSV files, Excel files, and databases. For example, we’ve used Python’s built-in csv, datetime, re, string, and sys modules. We’ve also used some of Python’s built-in functions, such as float, len, and sum.

However, we’ve really only scratched the surface of all of the modules and functions in Python’s standard library. In fact, there are some additional modules and functions I want to mention here because they are useful for data processing and analysis. These modules and functions didn’t make it into earlier chapters because either they didn’t fit into a specific example or they are advanced options, but it’s helpful to know that these modules and functions are available in case they can help with your specific analysis task. If you want to set yourself a challenge, try to learn at least one new skill from this list every day or every other day.

Python Standard Library (PSL): A Few More Standard Modules

collections (PSL 8.3.)
This module implements specialized container data types as alternatives to Python’s other built-in containers: dict, list, set, and tuple. Some of the containers that tend to be used in data analyses are deque, Counter, defaultdict, and OrderedDict.
random (PSL 9.3.)
This module implements pseudorandom number generators for various distributions. There are functions for selecting a random integer from a range; selecting a random element from a sequence; randomly permuting a sequence; randomly sampling without replacement; and selecting random values from uniform, normal (Gaussian), gamma, beta, and other distributions.
statistics (PSL 9.7.)
This module provides functions for calculating some common statistics of numeric data. There are functions for calculating measures of central location, like mean, median, and mode. There are also functions for calculating measures of spread, like variance and standard deviation.
itertools (PSL 10.1.)
This module provides a set of standardized, fast, memory-efficient iterators (i.e., generators) for several useful data algorithms. There are iterators for merging and splitting sequences, converting input values, producing new values, and filtering and grouping data.
operator (PSL 10.3.)
This module provides a set of efficient functions that correspond to intrinsic operators in Python. There are functions for performing object comparisons, logical operations, mathematical operations, and sequence operations. There are also functions for generalized attribute and item lookups.

These five additional standard modules are a small subset of all of the modules available in Python’s standard library. In fact, there are over 35 sections in the standard library, each providing a wide variety of modules and functions related to specific topics. Because all of these modules are built into Python, you can use them immediately with an import statement, like from itertools import accumulate or from statistics import mode. To learn more about these and other standard modules, peruse the Python Standard Library.
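For example, here is a minimal sketch (using a made-up list of daily sales counts) that combines a few of these modules:

from collections import Counter
from itertools import accumulate
from statistics import mean, median, mode

sales = [3, 5, 3, 8, 5, 3, 9]           # hypothetical daily sales counts
print(Counter(sales).most_common(2))     # the two most frequent values
print(list(accumulate(sales)))           # running totals
print(mean(sales), median(sales), mode(sales))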

Built-in Functions

Similar to the standard modules just discussed, there are also a few built-in functions that didn’t make it into earlier chapters but are still useful for data processing and analysis. As with the modules, it’s useful to know that these functions are available in case they can help with your specific analysis task. It’s helpful to have the following functions in your Python toolbox:

enumerate()
Wraps a sequence in an iterator that yields (index, value) tuples
filter()
Applies a function to a sequence and returns an iterator over the values for which the function returns true
zip()
Combines two or more sequences into a single iterator of tuples by pairing values by position in the sequences

These functions are built into Python, so you can use them immediately. To learn more about these and other built-in functions, peruse the Python list of standard functions. In addition, it always helps to see how people use these functions to accomplish specific analysis tasks. If you’re interested in learning how others have used these functions, perform a quick Google or Bing search like “python enumerate examples” or “python zip examples” to retrieve a set of helpful examples.
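As a starting point, here is a minimal sketch that applies all three functions to a small, made-up list of names and scores:

names = ['ann', 'bob', 'cy', 'dee']
scores = [88, 42, 67, 95]
for index, value in enumerate(scores):             # (index, value) pairs
    print(index, value)
passing = list(filter(lambda s: s >= 60, scores))  # keep values >= 60
print(passing)
pairs = list(zip(names, scores))                   # pair names with scores by position
print(pairs)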

Python Package Index (PyPI): Additional Add-in Modules

As we’ve seen, the standard Python installation comes with a tremendous amount of built-in functionality. There are modules for accessing text and CSV files, manipulating text and numbers, and calculating statistics, as well as a whole host of other capabilities that we’re not covering in this book.

However, we’ve also seen that add-in modules, like xlrd, matplotlib, MySQL-python, pandas, and statsmodels, provide additional functionality that isn’t available in Python’s standard library. In fact, there are several important, data-focused add-in modules that, once downloaded and installed, provide significant functionality for data visualization, data manipulation, statistical modeling, and machine learning. A few of these are NumPy, SciPy, Scikit-Learn, xarray (formerly xray), SKLL, NetworkX, PyMC, NLTK, and Cython.

These add-in modules, along with many others, are available for download at the Python Package Index website. In addition, Windows users who need to differentiate between 32-bit and 64-bit operating systems can find 32-bit and 64-bit versions of many of the add-in packages at the Unofficial Windows Binaries for Python Extension Packages website.

NumPy

NumPy (pronounced “Num Pie”) is a foundational Python package that provides the ndarray, a fast, efficient, multidimensional data container for (primarily) numerical data. It also provides vectorized versions of standard mathematical and statistical functions that enable you to operate on arrays without for loops. Some of the helpful functions NumPy provides include functions for reading, reshaping, aggregating, and slicing and dicing structured data (especially numerical data).

As is the case with pandas, which is built on top of NumPy, many of NumPy’s functions encapsulate and simplify techniques you’ve learned in this book. NumPy is a fundamental package that underlies many other add-in packages (and provides the powerful ndarray data structure with vectorized operations), so let’s review some of NumPy’s functionality.

Reading and writing CSV and Excel files

In Chapter 2, we discussed how to use the built-in csv module to read and write CSV files. To read a CSV file, we used a with statement to open the input file and a filereader object to read all of the rows in the file. Similarly, to write a CSV file, we used a with statement to open the output file and a filewriter object to write to the output file. In both cases, we also used a for loop to iterate through and process all of the rows in the input file.

NumPy simplifies reading and writing CSV and text files with three functions: loadtxt, genfromtxt, and savetxt. By default, the loadtxt function assumes the data in the input file consists of floating-point numbers separated by some amount of whitespace; however, you can include additional arguments in the function to override these default values.

loadtxt

Instead of the file-reading code we discussed in Chapter 2, if your dataset does not include a header row and the values are floating-point numbers separated by spaces, then you can write the following statements to load your data into a NumPy array and immediately have access to all of your data:

from numpy import loadtxt
my_ndarray = loadtxt('input_file.csv')
print(my_ndarray)

From here, you can perform a lot of data manipulations similar to the ones we’ve discussed in this book. To provide another example, imagine you have a file, people.txt, which contains the following data:

name      age  color      score
clint     32   green      15.6
john      30   blue       22.3
rachel    27   red        31.4

Notice that this dataset contains a header row and columns that are not floating-point numbers. In this case, you can use the skiprows argument to skip the header row and specify separate data types for each of the columns:

from numpy import dtype, loadtxt
person_dtype = dtype([('name', 'S10'), ('age', int), ('color', 'S6'),\
('score', float)])
people = loadtxt('people.txt', skiprows=1, dtype=person_dtype)
print(people)

By creating person_dtype, you’re creating a structured array in which the values in the name column are strings with a maximum length of 10 characters, the values in the age column are integers, the values in the color column are strings with a maximum length of 6 characters, and the values in the score column are floating-point numbers.

In this example, the columns are space-delimited, but if your data is comma-delimited you can pass delimiter=',' to the loadtxt function.
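For instance, a call along the following lines (assuming a hypothetical comma-delimited file of numbers) would load the data:

from numpy import loadtxt
my_ndarray = loadtxt('input_file.csv', delimiter=',')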

genfromtxt

The genfromtxt function attempts to simplify your life even further by automatically determining the data types in the columns. As with loadtxt, the genfromtxt function provides additional arguments you can use to facilitate reading different types of file formats and data into structured arrays.

For example, you can use the names argument to indicate that there’s a header row and you can use the converters argument to change and format the data you read in from the input file:

from numpy import genfromtxt
name_to_int = dict(rachel=1, john=2, clint=3)
color_to_int = dict(blue=1, green=2, red=3)
def convert_name(n):
    # genfromtxt may pass byte strings, so decode before the lookup
    if isinstance(n, bytes):
        n = n.decode('utf-8')
    return name_to_int.get(n, -999)
def convert_color(c):
    if isinstance(c, bytes):
        c = c.decode('utf-8')
    return color_to_int.get(c, -999)
data = genfromtxt('people.txt', dtype=float, names=True, \
converters={0: convert_name, 2: convert_color})
print(data)

In this example, I want to convert the values in the name and color columns from strings to floating-point numbers. For each column, I create a dictionary mapping the original string values to numbers. I also define two helper functions that retrieve the numeric values in the dictionaries for each name and color, or return –999 if the name or color doesn’t appear in the dictionary. (The helper functions decode byte strings first, because depending on the NumPy version and the encoding argument, genfromtxt may pass the raw values to the converters as bytes.)

In the genfromtxt function, the dtype argument indicates that all of the values in the resulting dataset will be floating-point numbers, the names argument indicates that genfromtxt should look for the column headings in the first row, and the converters argument specifies a dictionary that maps column numbers to the converter functions that will convert the data in these columns.

Convert to a NumPy array

In addition to using loadtxt and genfromtxt, you can also read data into a list of lists or list of tuples using base Python or read data into a DataFrame using pandas and then convert the object into a NumPy array.

CSV files

For example, imagine you have a CSV file, myCSVInputFile.csv, which contains the following data:

2.1,3.2,4.3
3.2,4.3,5.2
4.3,2.6,1.5

You can read this data into a list of lists using the techniques we discussed in this book and then convert the list into a NumPy array:

import csv
from numpy import array
data = []
with open('myCSVInputFile.csv', 'r') as csv_file:
    file_reader = csv.reader(csv_file)
    for row_list in file_reader:
        row_list_floats = [float(value) for value in row_list]
        data.append(row_list_floats)
data = array(data)
print(data)

Excel files

Alternatively, if you have an Excel file, you can use the pandas read_excel function to read the data into a DataFrame and then convert the object into a NumPy array:

from pandas import read_excel
from numpy import array
myDataFrame = read_excel('myExcelInputFile.xlsx')
data = array(myDataFrame)
print(data)

savetxt

NumPy provides the savetxt function for saving data to CSV and other text files. First you specify the name of the output file and then you specify the data you want to save to the file:

from numpy import savetxt
savetxt('output_file.txt', data)

By default, savetxt saves the data using scientific notation. You don’t always want scientific notation, so you can use the fmt argument to specify the format you want. You can also include the delimiter argument to specify the column delimiter:

savetxt('output_file.txt', data, fmt='%d')
savetxt('output_file.csv', data, fmt='%.2f', delimiter=',')

Also, by default savetxt doesn’t include a header row. If you want a header row in the output file, you can provide a string to the header argument. By default, savetxt includes the hash symbol (#) before the first column header to make the row a comment. You can turn off this behavior by setting the comments argument equal to the empty string:

column_headings_list = ['var1', 'var2', 'var3']
header_string = ','.join(column_headings_list)
savetxt('output_file.csv', data, fmt='%.2f', delimiter=',', \
comments='', header=header_string)

Filter rows

Once you’ve created a structured NumPy array, you can filter for specific rows using filtering conditions similar to the ones you would use in pandas. For example, assuming you’ve created a structured array named data that contains at least the columns Cost, Supplier, Quantity, and Time to Delivery, you can filter for specific rows using conditions like the following:

row_filter1 = (data['Cost'] > 110) & (data['Supplier'] == 3)
data[row_filter1]
row_filter2 = (data['Quantity'] > 55) | (data['Time to Delivery'] > 30)
data[row_filter2]

The first filtering condition filters for rows where the value in the Cost column is greater than 110 and the value in the Supplier column equals 3. Similarly, the second filtering condition filters for rows where the value in the Quantity column is greater than 55 or the value in the Time to Delivery column is greater than 30.

Select specific columns

Selecting a subset of columns in a structured array can be challenging because of data type differences between the columns in the subset. You can define a helper function to provide a view of the subset of columns and handle the subset’s data types:

import numpy as np
def columns_view(arr, fields):
    # build a dtype containing only the requested fields, keeping each
    # field's original offset, then return a view into the original array
    dtype2 = np.dtype({name: arr.dtype.fields[name] for name in fields})
    return np.ndarray(arr.shape, dtype2, arr, 0, arr.strides)

Then you can use the helper function to view a subset of columns from the structured array. You can also specify row-filtering conditions to filter for specific rows and select specific columns at the same time, similar to using the loc indexer in pandas:

supplies_view = columns_view(supplies, ['Supplier', 'Cost'])
print(supplies_view)
row_filter = supplies['Cost'] > 1000
supplies_row_column_filters = columns_view(supplies[row_filter],\
['Supplier', 'Cost'])
print(supplies_row_column_filters)

Concatenate data

NumPy simplifies the process of concatenating data from multiple arrays with its concatenate, vstack, and hstack functions and its r_ and c_ index objects. The concatenate function is the most general of these. It takes a list of arrays and concatenates them together according to an additional axis argument, which specifies whether the arrays should be concatenated vertically (axis=0) or horizontally (axis=1). vstack and r_ are specifically for concatenating arrays vertically, and hstack and c_ are specifically for concatenating arrays horizontally. For example, here are three ways to concatenate arrays vertically:

import numpy as np
array_concat = np.concatenate([array1, array2], axis=0)
array_concat = np.vstack((array1, array2))
array_concat = np.r_[array1, array2]

These three approaches produce the same result. In each case, the arrays are concatenated vertically, on top of one another. If you assign the result to a new variable as I do here, then you have a new, larger array that contains all of the data from the input arrays.

Similarly, here are three ways to concatenate arrays horizontally:

import numpy as np
array_concat = np.concatenate([array1, array2], axis=1)
array_concat = np.hstack((array1, array2))
array_concat = np.c_[array1, array2]

Again, these three approaches produce the same result. In each case, the arrays are concatenated horizontally, side by side.

Additional features

This section has presented some of NumPy’s features and capabilities, but there are many more that you should check out. One important difference between NumPy and base Python is that NumPy enables vectorized operations, which means you can apply operations to entire arrays element by element without needing to use a for loop.

For example, if you have two arrays, array1 and array2, and you need to add them together element by element you can simply write array_sum = array1 + array2. This operation adds the two arrays element by element, so the result is an array where the value in each position is the sum of the values in the same position in the two input arrays. Moreover, the vectorized operations are executed in C code, so they are carried out very quickly.
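Here is a minimal sketch of vectorized arithmetic on two small, made-up arrays:

import numpy as np
array1 = np.array([1.0, 2.0, 3.0])
array2 = np.array([10.0, 20.0, 30.0])
array_sum = array1 + array2        # element-by-element addition, no for loop
print(array_sum)                   # [11. 22. 33.]
array_product = array1 * array2    # element-by-element multiplication
print(array_product)               # [10. 40. 90.]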

Another helpful feature of NumPy is a collection of statistical calculation methods that operate on arrays. Some of the statistical calculations are sum, prod, amin, amax, mean, var, std, argmin, and argmax. sum and prod calculate the sum and product of the values in an array. amin and amax identify the minimum and maximum values in an array. mean, var, and std calculate the mean, variance, and standard deviation of the values in an array. argmin and argmax find the index position of the minimum and maximum values in an array. All of the functions accept the axis argument, so you can specify whether you want the calculation down a column (axis=0) or across a row (axis=1).
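For example, on a small, made-up two-dimensional array, the column-wise and row-wise calculations look like this:

import numpy as np
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
print(data.sum())           # 21.0, sum of all values
print(data.mean(axis=0))    # [2.5 3.5 4.5], mean down each column
print(data.std(axis=1))     # standard deviation across each row
print(data.argmax(axis=0))  # [1 1 1], row index of each column's maximum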

For more information about NumPy, and to download it, visit the NumPy website.

SciPy

SciPy (pronounced “Sigh Pie”) is another foundational Python package that provides scientific and statistical distributions, functions, and tests for mathematics, science, and engineering. SciPy has a broad scope, so its functionality is organized into different subpackages. Some of the subpackages are:

cluster
Provides clustering algorithms
constants
Provides physical and mathematical constants
interpolate
Provides functions for interpolation and smoothing splines
io
Provides input/output functions
linalg
Provides linear algebra operations
sparse
Provides operations for sparse matrices
spatial
Provides spatial data structures and algorithms
stats
Provides statistical distributions and functions
weave
Provides C/C++ integration (deprecated and removed from recent SciPy releases; Cython is the usual replacement)

As you can see from this list, SciPy’s subpackages provide functionality for a diverse range of operations and calculations. For example, the linalg package provides functions for performing very fast linear algebra operations on two-dimensional arrays; the interpolate package provides functions for linear and curvilinear interpolation between data points; and the stats package provides functions for working with random variables, calculating descriptive and test statistics, and conducting regression.

SciPy is a fundamental package that underlies many other add-in packages (in addition to providing a variety of useful mathematical and statistical functions), so let’s review some of SciPy’s functionality.

linalg

The linalg package provides functions for all of the basic linear algebra routines, including finding inverses, finding determinants, and computing norms. It also has functions for matrix decompositions and exponential, logarithm, and trigonometric functions. Some other useful functions enable you to quickly solve linear systems of equations and linear least-squares problems.
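As a quick sketch, here are the inverse, determinant, and norm of a small, made-up matrix:

from numpy import array
from scipy import linalg
A = array([[1.0, 2.0], [3.0, 4.0]])
print(linalg.inv(A))    # matrix inverse
print(linalg.det(A))    # determinant: -2.0
print(linalg.norm(A))   # Frobenius norm by default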

Linear systems of equations

SciPy provides the linalg.solve function for computing the solution vector of a linear system of equations. Suppose we need to solve the following system of simultaneous equations:

  • x + 2y + 3z = 3

  • 2x + 3y + z = –10

  • 5x – y + 2z = 14

We can represent these equations with a coefficient matrix, a vector of unknowns, and a righthand-side vector. The linalg.solve function takes the coefficient matrix and the righthand-side vector and solves for the unknowns (i.e., x, y, and z):

from numpy import array
from scipy import linalg
A = array([[1,2,3], [2,3,1], [5,-1,2]])
b = array([[3], [-10], [14]])
solution = linalg.solve(A, b)
print(solution)

The values for x, y, and z that solve the system of equations are: 0.1667, –4.8333, and 4.1667.

Least-squares regression

SciPy provides the linalg.lstsq function for computing the solution vector of a linear least-squares problem. In econometrics, it’s common to see a linear least-squares estimated model expressed in matrix notation as:

  • y = Xb + e

where y is a vector of values for the dependent variable, X is a matrix of values for the independent variables, b is the vector of coefficients to be estimated, and e is a vector of residuals computed from the data. The linalg.lstsq function takes the matrix of independent variable values, X, and the dependent variable, y, and solves for the coefficient vector, b:

import numpy as np
from scipy import linalg
c1, c2 = 6.0, 3.0
i = np.r_[1:21]
xi = 0.1*i
yi = c1*np.exp(-xi) + c2*xi
zi = yi + 0.05 * np.max(yi) * np.random.randn(len(yi))
A = np.c_[np.exp(-xi)[:, np.newaxis], xi[:, np.newaxis]]
c, resid, rank, sigma = linalg.lstsq(A, zi)
print(c)

Here, c1, c2, i, and xi simply serve to construct yi, the initial formulation of the dependent variable. However, the next line constructs zi, the variable that actually serves as the dependent variable, by adding some random disturbances to the yi values. The lstsq function returns the solution (c), the residuals (resid), the rank of the matrix (rank), and its singular values (sigma). The two c values that solve this least-squares problem are approximately 5.92 and 3.07 (your exact values will differ slightly because of the random noise).

interpolate

The interpolate package provides functions for linear and curvilinear interpolation between known data points. The function for univariate data is named interp1d, and the function for multivariate data is named griddata. The package also provides functions for spline interpolation and radial basis functions for smoothing and interpolation. The interp1d function takes two arrays and returns a function that uses interpolation to find the values of new points:

from numpy import arange, exp
from scipy import interpolate
import matplotlib.pyplot as plt
x = arange(0, 20)
y = exp(-x/4.5)
interpolation_function = interpolate.interp1d(x, y)
new_x = arange(0, 19, 0.1)
new_y = interpolation_function(new_x)
plt.plot(x, y, 'o', new_x, new_y, '-')
plt.show()

The dots in the plot are the 20 original data points, and the line connects the interpolated values of new points between the original data points. Because I didn’t specify the kind argument in the interp1d function, it used the default linear interpolation to find the values. However, you can also pass 'quadratic', 'cubic', or one of a handful of other string or integer values for kind to choose the type of interpolation it should perform.
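For instance, reusing the x and y arrays from the previous example, you could request cubic interpolation instead:

interpolation_function = interpolate.interp1d(x, y, kind='cubic')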

stats

The stats package provides functions for generating values from specific distributions, calculating descriptive statistics, performing statistical tests, and conducting regression analysis. The package offers over eighty continuous random variables and ten discrete random variables. It has tests for analyzing one sample and tests for comparing two samples. It also has functions for kernel density estimation, or estimating the probability density function of a random variable from a set of data.
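As a small sketch (using randomly generated samples), here is a two-sample t-test and a kernel density estimate:

from scipy import stats
sample1 = stats.norm.rvs(loc=5, scale=2, size=100)
sample2 = stats.norm.rvs(loc=6, scale=2, size=100)
t_statistic, p_value = stats.ttest_ind(sample1, sample2)  # two-sample t-test
print(t_statistic, p_value)
kde = stats.gaussian_kde(sample1)      # kernel density estimate
print(kde.evaluate([4.0, 5.0, 6.0]))   # estimated density at a few points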

Descriptive statistics

The stats package provides several functions for calculating descriptive statistics:

from scipy.stats import norm, describe
x = norm.rvs(loc=5, scale=2, size=1000)
print(x.mean())
print(x.min())
print(x.max())
print(x.var())
print(x.std())
x_nobs, (x_min, x_max), x_mean, x_variance, x_skewness, x_kurtosis = describe(x)
print(x_nobs)

In this example, I create an array, x, of 1,000 values drawn from a normal distribution with mean equal to five and standard deviation equal to two. The array’s mean, min, max, var, and std methods compute the mean, minimum, maximum, variance, and standard deviation of x, respectively. Similarly, the describe function returns the number of observations, the minimum and maximum values, the mean and variance, and the skewness and kurtosis.

Linear regression

The stats package simplifies the process of estimating the slope and intercept in a linear regression. In addition to the slope and intercept, the linregress function also returns the correlation coefficient, the two-sided p-value for a null hypothesis that the slope is zero, and the standard error of the estimate:

from numpy.random import random
from scipy import stats
x = random(20)
y = random(20)
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("R-squared:", round(r_value**2, 4))

In this example, the print statement squares the correlation coefficient to display the R-squared value.

The preceding examples only scratched the surface of the subpackages and functions that are available in the SciPy package. For more information about SciPy, and to download it, visit the SciPy website.

Scikit-Learn

The Scikit-Learn add-in module provides functions for estimating statistical machine learning models, including regression, classification, and clustering models, as well as data preprocessing, dimensionality reduction, and model selection. Scikit-Learn’s functions handle both supervised models, where the dependent variable’s values or class labels are available, and unsupervised models, where the values or class labels are not available. One of the features of Scikit-Learn that distinguishes it from StatsModels is a set of functions for conducting different types of cross-validation (i.e., testing a model’s performance on data that was not used to fit the model).

Testing a model’s performance on the same data that was used to fit the model is a methodological mistake, because it is possible to create models that reproduce the dependent variable’s values or class labels perfectly on the data used to fit the model. These models might appear to have excellent performance based on their results with the data at hand, but they’re actually overfitting the model data and would not perform well or provide useful predictions on new data.

To avoid overfitting and estimate a model that will tend to have good performance on new data, it is common to split a dataset into two pieces, a training set and a test set. The training set is used to formulate and fit the model, and the test set is used to evaluate the model’s performance. Because the data used to fit the model is different than the data used to evaluate the model’s performance, the chances of overfitting are reduced. This process of repeatedly splitting a dataset into two pieces—training a model on the training set and testing the model on the test set—is called cross-validation.
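In Scikit-Learn, a minimal sketch of that split might look like the following (X and y stand in for arrays of explanatory variables and dependent-variable values you’ve already created, and the 25 percent test fraction is illustrative):

from sklearn.model_selection import train_test_split
# hold out 25% of the rows as a test set; the rest is used to fit the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123)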

There are many different methods of cross-validation, but one basic method is called k-fold cross-validation. In k-fold cross-validation, the original dataset is split into a training set and a test set and then the training set is again split into k pieces, or “folds” (e.g., five or ten folds). Then, for each of the k folds, k – 1 of the folds are used as training data to fit the model and the remaining fold is used to evaluate the model’s performance. In this way, cross-validation creates several performance values, one for each fold, and the final performance measure for the training set is the average of the values calculated for each fold. Finally, the cross-validated model is run on the test data to calculate the overall performance measure for the model.

To see how straightforward it is to formulate statistical learning models in Scikit-Learn, let’s specify a random forest model with cross-validation. If you are not familiar with random forest models, check out the Wikipedia entry for an overview or, for a more in-depth treatment, see The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman or Applied Predictive Modeling by Max Kuhn and Kjell Johnson (both from Springer), which are excellent resources on the topic. You can formulate a random forest model with cross-validation and evaluate the model’s performance with a few lines of code in Scikit-Learn:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier as RF

y = data_frame['Purchased?']
y_pred = y.copy()
feature_space = data_frame[numeric_columns]
X = feature_space.to_numpy().astype(float)
scaler = StandardScaler()
X = scaler.fit_transform(X)

kf = KFold(n_splits=5, shuffle=True, random_state=123)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train = y.iloc[train_index]
    clf = RF()
    clf.fit(X_train, y_train)
    y_pred.iloc[test_index] = clf.predict(X_test)

accuracy = np.mean(y == y_pred)
print("Random forest: %.3f" % accuracy)

The first five lines of code import NumPy, pandas, and three components of Scikit-Learn. In order, the three components enable you to center and scale the explanatory variables, carry out k-fold cross-validation, and use the random forest classifier.

The next block of code handles specifying the dependent variable, y; creating the matrix of explanatory variables, X; and centering and scaling the explanatory variables. This block assumes that you’ve already created a pandas DataFrame called data_frame and the data for the dependent variable is in a column called Purchased?. The copy function makes a copy of the dependent variable, assigning it into y_pred, which will be used to evaluate the model’s performance. The next line assumes you’ve created a list of the numeric variables in data_frame so you can select them as your set of explanatory variables. The following line converts the feature set into a NumPy array of floating-point numbers called X. The last two lines in the block use Scikit-Learn functions to create and use a scaler object to center and scale the explanatory variables.

The next block of code implements k-fold cross-validation with the random forest classifier. The first line creates a KFold object that splits the dataset into five different pairs, or folds, of training and test indices. The next line is a for loop for iterating through each of the folds produced by kf.split(X). Within the for loop, for each fold, we assign the training and test sets of explanatory variables, assign the training set for the dependent variable, initialize the random forest classifier, fit the random forest model with the training data, and then use the model and test data to estimate predicted values for the dependent variable.

The final block of code calculates and reports the model’s accuracy. The first line uses NumPy’s mean function to calculate the proportion of predicted values for the dependent variable that equal the actual, original data values. The comparison in parentheses tests whether each predicted value equals the corresponding original value, producing a series of True and False values that the mean function averages as 1s and 0s. If all of the predicted values match the original data values, then the average will be 1. If none of the predicted values match the original values, then the average will be 0. Therefore, we want the cross-validated random forest classifier to produce an average value that is close to 1. The final line prints the model’s accuracy, formatted to three decimal places, to the screen.

This example illustrated how to carry out cross-validation with a random forest classifier model in Scikit-Learn. Scikit-Learn enables you to specify many more regression and classification models than were presented in this section. For example, to implement a support vector machine, all you would need to do is add the following import statement (and change the classifier from clf = RF() to clf = SVC()):

from sklearn.svm import SVC

In addition to other models, Scikit-Learn also has functions for data pre-processing, dimensionality reduction, and model selection.

To learn more about Scikit-Learn and how to estimate other models and use other cross-validation methods in Scikit-Learn, check out the Scikit-Learn documentation.

A Few Additional Add-in Packages

In addition to NumPy, SciPy, and Scikit-Learn, there are a few additional add-in packages that you may want to look into, depending on the type of data analysis you need to do. This list represents a tiny fraction of the thousands of add-in Python packages on the Python Package Index and is simply intended as a suggestion of some packages that you might find intriguing and useful:

xarray
Provides a pandas-like toolkit for analysis on multidimensional arrays
SKLL
Provides command-line utilities for running common Scikit-Learn operations
NetworkX
Provides functions for creating, growing, and analyzing complex networks
PyMC
Provides functions for implementing Bayesian statistics and MCMC
NLTK
Provides text processing and analysis libraries for human language data
Cython
Provides an interface for calling and generating fast C code in Python

These packages do not come preinstalled with Python. You have to download and install them separately. To do so, visit the Python Package Index website or the Unofficial Windows Binaries for Python Extension Packages website.

Additional Data Structures

As you move on from this book and start to solve various business data processing and analysis tasks with Python, it will become increasingly important for you to become familiar with some additional data structures. By learning about these concepts, you’ll expand your toolkit to include a broader understanding of the various ways it is possible to implement a solution and be able to evaluate the trade-offs between different options. You’ll also become savvy about what data structures to use in a specific circumstance to store, process, or analyze your data more quickly and efficiently.

Additional data structures that are helpful to know about include stacks, queues, graphs, and trees. In certain circumstances, these data structures will store and retrieve your data more efficiently and with better memory utilization than lists, tuples, or dictionaries.

Stacks

A stack is an ordered collection of items where you add an item to and remove an item from the same end of the stack. You can only add or remove one item at a time. The end where you add and remove items is called the top. The opposite end is called the base. Given a stack’s ordering principle, items near the top have been in the stack for less time than items near the base. In addition, the order in which you remove items from the stack is opposite to the order in which you add them. This property is called LIFO (last in, first out).

Consider a stack of trays in a cafeteria. To create the stack, you place a tray on the counter, then you place another tray on top of the first tray, and so on. To shrink the stack, you take a tray from the top of the stack. You can add or remove a tray at any time, but it must always be added to or removed from the top.

There are lots of data processing and analysis situations in which it’s helpful to use stacks. People implement stacks to allocate and access computer memory, to store command-line and function arguments, to parse expressions, to reverse data items, and to store and backtrack through URLs.
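In Python, an ordinary list already behaves like a stack: append pushes an item onto the top and pop removes the most recently added item. Here is a minimal sketch:

stack = []                 # empty stack
stack.append('tray1')      # push onto the top
stack.append('tray2')
stack.append('tray3')
print(stack.pop())         # 'tray3', the last item in is the first out
print(stack.pop())         # 'tray2'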

Queues

A queue is an ordered collection of items where you add items to one end of the queue and remove items from the other end of the queue. In a queue, you add items to the rear and they make their way to the front, where they are removed. Given a queue’s ordering principle, items near the rear have been in the queue for less time than items near the front. This property is called FIFO (first in, first out).

Consider any well-maintained queue, or line, you’ve ever waited in. Regardless of whether you’re at a theme park, a movie theater, or a grocery store, you enter the queue at the back and then wait until you make your way to the front of the queue, where you receive your ticket or service; then you leave the queue.

There are lots of data processing and analysis situations in which it’s helpful to use queues. People implement queues to process print jobs on a printer, to hold computer processes waiting for resources, and to optimize queues and network flows.
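The collections module mentioned earlier in this chapter provides deque, which works well as a queue: append adds items at the rear and popleft removes them from the front. Here is a minimal sketch:

from collections import deque
queue = deque()
queue.append('customer1')   # join the queue at the rear
queue.append('customer2')
queue.append('customer3')
print(queue.popleft())      # 'customer1', the first item in is the first out
print(queue.popleft())      # 'customer2'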

Graphs

A graph is a set of nodes (a.k.a. vertices) and edges, which connect pairs of nodes. The edges can be directed, representing a direction between two nodes, or undirected, representing a connection between two nodes with no particular direction. Edges can also have weights that represent some relationship between the two nodes, depending on the context of the problem.

Consider any graphical representation of the relationship between people, places, or topics. For example, imagine a collection of actors, directors, and movies as nodes on a canvas and edges between the nodes indicating who acted in or directed the movies. Alternatively, imagine cities as nodes on a canvas and the edges between the nodes indicating the paths to get from one city to the next. The edges can be weighted to indicate the distance between two cities.

There are lots of data processing and analysis situations in which it’s helpful to use graphs. People implement graphs to represent relationships among suppliers, customers, and products; to represent relationships between entities; to represent maps; and to represent capacity and demand for different resources.
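One simple way to sketch a graph in plain Python is a dictionary that maps each node to its neighbors; storing the neighbors in a nested dictionary lets you attach edge weights. The cities and distances below are only illustrative:

# adjacency list with edge weights (distances) stored in nested dictionaries
road_map = {
    'Austin': {'Dallas': 195, 'Houston': 162},
    'Dallas': {'Austin': 195, 'Houston': 239},
    'Houston': {'Austin': 162, 'Dallas': 239},
}
print(road_map['Austin'])              # neighbors of Austin and distances
print(road_map['Austin']['Houston'])   # weight of the Austin-Houston edge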

Trees

A tree is a specific type of graph data structure consisting of a hierarchical set of nodes and edges. In a tree, there is a single topmost node, which is designated as the root node. The root node may have any number of child nodes. Each of these nodes may also have any number of child nodes. A child node may only have one parent node. If each node has a maximum of two child nodes, then the tree is called a binary tree.

Consider elements in HTML. Within the html tags there are head and body tags. Within the head tag, there may be meta and title tags. Within the body tag, there may be h1, div, form, and ul tags. Viewing these tags in a tree structure, you see the html tag as the root node with two child nodes, the head and body tags. Under the node for the head tag, you see meta and title tags. Under the node for the body tag, you see h1, div, form, and ul tags.

There are lots of data processing and analysis situations in which it’s helpful to use trees. People implement trees to create computer filesystems, to manage hierarchical data, to make information easy to search, and to represent the phrase structure of sentences.
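One minimal way to sketch a tree in plain Python is a small node class in which each node keeps a list of its children:

class TreeNode:
    def __init__(self, value):
        self.value = value
        self.children = []       # any number of child nodes
    def add_child(self, node):
        self.children.append(node)

# a tiny HTML-like tree: html is the root with head and body as children
html = TreeNode('html')
head, body = TreeNode('head'), TreeNode('body')
html.add_child(head)
html.add_child(body)
head.add_child(TreeNode('title'))
print([child.value for child in html.children])   # ['head', 'body']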

This section presented a very brief introduction to some classic data structures. A nice resource for learning more about these data structures in Python is Brad Miller and David Ranum’s online book, Problem Solving with Algorithms and Data Structures Using Python.

It’s helpful to know that these data containers exist so you have the option to use them when your initial implementation isn’t performing well. By knowing about these data structures and understanding when to use one type versus another, you’ll be able to solve a variety of large, difficult problems and improve the processing time and memory utilization of many of your scripts.

Where to Go from Here

If when you started reading this book you had never programmed before, then you’ve picked up a lot of foundational programming experience as you followed along with the examples. We started by downloading Python, writing a basic Python script in a text editor, and figuring out how to run the script on Windows and macOS. After that, we developed a first script to explore many of Python’s basic data types, data containers or structures, control flow, and how to read and write text files. From there, we learned how to parse specific rows and columns in CSV files; how to parse specific worksheets, rows, and columns in Excel files; and how to load data, modify data, and write out data in databases. In Chapter 3 and Chapter 4, we downloaded MySQL and some extra Python add-in modules. Once we had all of that experience under our belts, in Chapter 5 we applied and extended our new programming skills to tackle three real-world applications. Then, in Chapter 6 and Chapter 7, we transitioned from data processing to data visualization and statistical analysis. Finally, in Chapter 8, we learned how to automate our scripts so they run on a routine basis without us needing to run them manually from the command line. Having arrived at the end of the book, you may be thinking, “Where should I go from here?”

At this point in my own training, I received some valuable advice: “Identify an important or interesting specific problem/task that you think could be improved with Python and work on it until you accomplish what you set out to do.” You want to choose an important or interesting problem so you’re excited about the project and invested in accomplishing your goal. You’re going to hit stumbling blocks, wrong turns, and dead ends along the way, so the project has to be important enough to you that you persevere through the difficult patches and keep writing, debugging, and editing until your code works. You also want to select a specific problem or task so that what you need your code to do is clearly defined. For example, your situation may be that you have too many files to process manually, so you need to figure out how to process them with Python. Or perhaps you’re responsible for a specific data processing or analysis task and you think the task could be automated and made more efficient and consistent with Python. Once you have a specific problem or task in mind, it’s easier to think about how to break it down into the individual operations that need to happen to accomplish your goal.

Once you’ve chosen a specific problem or task and outlined the operations that need to happen, you’re in a really good position. It’s easier to figure out how to accomplish one particular operation than it is to envision how to accomplish the whole task at once. The quote I’m thinking of here is, “How do you devour a whale? One bite at a time.” The nice thing about tackling one particular operation at a time is that, for each operation, it’s highly likely that someone else has already tackled that problem, figured it out, and shared his or her code online or in a book.

The Internet is your friend, especially when it comes to code. We’ve already covered how to read CSV and Excel files in this book, but what if you need to read a different type of file, such as a JSON or HTML file? Open a browser and enter something like “python read json file examples” in the search bar to see how other people have read JSON files in Python. The same advice goes for all of the other operations you’ve outlined for your problem or task. In addition to online resources, which are very helpful once you’ve narrowed down to a specific operation, there are also many books and training materials on Python that contain helpful code snippets and examples. You can find many free PDF versions of Python books online, and many are also available through your local and county libraries. My point is that you don’t have to reinvent the wheel. For each small operation in your overall problem or task, use whatever code you can from this book, search online and in other resources to see how others have tackled the operation, and then edit and debug until you get it working. What you’ll end up with, after you’ve tackled each of the individual operations, is a Python script that solves your specific problem or task. And that’s the exciting moment you’re working toward: that moment when you press a button and the code you’ve labored over for days or weeks to get working carries out your instructions and solves your problem or task for you—the feeling is exhilarating and empowering. Once you realize that you can efficiently accomplish tasks that would be tedious, time consuming, error prone, or impossible to do manually, you’ll feel a rush of excitement and be looking for more problems and tasks to solve with Python. That’s what I hope for you—that you go on from here and work on a problem or task that’s important to you until your code works and you accomplish what you set out to do.