Chapter 10. Data Manipulation

This chapter summarizes the popular Python libraries related to data manipulation: numeric, text, images, and audio. Almost all of the libraries described here serve a unique purpose, so this chapter’s goal is to describe these libraries, not compare them. Unless noted, all of them can be installed directly from PyPI using pip:

$ pip install library

Table 10-1 briefly describes these libraries.

Table 10-1. Data tools
Python library (license): reason to use

IPython (Apache 2.0 license): Provides an enhanced Python interpreter, with input history, an integrated debugger, and in-terminal graphics and plots (with the Qt-enabled version).

NumPy (BSD 3-clause license): Provides multidimensional arrays and linear algebra tools, optimized for speed.

SciPy (BSD license): Provides functions and utilities related to engineering and science, from linear algebra to signal processing, integration, root finding, statistical distributions, and other topics.

Matplotlib (BSD license): Provides scientific plotting.

Pandas (BSD license): Provides Series and DataFrame objects that can be sorted, merged, grouped, aggregated, indexed, windowed, and subset, a lot like an R data frame or the results of a SQL query.

Scikit-Learn (BSD 3-clause license): Provides machine learning algorithms, including dimensionality reduction, classification, regression, clustering, model selection, imputation of missing data, and preprocessing.

Rpy2 (GPLv2 license): Provides an interface to R that allows execution of R functions from within Python and passing of data between the two environments.

SymPy (BSD license): Provides symbolic mathematics, including series expansions, limits, and calculus, aiming to be a full computer algebra system.

nltk (Apache license): Provides a comprehensive natural language toolkit, with models and training data in multiple languages.

Pillow / PIL (standard PIL license, MIT-like): Provides support for a huge number of file formats, plus some simple image filtering and other processing.

cv2 (Apache 2.0 license): Provides computer vision routines suitable for real-time analysis of video, including already-trained face and person detection algorithms.

Scikit-Image (BSD license): Provides image processing routines: filtering, adjustment, color separation; edge, blob, and corner detection; segmentation; and more.

Nearly all of the libraries described in Table 10-1 and detailed in the rest of this chapter depend on compiled C libraries, usually via SciPy or its core dependency, NumPy. This means you may have trouble installing them on a Windows system. If you primarily use Python for analyzing scientific data and aren’t already familiar with compiling C and FORTRAN code on Windows, we recommend using Anaconda or one of the other options discussed in “Commercial Python Redistributions”. Otherwise, always try pip install first, and if that fails, look at the SciPy installation guide.

Scientific Applications

Python is frequently used for high-performance scientific applications. It is widely used in academia and scientific projects because it is easy to write and performs well.

Because scientific computing demands high performance, Python code in this domain typically relies on external libraries, conventionally written in faster languages (like C, or FORTRAN for matrix operations). The main libraries used are all part of the “SciPy Stack”: NumPy, SciPy, SymPy, Pandas, Matplotlib, and IPython. Going into detail about these libraries is beyond the scope of this book; however, a comprehensive introduction to the scientific Python ecosystem can be found in the Python Scientific Lecture Notes.

IPython

IPython is an enhanced version of the Python interpreter, with a color interface, more detailed error messages, and an inline mode that allows graphics and plots to be displayed in the terminal (in the Qt-based version). It is the default kernel for Jupyter notebooks (discussed in “Jupyter Notebooks”) and the default interpreter in the Spyder IDE (discussed in “Spyder”). IPython comes installed with Anaconda, which we described in “Commercial Python Redistributions”.

NumPy

NumPy is part of the SciPy project but is released as a separate library so that people who only need the basic requirements can use it without installing the rest of SciPy. NumPy gets around Python’s slow execution of numerical algorithms by providing multidimensional arrays and functions that operate on whole arrays at once. Any algorithm expressed as operations on arrays then runs at the speed of the backend: the Automatically Tuned Linear Algebra Software (ATLAS) library,1 and other low-level libraries written in C and FORTRAN. NumPy is compatible with Python versions 2.6+ and 3.2+.

Here is an example of matrix multiplication, using array.dot(), and of “broadcasting,” which is element-wise multiplication in which the row or column is repeated across the missing dimension:

>>> import numpy as np
>>>
>>> x = np.array([[1,2,3],[4,5,6]])
>>> x
array([[1, 2, 3],
       [4, 5, 6]])
>>>
>>> x.dot([2,2,1])
array([ 9, 24])
>>>
>>> x * [[1], [0]]
array([[1, 2, 3],
       [0, 0, 0]])

SciPy

SciPy builds on NumPy to provide a larger collection of mathematical and scientific functions. It uses NumPy arrays as its basic data structure and comes with modules for various commonly used tasks in scientific programming, including linear algebra, calculus, special functions and constants, and signal processing.

Here’s an example from SciPy’s set of physical constants:

>>> import scipy.constants
>>> fahrenheit = 212
>>> scipy.constants.F2C(fahrenheit)
100.0
>>> scipy.constants.physical_constants['electron mass']
(9.10938356e-31, 'kg', 1.1e-38)
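
And here is a small sketch of our own, not taken from the SciPy docs, of the integration and root-finding routines mentioned above; the function and the bracketing interval are arbitrary choices for illustration:

import numpy as np
from scipy import integrate, optimize

# Numerically integrate exp(-x**2) from 0 to infinity; the exact answer is sqrt(pi)/2
area, error_estimate = integrate.quad(lambda x: np.exp(-x**2), 0, np.inf)
print(area)  # about 0.886227

# Find the root of cos(x) = x in the bracketing interval [0, 2]
root = optimize.brentq(lambda x: np.cos(x) - x, 0, 2)
print(root)  # about 0.739085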

Matplotlib

Matplotlib is a flexible plotting library for creating interactive 2D and 3D plots that can also be saved as manuscript-quality figures. The API in many ways mirrors that of MATLAB, easing the transition of MATLAB users to Python. Many examples, along with the source code to re-create them, are available in the Matplotlib gallery.
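
To give a feel for the API, here is a minimal sketch of our own (not from the gallery) that plots a sine curve and saves it to a file; the filename and figure settings are arbitrary:

import numpy as np
import matplotlib.pyplot as plt

# Plot one period of a sine curve
x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x), label='sin(x)')
plt.xlabel('x')
plt.legend()

# Save a figure file, or call plt.show() for an interactive window
plt.savefig('sine.png', dpi=150)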

Those who work with statistics should also look at Seaborn, a newer graphics library specifically for statistics visualization that is growing in popularity. It is featured in this blog post about getting started in data science.

For web-capable plots, try Bokeh, which uses its own visualization libraries, or Plotly, which is based on the JavaScript library D3.js, although the free version of Plotly may require storing your plots on their server.

Pandas

Pandas (the name is derived from “panel data”) is a data manipulation library based on NumPy that provides many useful functions for easily accessing, indexing, merging, and grouping data. Its main data structure, the DataFrame, is close to what you would find in the R statistical software environment (that is, heterogeneous data tables, with strings in some columns and numbers in others, offering name indexing, time series operations, and auto-alignment of data). It can also be operated on like a SQL table or an Excel pivot table, using methods like groupby() or functions like pandas.rolling_mean().
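
Here is a tiny sketch of that grouping and windowing; the data and column names are invented, and in newer versions of Pandas, Series.rolling().mean() replaces pandas.rolling_mean():

import pandas as pd

df = pd.DataFrame({
    'city':  ['Paris', 'Paris', 'Lyon', 'Lyon'],
    'year':  [2014, 2015, 2014, 2015],
    'sales': [100, 110, 50, 60],
})

# Aggregate like a SQL GROUP BY: total sales per city
print(df.groupby('city')['sales'].sum())

# A rolling (windowed) mean over the sales column
print(df['sales'].rolling(window=2).mean())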

Scikit-Learn

Scikit-Learn is a machine learning library that provides dimension reduction, missing data imputation, regression and classification models, tree models, clustering, automatic model parameter tuning, plotting (via matplotlib), and more. It is well documented and comes with tons of examples. Scikit-Learn operates on NumPy arrays but can usually interface with Pandas data frames without much trouble.
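
As a rough sketch of the usual fit-then-score workflow (assuming a recent version of Scikit-Learn; the bundled iris dataset and LogisticRegression are just one possible choice):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small bundled dataset as NumPy arrays
X, y = load_iris(return_X_y=True)

# Hold out a test set, fit a model, and report its accuracy on the held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))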

Rpy2

Rpy2 is a Python binding for the R statistical package allowing the execution of R functions from Python and passing data back and forth between the two environments. Rpy2 is the object-oriented implementation of the Rpy bindings.
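
Here is a minimal sketch, assuming R itself is installed alongside Rpy2, of evaluating R code and calling an R function from Python:

import rpy2.robjects as robjects

# Evaluate a snippet of R code; the result comes back as an R vector
total = robjects.r('sum(c(1, 2, 3))')
print(total[0])  # 6.0

# Look up an R function by name and call it with Python arguments
rnorm = robjects.r['rnorm']
samples = rnorm(5, mean=0, sd=1)
print(list(samples))  # five draws from a standard normal distribution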

decimal, fractions, and numbers

Python defines a framework of abstract base classes for numeric types, from Number, the root of the hierarchy, down through Complex, Real, Rational, and Integral. Developers can subclass these to develop other numeric types according to the instructions in the numbers library.2 There is also a decimal.Decimal class that is aware of numerical precision, for accounting and other precision-critical tasks. The type hierarchy works as expected:

>>> import decimal
>>> import fractions
>>> from numbers import Complex, Real, Rational, Integral
>>>
>>> d = decimal.Decimal(1.11, decimal.Context(prec=5))  # precision
>>>
>>> for x in (3, fractions.Fraction(2,3), 2.7, complex(1,2), d):
...     print('{:>10}'.format(str(x)[:8]),
...           [isinstance(x, y) for y in (Complex, Real, Rational, Integral)])
...
         3 [True, True, True, True]
       2/3 [True, True, True, False]
       2.7 [True, True, False, False]
    (1+2j) [True, False, False, False]
  1.110000 [False, False, False, False]

The exponential, trigonometric, and other common functions are in the math library, and corresponding functions for complex numbers are in cmath. The random library provides pseudorandom numbers using the Mersenne Twister as its core generator. As of Python 3.4, the statistics module in the Standard Library provides the mean and median, as well as the sample and population standard deviation and variance.
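
Here is a small taste of decimal, fractions, and statistics in action; the values are chosen only for illustration:

>>> import decimal, fractions, statistics
>>>
>>> decimal.Decimal('0.10') + decimal.Decimal('0.20')    # exact decimal arithmetic
Decimal('0.30')
>>> fractions.Fraction(1, 3) + fractions.Fraction(1, 6)  # exact rational arithmetic
Fraction(1, 2)
>>> statistics.mean([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
2.875
>>> statistics.median([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
2.625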

SymPy

SymPy is the library to use when doing symbolic mathematics in Python. It is written entirely in Python, with optional extensions for speed, plotting, and interactive sessions.

SymPy’s symbolic functions operate on SymPy objects such as symbols, functions, and expressions to make other symbolic expressions, like this:

>>> import sympy as sym
>>>
>>> x = sym.Symbol('x')
>>> f = sym.exp(-x**2/2) / sym.sqrt(2 * sym.pi)
>>> f
sqrt(2)*exp(-x**2/2)/(2*sqrt(pi))

These can be symbolically or numerically integrated:

>>> sym.integrate(f, x)
erf(sqrt(2)*x/2)/2
>>>
>>> sym.N(sym.integrate(f, (x, -1, 1)))
0.682689492137086

The library can also differentiate, expand expressions into series, restrict symbols to be real, commutative, or a dozen or so other categories, locate the nearest rational number (given an accuracy) to a float, and much more.
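
For instance, continuing the session above, limits, series expansions, and derivatives look like the following (a small sketch of our own; the printed form may vary slightly between SymPy versions):

>>> sym.limit(sym.sin(x) / x, x, 0)
1
>>> sym.series(sym.cos(x), x, 0, 6)
1 - x**2/2 + x**4/24 + O(x**6)
>>> sym.diff(x * sym.sin(x), x)
x*cos(x) + sin(x)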

Text Manipulation and Text Mining

Python’s string manipulation tools are often why people start using the language to begin with. We’ll cover some highlights from Python’s Standard Library quickly, and then move to the library nearly everyone in the community uses for text mining: the Natural Language ToolKit (nltk).

String Tools in Python’s Standard Library

For languages in which simply lowercasing a character is not enough for caseless comparison (the German ß becomes “ss”, for example), str.casefold() goes a step further than str.lower():

>>> 'Grünwalder Straße'.upper()
'GRÜNWALDER STRASSE'
>>> 'Grünwalder Straße'.lower()
'grünwalder straße'
>>> 'Grünwalder Straße'.casefold()
'grünwalder strasse'

Python’s regular expression library re is comprehensive and powerful—we saw it in action in “Regular expressions (readability counts)”, so we won’t add more here, except that the help(re) documentation is so complete that you won’t need to open a browser while coding.

Finally, the difflib module in the Standard Library identifies differences between strings, and its function get_close_matches() can help with misspellings when there is a known set of correct answers (e.g., for error prompts on a travel website):

>>> import difflib
>>> capitals = ('Montgomery', 'Juneau', 'Phoenix', 'Little Rock')
>>> difflib.get_close_matches('Fenix', capitals)
['Phoenix']

nltk

The Natural Language ToolKit (nltk) is the Python tool for text analysis: originally released by Steven Bird and Edward Loper to aid students in Bird’s course on Natural Language Processing (NLP) at the University of Pennsylvania in 2001, it has grown to an expansive library covering multiple languages and containing algorithms for recent research in the field. It is available under the Apache 2.0 license and is downloaded from PyPI over 100,000 times per month. Its creators have an accompanying book, Natural Language Processing with Python (O’Reilly), that is accessible as a course text introducing both Python and NLP.

You can install nltk from the command line using pip.3 It also relies on NumPy, so install that first:

$ pip install numpy
$ pip install nltk

If you’re using Windows and can’t get NumPy to install using pip, you can try following the instructions in this Stack Overflow post. The size and scope of the library may unnecessarily scare some people away, so here’s a tiny example to demonstrate how easy simple uses can be. First, we need to get a dataset from the separately downloadable collection of corpora, which includes tagging tools for multiple languages and datasets to test algorithms against. These are licensed separately from nltk, so be sure to check your selection’s individual license. If you know the name of the corpus you want to download (in our case, the Punkt tokenizer,4 which we can use to split up text files into sentences or words), you can do it on the command line:

$ python3 -m nltk.downloader punkt --dir=/usr/local/share/nltk_data

Or you can download it in an interactive session—“stopwords” contains a list of common words that tend to overpower word counts, such as “the”, “in”, or “and” in many languages:

>>> import nltk
>>> nltk.download('stopwords', download_dir='/usr/local/share/nltk_data')
[nltk_data] Downloading package stopwords to /usr/local/share/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True

And if you don’t know the name of the corpus you want, you can launch an interactive downloader from the Python interpreter by invoking nltk.download() without its first argument:

>>> import nltk
>>> nltk.download(download_dir='/usr/local/share/nltk_data')

Then we can load the dataset we care about, and process and analyze it. In this code sample, we are loading a saved copy of the Zen of Python:

>>> import nltk
>>> from nltk.corpus import stopwords
>>> import string
>>>
>>> stopwords.ensure_loaded()  1
>>> text = open('zen.txt').read()
>>> tokens = [
...     t.casefold() for t in nltk.tokenize.word_tokenize(text)  2
...     if t not in string.punctuation
... ]
>>>
>>> counter = {}
>>> for bigram in nltk.bigrams(tokens):  3
...     counter[bigram] = 1 if bigram not in counter else counter[bigram] + 1
...
>>> def print_counts(counter):  # We'll reuse this
...     for ngram, count in sorted(
...             counter.items(), key=lambda kv: kv[1], reverse=True):  4
...         if count > 1:
...             print ('{:>25}: {}'.format(str(ngram), '*' * count))  5
...
>>> print_counts(counter)
       ('better', 'than'): ********  6
         ('is', 'better'): *******
        ('explain', 'it'): **
            ('one', '--'): **
        ('to', 'explain'): **
            ('if', 'the'): **
('the', 'implementation'): **
 ('implementation', 'is'): **
>>>
>>> kept_tokens = [t for t in tokens if t not in stopwords.words()]  7
>>>
>>> from collections import Counter  8
>>> c = Counter(kept_tokens)
>>> c.most_common(5)
[('better', 8), ('one', 3), ('--', 3), ('although', 3), ('never', 3)]
1. The corpora are loaded lazily, so we need to do this to actually load the stopwords corpus.

2. The tokenizer requires a trained model; the Punkt tokenizer (the default) comes with a model trained on English (also the default).

3. A bigram is a pair of adjacent words. We are iterating over the bigrams and counting how many times each occurs.

4. The sorted() function here is keyed on the count and sorted in reverse (descending) order.

5. The '{:>25}' right-justifies the string with a total width of 25 characters.

6. The most frequently occurring bigram in the Zen of Python is “better than.”

7. This time, to avoid high counts of “the” and “is”, we remove the stopwords.

8. In Python 3.1 and later, you can use collections.Counter for the counting.

There’s a lot more in this library—take a weekend and go for it!

SyntaxNet

Google’s SyntaxNet, built on top of TensorFlow, provides a trained English parser (named Parsey McParseface) and the framework to build other models, even in other languages, provided you have labeled data. It is currently only available for Python 2.7; detailed instructions for downloading and using it are on SyntaxNet’s main GitHub page.

Image Manipulation

The three most popular image processing and manipulation libraries in Python are Pillow (a friendly fork of the Python Imaging Library [PIL], which is good for format conversions and simple image processing), cv2 (the Python bindings for Open Source Computer Vision [OpenCV], which can be used for real-time face detection and other advanced algorithms), and the newer Scikit-Image, which provides simple image processing plus primitives like blob, shape, and edge detection. The following sections provide some more information about each of these libraries.

Pillow

The Python Imaging Library, or PIL for short, is one of the core libraries for image manipulation in Python. It was last released in 2009 and was never ported to Python 3. Luckily, there’s an actively developed fork of PIL called Pillow; it’s easier to install, runs on all operating systems, and supports Python 3.

Before installing Pillow, you’ll have to install Pillow’s prerequisites. Find the instructions for your platform in the Pillow installation instructions. After that, it’s straightforward:

$ pip install Pillow

Here is a brief example use of Pillow (yes, the module name to import from is PIL, not Pillow):

from PIL import Image, ImageFilter

# Read the image
im = Image.open('image.jpg')
# Display the image
im.show()

# Apply a sharpening filter to the image
im_sharp = im.filter(ImageFilter.SHARPEN)
# Save the filtered image to a new file
im_sharp.save('image_sharpened.jpg', 'JPEG')

# Split the image into its respective bands (i.e., Red, Green,
# and Blue for RGB)
r, g, b = im_sharp.split()

# View the EXIF data embedded in the image
exif_data = im._getexif()
print(exif_data)

There are more examples of the Pillow library in the Pillow tutorial.

cv2

Open Source Computer Vision, more commonly known as OpenCV, is a more advanced image manipulation and processing library than PIL. It is written in C and C++ and focuses on real-time computer vision. For example, it includes the first model used in real-time face detection (already trained on thousands of faces; this example shows it being used in Python code), a face recognition model, and a person detection model, among others. It has been implemented in several languages and is widely used.

In Python, image processing using OpenCV is implemented using the cv2 and NumPy libraries. OpenCV version 3 has bindings for Python 3.4 and above, but the cv2 library is still linked to OpenCV2, which does not. The installation instructions in the OpenCV tutorial page have explicit details for Windows and Fedora, using Python 2.7. On OS X, you’re on your own.5 Finally, here’s an option using Python 3 on Ubuntu. If the installation becomes difficult, you can download Anaconda and use that instead; they have cv2 binaries for all platforms, and you can consult the blog post “Up & Running: OpenCV3, Python 3, & Anaconda” to use cv2 and Python 3 on Anaconda.

Here’s an example use of cv2:

import cv2
import numpy as np  # cv2 represents images as NumPy arrays

# Read the image
img = cv2.imread('testimg.jpg')

# Display the image
cv2.imshow('image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

# Apply a grayscale filter to the image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Save the filtered image to a new file
cv2.imwrite('graytest.jpg', gray)

There are more Python-implemented examples of OpenCV in this collection of tutorials.

Scikit-Image

A newer library, Scikit-Image, is growing in popularity, thanks partly to having more of its source written in Python and to its great documentation. It doesn’t have the full-fledged real-time algorithms of cv2, which you’d still use for anything that works on live video, but it has enough to be useful for scientists: blob detection and feature detection, plus standard image processing tools like filtering and contrast adjustment. For example, Scikit-Image was used to make the image composites of Pluto’s smaller moons. There are many more examples on the main Scikit-Image page.
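
To close, here is a brief sketch of the style of the API, assuming a recent version of Scikit-Image and an arbitrary input file named testimg.jpg:

from skimage import color, feature, filters, io

# Read an image and convert it to grayscale
image = io.imread('testimg.jpg')
gray = color.rgb2gray(image)

# Edge detection with a Sobel filter
edges = filters.sobel(gray)

# Blob detection (Laplacian of Gaussian); each row is (row, col, sigma)
blobs = feature.blob_log(gray, max_sigma=30, threshold=0.1)
print(len(blobs), 'blobs found')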

1 ATLAS is an ongoing software project that provides tested, performant linear algebra libraries. It provides C and FORTRAN 77 interfaces to routines from the well-known Basic Linear Algebra Subprograms (BLAS) and Linear Algebra PACKage (LAPACK).

2 One popular tool that makes use of Python numbers is SageMath, a large, comprehensive tool that defines classes to represent fields, rings, algebras, and domains, plus provides symbolic tools derived from SymPy and numerical tools derived from NumPy, SciPy, and many other Python and non-Python libraries.

3 On Windows, it currently appears that nltk is only available for Python 2.7. Try it on Python 3, though; the labels that say Python 2.7 may just be out of date.

4 The Punkt tokenizer algorithm was introduced by Tibor Kiss and Jan Strunk in 2006, and is a language-independent way to identify sentence boundaries—for example, “Mrs. Smith and Johann S. Bach listened to Vivaldi” would correctly be identified as a single sentence. It has to be trained on a large dataset, but the default tokenizer, in English, has already been trained for us.

5 These steps worked for us: first, use brew install opencv or brew install opencv3 --with-python3. Next, follow any additional instructions (like linking NumPy). Last, add the directory containing the OpenCV shared object file (e.g., /usr/local/Cellar/opencv3/3.1.0_3/lib/python3.4/site-packages/) to your path; or, to use it only in a virtual environment, use the add2virtualenv command installed with the virtualenvwrapper library.