This chapter summarizes the popular Python libraries
related to data manipulation: numeric, text, images, and audio.
Almost all of the libraries described here serve a unique purpose, so this chapter’s goal is to describe these libraries, not compare them.
Unless noted, all of them can be installed directly
from PyPI using pip:
$ pip install library
Table 10-1 briefly describes these libraries.
| Python library | License | Reason to use |
|---|---|---|
| IPython | Apache 2.0 license | Enhanced interactive Python interpreter; the kernel behind Jupyter notebooks |
| NumPy | BSD 3-clause license | Fast multidimensional arrays and array math, backed by C and FORTRAN libraries |
| SciPy | BSD license | Linear algebra, calculus, signal processing, constants, and other scientific routines built on NumPy |
| Matplotlib | BSD license | Interactive 2D and 3D plots and manuscript-quality figures |
| Pandas | BSD license | DataFrames for indexing, merging, grouping, and time series operations |
| Scikit-Learn | BSD 3-clause license | Machine learning: regression, classification, clustering, and model tuning |
| Rpy2 | GPLv2 license | Run R functions from within Python |
| SymPy | BSD license | Symbolic mathematics: symbolic integration, differentiation, and series |
| nltk | Apache license | Natural language processing and text analysis |
| pillow / PIL | Standard PIL license | Image format conversion and simple image processing |
| cv2 | Apache 2.0 license | Computer vision: real-time face detection and other advanced algorithms |
| Scikit-Image | BSD license | Image processing: filtering, blob, shape, and edge detection |
Nearly all of the libraries described in Table 10-1 and detailed in the rest of this chapter depend on C libraries, specifically on SciPy or one of its dependencies, NumPy.
This means you may have trouble installing these
if you’re on a Windows system. If you primarily use Python for analyzing scientific data, and you’re not familiar with compiling
C and FORTRAN code on Windows already, we recommend
using Anaconda or one of the other options discussed in
“Commercial Python Redistributions”.
Otherwise, always try pip install first, and if that fails, look at the SciPy installation guide.
Python is frequently used for high-performance scientific applications. It is widely used in academia and scientific projects because it is easy to write and performs well.
Because scientific computing demands speed, scientific code in Python usually delegates the heavy numerical work to external libraries, typically written in faster languages (like C, or FORTRAN for matrix operations). The main libraries used are all part of the “SciPy Stack”: NumPy, SciPy, SymPy, Pandas, Matplotlib, and IPython. Going into detail about these libraries is beyond the scope of this book. However, a comprehensive introduction to the scientific Python ecosystem can be found in the Python Scientific Lecture Notes.
IPython is an enhanced version of the Python interpreter, with a color interface, more detailed error messages, and an inline mode that allows graphics and plots to be displayed in the terminal (in the Qt-based version). It is the default kernel for Jupyter notebooks (discussed in “Jupyter Notebooks”) and the default interpreter in the Spyder IDE (discussed in “Spyder”). IPython comes installed with Anaconda, which we described in “Commercial Python Redistributions”.
NumPy is part of the SciPy project but is released as a separate library so that people who only need the basic requirements can use it without installing the rest of SciPy. NumPy works around Python’s slower execution of numeric algorithms by providing multidimensional arrays and functions that operate on whole arrays: algorithms expressed as operations on arrays run quickly, because the heavy lifting happens in compiled code. The backend is the Automatically Tuned Linear Algebra Software (ATLAS) library,1 and other low-level libraries written in C and FORTRAN. NumPy is compatible with Python versions 2.6+ and 3.2+.
Here is an example of a matrix multiplication, using array.dot(),
and “broadcasting,” which is element-wise multiplication where the
row or column is repeated across the missing dimension:
>>> import numpy as np
>>>
>>> x = np.array([[1, 2, 3], [4, 5, 6]])
>>> x
array([[1, 2, 3],
       [4, 5, 6]])
>>>
>>> x.dot([2, 2, 1])
array([ 9, 24])
>>>
>>> x * [[1], [0]]
array([[1, 2, 3],
       [0, 0, 0]])
SciPy builds on NumPy to provide more mathematical functions. It uses NumPy arrays as its basic data structure and comes with modules for various commonly used tasks in scientific programming, including linear algebra, calculus, special functions and constants, and signal processing.
Here’s an example from SciPy’s set of physical constants:
>>> import scipy.constants
>>> fahrenheit = 212
>>> scipy.constants.F2C(fahrenheit)
100.0
>>> scipy.constants.physical_constants['electron mass']
(9.10938356e-31, 'kg', 1.1e-38)
Matplotlib is a flexible plotting library for creating interactive 2D and 3D plots that can also be saved as manuscript-quality figures. The API in many ways reflects that of MATLAB, easing transition of MATLAB users to Python. Many examples, along with the source code to re-create them, are available in the Matplotlib gallery.
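For a flavor of the API, here is a minimal sketch (the data and filename are invented for illustration) that draws two curves and saves a high-resolution figure:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)            # 200 evenly spaced points
plt.plot(x, np.sin(x), label='sin(x)')        # first curve
plt.plot(x, np.cos(x), '--', label='cos(x)')  # second curve, dashed
plt.xlabel('x (radians)')
plt.legend()
plt.savefig('waves.png', dpi=300)             # manuscript-quality output
plt.show()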
Those who work with statistics should also look at Seaborn, a newer graphics library specifically for statistics visualization that is growing in popularity. It is featured in this blog post about getting started in data science.
For web-capable plots, try Bokeh, which uses its own visualization libraries, or Plotly, which is based on the JavaScript library D3.js, although the free version of Plotly may require storing your plots on their server.
Pandas (the name is derived from Panel Data) is a data manipulation library based on NumPy that provides many useful functions for accessing, indexing, merging, and grouping data easily. Its main data structure, the DataFrame, is close to what you’d find in the R statistical software environment (i.e., heterogeneous data tables—with strings in some columns and numbers in others—with name indexing, time series operations, and auto-alignment of data). But a DataFrame can also be operated on like a SQL table or an Excel pivot table—using methods like groupby() or functions like pandas.rolling_mean().
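As a minimal sketch of those ideas (the data here is invented), grouping and a rolling mean look like this; note that newer versions of Pandas spell the rolling mean as the .rolling() method rather than pandas.rolling_mean():

import pandas as pd

# A small heterogeneous table: strings in one column, numbers in the others
df = pd.DataFrame({
    'city': ['Austin', 'Austin', 'Boston', 'Boston'],
    'year': [2015, 2016, 2015, 2016],
    'sales': [100, 150, 200, 175],
})

# Group and aggregate, much like a SQL GROUP BY or a pivot table
df.groupby('city')['sales'].mean()

# A rolling mean over a numeric column (the newer spelling of rolling_mean)
df['sales'].rolling(window=2).mean()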
Scikit-Learn is a machine learning library that provides dimension reduction, missing data imputation, regression and classification models, tree models, clustering, automatic model parameter tuning, plotting (via matplotlib), and more. It is well documented and comes with tons of examples. Scikit-Learn operates on NumPy arrays but can usually interface with Pandas data frames without much trouble.
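As a quick illustrative sketch (not taken from the Scikit-Learn documentation) of the usual fit/score workflow, assuming scikit-learn 0.18 or later, where train_test_split lives in sklearn.model_selection:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# The bundled iris dataset, returned as NumPy arrays
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a classification model, then score it on held-out data
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))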
Python defines a framework of abstract base classes for numeric types, from Number, the root of all numeric types, down through Complex, Real, and Rational to Integral. Developers can subclass these to create other numeric types, following the instructions in the numbers library.2
There is also a decimal.Decimal class that is aware of numerical precision,
for accounting and other precision-critical tasks.
The type hierarchy works as expected:
>>> import decimal
>>> import fractions
>>> from numbers import Complex, Real, Rational, Integral
>>>
>>> d = decimal.Decimal(1.11, decimal.Context(prec=5))  # precision
>>>
>>> for x in (3, fractions.Fraction(2, 3), 2.7, complex(1, 2), d):
...     print('{:>10}'.format(str(x)[:8]),
...           [isinstance(x, y) for y in (Complex, Real, Rational, Integral)])
...
         3 [True, True, True, True]
       2/3 [True, True, True, False]
       2.7 [True, True, False, False]
    (1+2j) [True, False, False, False]
  1.110000 [False, False, False, False]
The exponential, trigonometric, and other common functions are in the math library, and corresponding functions for complex numbers are in cmath. The random library provides pseudorandom numbers using the Mersenne Twister as its core generator. As of Python 3.4, the statistics module in the Standard Library provides the mean and median, as well as the sample and population standard deviation and variance.
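A quick interactive example of the statistics module (the sample data is arbitrary):

>>> import statistics
>>> data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]
>>> statistics.mean(data)
2.875
>>> statistics.median(data)
2.625
>>> statistics.stdev(data)  # sample standard deviation
1.0810874155219827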
SymPy is the library to use when doing symbolic mathematics in Python. It is written entirely in Python, with optional extensions for speed, plotting, and interactive sessions.
SymPy’s symbolic functions operate on SymPy objects such as symbols, functions, and expressions to make other symbolic expressions, like this:
>>> import sympy as sym
>>>
>>> x = sym.Symbol('x')
>>> f = sym.exp(-x**2 / 2) / sym.sqrt(2 * sym.pi)
>>> f
sqrt(2)*exp(-x**2/2)/(2*sqrt(pi))
These can be symbolically or numerically integrated:
>>> sym.integrate(f, x)
erf(sqrt(2)*x/2)/2
>>>
>>> sym.N(sym.integrate(f, (x, -1, 1)))
0.682689492137086
The library can also differentiate, expand expressions into series, restrict symbols to be real, commutative, or a dozen or so other categories, locate the nearest rational number (given an accuracy) to a float, and much more.
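For instance, symbolic differentiation and series expansion look roughly like this (the exact printed form can vary between SymPy versions):

>>> sym.diff(f, x)
-sqrt(2)*x*exp(-x**2/2)/(2*sqrt(pi))
>>> sym.series(sym.exp(x), x, 0, 4)
1 + x + x**2/2 + x**3/6 + O(x**4)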
Python’s string manipulation tools are often why people start using the language to begin with. We’ll cover some highlights from Python’s Standard Library quickly, and then move to the library nearly everyone in the community uses for text mining: the Natural Language ToolKit (nltk).
For languages whose characters have special lowercase behavior, str.casefold() performs more aggressive case removal than str.lower():
>>> 'Grünwalder Straße'.upper()
'GRÜNWALDER STRASSE'
>>> 'Grünwalder Straße'.lower()
'grünwalder straße'
>>> 'Grünwalder Straße'.casefold()
'grünwalder strasse'
Python’s regular expression library re is comprehensive
and powerful—we saw it in action in
“Regular expressions (readability counts)”, so we won’t add more here, except
that the help(re) documentation
is so complete that you won’t need to open a browser
while coding.
Finally, the difflib module in the Standard
Library identifies differences between strings, and has
a function get_close_matches() that can help with misspellings
when there is a known set of correct answers (e.g., for error prompts
on a travel website):
>>> import difflib
>>> capitals = ('Montgomery', 'Juneau', 'Phoenix', 'Little Rock')
>>> difflib.get_close_matches('Fenix', capitals)
['Phoenix']
The Natural Language ToolKit (nltk) is the Python tool for text analysis: originally released by Steven Bird and Edward Loper to aid students in Bird’s course on Natural Language Processing (NLP) at the University of Pennsylvania in 2001, it has grown to an expansive library covering multiple languages and containing algorithms for recent research in the field. It is available under the Apache 2.0 license and is downloaded from PyPI over 100,000 times per month. Its creators have an accompanying book, Natural Language Processing with Python (O’Reilly), that is accessible as a course text introducing both Python and NLP.
You can install nltk from the command line using pip.3 It also relies on NumPy,
so install that first:
$ pip install numpy
$ pip install nltk
If you’re using Windows and can’t get NumPy to install using pip, you can try following the instructions in this Stack Overflow post.
The size and scope of the library may unnecessarily scare some people
away, so here’s a tiny example to demonstrate how easy simple
uses can be. First, we need to get a dataset from the
separately downloadable
collection of
corpora,
including tagging tools for multiple languages
and datasets to test algorithms against.
These are licensed separately from nltk, so be sure to check your selection’s individual license.
If you know the name of the corpus you want to download
(in our case, the Punkt tokenizer,4 which we can use to split up text files into sentences
or words),
you can do it on the command line:
$ python3 -m nltk.downloader punkt --dir=/usr/local/share/nltk_data
Or you can download it in an interactive session—“stopwords” contains a list of common words that tend to overpower word counts, such as “the”, “in”, or “and” in many languages:
>>> import nltk
>>> nltk.download('stopwords', download_dir='/usr/local/share/nltk_data')
[nltk_data] Downloading package stopwords to /usr/local/share/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True
And if you don’t know the name of the corpus you want,
you can launch an interactive downloader from the Python
interpreter by invoking nltk.download() without its
first argument:
>>> import nltk
>>> nltk.download(download_dir='/usr/local/share/nltk_data')
Then we can load the dataset we care about, and process and analyze it. In this code sample, we are loading a saved copy of the Zen of Python:
>>> import nltk
>>> from nltk.corpus import stopwords
>>> import string
>>>
>>> stopwords.ensure_loaded()
>>> text = open('zen.txt').read()
>>> tokens = [
...     t.casefold() for t in nltk.tokenize.word_tokenize(text)
...     if t not in string.punctuation
... ]
>>>
>>> counter = {}
>>> for bigram in nltk.bigrams(tokens):
...     counter[bigram] = 1 if bigram not in counter else counter[bigram] + 1
...
>>> def print_counts(counter):  # We'll reuse this
...     for ngram, count in sorted(
...             counter.items(), key=lambda kv: kv[1], reverse=True):
...         if count > 1:
...             print('{:>25}: {}'.format(str(ngram), '*' * count))
...
>>> print_counts(counter)
       ('better', 'than'): ********
         ('is', 'better'): *******
        ('explain', 'it'): **
            ('one', '--'): **
        ('to', 'explain'): **
            ('if', 'the'): **
('the', 'implementation'): **
 ('implementation', 'is'): **
>>>
>>> kept_tokens = [t for t in tokens if t not in stopwords.words()]
>>>
>>> from collections import Counter
>>> c = Counter(kept_tokens)
>>> c.most_common(5)
[('better', 8), ('one', 3), ('--', 3), ('although', 3), ('never', 3)]

Some notes on this example:

1. The corpora are loaded lazily, so calling stopwords.ensure_loaded() here actually loads the stopwords corpus.
2. The tokenizer requires a trained model—the Punkt tokenizer (the default) comes with a model trained on English (also the default).
3. A bigram is a pair of adjacent words. We iterate over the bigrams and count how many times each one occurs.
4. The sorted() function here is keyed on the count and sorted in reverse (descending) order.
5. The format specifier '{:>25}' right-justifies the string in a field 25 characters wide.
6. The most frequently occurring bigram in the Zen of Python is “better than.”
7. This time, to avoid high counts of “the” and “is”, we remove the stopwords.
8. In Python 3.1 and later, you can use collections.Counter for the counting.
There’s a lot more in this library—take a weekend and go for it!
Google’s SyntaxNet, built on top of TensorFlow, provides a trained English parser (named Parsey McParseface) and the framework to build other models, even in other languages, provided you have labeled data. It is currently only available for Python 2.7; detailed instructions for downloading and using it are on SyntaxNet’s main GitHub page.
The three most popular image processing and manipulation libraries in Python are Pillow (a friendly fork of the Python Imaging Library [PIL]—which is good for format conversions and simple image processing), cv2 (the Python bindings for OpenSource Computer Vision [OpenCV] that can be used for real-time face detection and other advanced algorithms), and the newer Scikit-Image, which provides simple image processing, plus primitives like blob, shape, and edge detection. The following sections provide some more information about each of these libraries.
The Python Imaging Library, or PIL for short, is one of the core libraries for image manipulation in Python. It was last released in 2009 and was never ported to Python 3. Luckily, there’s an actively developed fork of PIL called Pillow—it’s easier to install, runs on all operating systems, and supports Python 3.
Before installing Pillow, you’ll have to install Pillow’s prerequisites. Find the instructions for your platform in the Pillow installation instructions. After that, it’s straightforward:
$ pip install Pillow
Here is a brief example use of Pillow (yes, the module
name to import from is PIL not Pillow):
from PIL import Image, ImageFilter

# Read image
im = Image.open('image.jpg')
# Display image
im.show()

# Applying a filter to the image
im_sharp = im.filter(ImageFilter.SHARPEN)
# Saving the filtered image to a new file
im_sharp.save('image_sharpened.jpg', 'JPEG')

# Splitting the image into its respective bands (i.e., Red, Green,
# and Blue for RGB)
r, g, b = im_sharp.split()

# Viewing EXIF data embedded in image
exif_data = im._getexif()
exif_data
There are more examples of the Pillow library in the Pillow tutorial.
Open Source Computer Vision, more commonly known as OpenCV, is more advanced image manipulation and processing software than PIL. It is written in C and C++ and focuses on real-time computer vision. For example, it has the first model used in real-time face detection (already trained on thousands of faces; this example shows it being used in Python code), a face recognition model, and a person detection model, among others. It has been implemented in several languages and is widely used.
In Python, image processing using OpenCV is implemented using the cv2 and NumPy libraries. OpenCV version 3 has bindings for Python 3.4 and above, but the cv2 library is still linked to OpenCV 2, which does not. The installation instructions in the OpenCV tutorial page have explicit details for Windows and Fedora, using Python 2.7. On OS X, you’re on your own.5 Finally, here’s an option using Python 3 on Ubuntu. If the installation becomes difficult, you can download Anaconda and use that instead; they have cv2 binaries for all platforms, and you can consult the blog post “Up & Running: OpenCV3, Python 3, & Anaconda” to use cv2 and Python 3 on Anaconda.
Here’s an example use of cv2:
import cv2
import numpy as np

# Read image
img = cv2.imread('testimg.jpg')

# Display image
cv2.imshow('image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

# Applying a grayscale filter to the image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Saving the filtered image to a new file
cv2.imwrite('graytest.jpg', gray)
There are more Python-implemented examples of OpenCV in this collection of tutorials.
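To give a flavor of the face detection mentioned earlier, here is a hedged sketch using one of OpenCV’s bundled Haar cascades; the cascade file’s location varies by platform and OpenCV version, so the path below is only a placeholder:

import cv2

# Placeholder path: locate haarcascade_frontalface_default.xml in your
# OpenCV installation's data directory
cascade_path = '/usr/local/share/OpenCV/haarcascades/haarcascade_frontalface_default.xml'
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread('group_photo.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Each detection is an (x, y, width, height) bounding box
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)

cv2.imwrite('faces_detected.jpg', img)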
A newer library, Scikit-Image, is growing in popularity, thanks partly to having more of its source written in Python and to its great documentation. It doesn’t have the full-fledged algorithms of cv2, which you’d still use for work on real-time video, but it has enough to be useful for scientists—like blob detection and feature detection, plus the standard image processing tools like filtering and contrast adjustment. For example, Scikit-Image was used to make the image composites of Pluto’s smaller moons. There are many more examples on the main Scikit-Image page.
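Here is a small sketch of that kind of use, assuming a reasonably recent Scikit-Image (0.11 or later, where the filter module is named skimage.filters) and one of its bundled sample images:

from skimage import data, feature, filters

image = data.coins()  # a sample grayscale image bundled with the library

# Edge detection with a Sobel filter
edges = filters.sobel(image)

# Blob detection (Laplacian of Gaussian); each row is (row, col, sigma)
blobs = feature.blob_log(image, max_sigma=30, threshold=0.1)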
1 ATLAS is an ongoing software project that provides tested, performant linear algebra libraries. It provides C and FORTRAN 77 interfaces to routines from the well-known Basic Linear Algebra Subprograms (BLAS) and Linear Algebra PACKage (LAPACK).
2 One popular tool that makes use of Python numbers is SageMath—a large, comprehensive tool that defines classes to represent fields, rings, algebras, and domains, plus provides symbolic tools derived from SymPy and numerical tools derived from NumPy, SciPy, and many other Python and non-Python libraries.
3 On Windows, it currently appears that nltk is only available for Python 2.7. Try it on Python 3, though; the labels that say Python 2.7 may just be out of date.
4 The Punkt tokenizer algorithm was introduced by Tibor Kiss and Jan Strunk in 2006, and is a language-independent way to identify sentence boundaries—for example, “Mrs. Smith and Johann S. Bach listened to Vivaldi” would correctly be identified as a single sentence. It has to be trained on a large dataset, but the default tokenizer, in English, has already been trained for us.
5 These steps worked for us: first, use brew install opencv or brew install opencv3 --with-python3. Next, follow any additional instructions (like linking NumPy). Last, add the directory containing the OpenCV shared object file (e.g., /usr/local/Cellar/opencv3/3.1.0_3/lib/python3.4/site-packages/) to your path; or to only use it in a virtual environment, use the add2virtualenvironment command installed with the virtualenvwrapper library.