We're going to acquire intelligence data from a variety of sources. In the previous chapter, we searched the WWW. We might use our own cameras or recording devices. We'll look at image processing and encoding in this chapter.
To work with images in Python, we'll need to install Pillow. This library gives us software tools to process image files. Pillow is a fork of the older PIL project; Pillow is a bit nicer to use than PIL.
Along the way, we'll visit some additional Python programming techniques.
Python is a very powerful programming language. In this chapter, we'll see a lot of sophistication available. We'll also lay a foundation to look at web services and geocoding in the next chapter.
As we've observed so far, our data comes in a wide variety of physical formats. In Chapter 1, Our Espionage Toolkit, we looked at ZIP files, which are archives that contain other files. In Chapter 2, Acquiring Intelligence Data, we looked at JSON files, which serialize many kinds of Python objects.
In this chapter, we're going to review some of these techniques and then look specifically at working with CSV files. Most importantly, we'll examine the various kinds of image files that we might need to work with.
In all cases, Python encourages looking at a file as a kind of context. This means that we should strive to open files using the with statement so that we can be sure the file is properly closed when we're done with the processing. This doesn't always work out perfectly, so there are some exceptions.
There are many modules for working with files. We'll focus on two: glob and os.
The glob module implements filesystem globbing rules. When we use *.jpg in a command at the terminal prompt, a standard OS shell tool will glob or expand the wildcard name into a matching list of actual file names, as shown in the following snippet:
MacBookPro-SLott:code slott$ ls *.jpg
1drachmi_1973.jpg                  IPhone_Internals.jpg
Common_face_of_one_euro_coin.jpg   LHD_warship.jpg
The POSIX standard is for *.jpg to be expanded by the shell, prior to the ls program being run. In Windows, this is not always the case.
The Python glob module contains the glob() function that does this job from within a Python program. Here's an example:
>>> import glob
>>> glob.glob("*.jpg")
['1drachmi_1973.jpg', 'Common_face_of_one_euro_coin.jpg', 'IPhone_Internals.jpg', 'LHD_warship.jpg']

When we evaluated glob.glob("*.jpg"), the return value was a list of strings with the names of matching files.
Many files have a path/name.extension format. For Windows, a device prefix and backslashes are used (C:\path\name.ext). The Python os package provides a path module with a number of functions for working with file names and paths irrespective of any vagaries of punctuation. As the path module is in the os package, the components will have two levels of namespace containers: os.path.
We must always use functions from the os.path module for working with filenames. There are numerous functions to split paths, join paths, and create absolute paths from relative paths. For example, we should use os.path.splitext() to separate a filename from the extension. Here's an example:
>>> import os
>>> os.path.splitext( "1drachmi_1973.jpg" )
('1drachmi_1973', '.jpg')

We've separated the filename from the extension without writing any of our own code. There's no reason to write our own parsers when the standard library already has them written.
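The other os.path functions follow the same pattern. Here's a short sketch; the path is made up for the example:

```python
import os.path

# A hypothetical path to pull apart.
path = "data/images/warship.jpg"

# Split into directory and filename.
directory, filename = os.path.split(path)

# Separate the extension, then build a related name with os.path.join().
base, ext = os.path.splitext(filename)
thumbnail = os.path.join(directory, base + "_thumb" + ext)

print(directory)  # data/images
print(thumbnail)  # data/images/warship_thumb.jpg on POSIX
```

Because os.path.join() uses the platform's separator, the same code produces correct paths on Windows as well.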
In some cases, our files contain ordinary text. In this case, we can open the file and process the lines as follows:
with open("some_file") as data:
    for line in data:
        ... process the line ...

This is the most common way to work with text files. Each line processed by the for loop will include a trailing \n character.
We can use a simple generator expression to strip the trailing whitespace, including the \n character, from each line:
with open("some_file") as data:
    for line in (raw.rstrip() for raw in data):
        ... process the line ...

We've inserted a generator expression into the for statement. The generator expression has three parts: a subexpression (raw.rstrip()), a target variable (raw), and a source iterable collection (data). Each line in the source iterable, data, is assigned to the target, raw, and the subexpression is evaluated. Each result from the generator expression is made available to the outer for loop.
We can visually separate the generator expression into a separate line of code:
with open("some_file") as data:
    clean_lines = (raw.rstrip() for raw in data)
    for line in clean_lines:
        ... process the line ...

We wrote the generator expression outside the for statement. We assigned the generator, not a resulting collection, to the clean_lines variable to clarify its purpose. A generator doesn't produce any output until the individual lines are required by another iterator, in this case, the for loop. There's no real overhead: the processing is simply separated visually.
This technique allows us to separate different design considerations. We can separate the text cleanup from the important processing inside the for statement.
We can expand on the cleanup by writing additional generators:
with open("some_file") as data:
    clean_lines = (raw.rstrip() for raw in data)
    non_blank_lines = (line for line in clean_lines if len(line) != 0)
    for line in non_blank_lines:
        ... process the line ...

We've broken the preprocessing down into two separate generator expressions. The first expression removes the \n character from the end of each line. The second generator expression uses the optional if clause: it gets lines from the first generator expression and only passes along lines whose length is not 0. This is a filter that rejects blank lines. The final for statement only gets nonblank lines that have had the \n character removed.
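Here is the same pipeline as a self-contained sketch; io.StringIO stands in for an open file so that the example runs as written:

```python
import io

# io.StringIO provides a file-like object; iterating it yields lines,
# just as iterating an open text file does.
data = io.StringIO("alpha\n\nbeta\n  \ngamma\n")

# First generator: strip trailing whitespace (including \n).
clean_lines = (raw.rstrip() for raw in data)
# Second generator: filter out blank lines.
non_blank_lines = (line for line in clean_lines if len(line) != 0)

result = list(non_blank_lines)
print(result)  # ['alpha', 'beta', 'gamma']
```

Nothing is processed until list() pulls values through the chain; the generators do their work lazily, one line at a time.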
A ZIP archive contains one or more files. To work with ZIP archives, we need to import the zipfile module:
import zipfile
Generally, we can open an archive using something like the following:
with zipfile.ZipFile("demo.zip", "r") as archive:

This creates a context so that we can work with the file and be sure that it's properly closed at the end of the indented context.
When we want to create an archive, we can provide an additional parameter:
with zipfile.ZipFile("test.zip", "w", compression=zipfile.ZIP_DEFLATED) as archive:

This will create a ZIP file that uses a simple compression algorithm to save space. If we're reading members of a ZIP archive, we can use a nested context to open a member file, as shown in the following snippet:
with archive.open("some_file") as member:
    ...process member...

As we showed in Chapter 1, Our Espionage Toolkit, once we've opened a member for reading, it's similar to an ordinary OS file. The nested context allows us to use ordinary file processing operations on the member. We used the following example earlier:
import zipfile
with zipfile.ZipFile("demo.zip", "r") as archive:
    archive.printdir()
    first = archive.infolist()[0]
    with archive.open(first) as member:
        text = member.read()
        print(text)

We used a context to open the archive. We used a nested context to open a member of the archive. Not all files can be read this way. Members that are images, for example, can't be read directly by Pillow; they must be extracted to a temporary file. We'd do something like this:
import zipfile
with zipfile.ZipFile("photos.zip", "r") as archive:
    archive.extract("warship.png")

This will extract a member named warship.png from the archive and create a local file. Pillow can then work with the extracted file.
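Combining these operations, here's a runnable sketch that builds a small archive and then reads a member back; the member name and text are made up for the example, and an in-memory buffer stands in for a file on disk:

```python
import io
import zipfile

# Build a small archive in memory; a filename would work the same way.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("message.txt", "The eagle has landed.")

# Reopen the archive and read the member through a nested context.
with zipfile.ZipFile(buffer, "r") as archive:
    with archive.open("message.txt") as member:
        text = member.read().decode("utf-8")

print(text)  # The eagle has landed.
```

Note that member.read() returns bytes; we decode them to get a string, just as we do with files downloaded from the web.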
A JSON file contains a Python object that's been serialized in JSON notation. To work with JSON files, we need to import the json module:
import json
The file processing context doesn't really apply well to JSON files. We don't generally have the file open for any extended time when processing it. Often, the with statement context is just one line of code. We might create a file like this:
...create an_object...

with open("some_file.json", "w") as output:
    json.dump(an_object, output)

This is all that's required to create a JSON-encoded file. Often, we'll contrive to make the object we're serializing a list or a dict so that we can save multiple objects in a single file. To retrieve the object, we generally do something that's similarly simple, as shown in the following code:
with open("some_file.json") as input:
    an_object = json.load(input)
...process an_object...

This will decode the object and save it in the given variable. If the file contains a list, we can iterate through the object to process each item in the list. If the file contains a dictionary, we might work with specific key values of this dictionary.
Once the Python object has been created, we no longer need the file context. The resources associated with the file can be released, and we can focus our processing steps on the resulting object.
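The whole round trip can be sketched without touching the filesystem; io.StringIO stands in for the file, and the object here is made up for the example:

```python
import io
import json

# A made-up object to serialize: a dict containing a list.
an_object = {"source": "field agent", "items": [1, 2, 3]}

# Serialize: json.dump() writes JSON text to any file-like object.
output = io.StringIO()
json.dump(an_object, output)

# Deserialize: json.load() reads the object back.
input_file = io.StringIO(output.getvalue())
recovered = json.load(input_file)

print(recovered == an_object)  # True
```

Once json.load() returns, the file-like object is no longer needed; all further processing works on the recovered Python object.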
CSV stands for comma-separated values. While one of the most common CSV formats uses the quote character and commas, the CSV idea is readily applicable to any file that has a column-separator character. We might have a file with each data item separated by tab characters, written as \t in Python. This is also a kind of CSV file that uses the tab character to fill the role of a comma.
We'll use the csv module to process these files:
import csv
When we open a CSV file, we must create a reader or writer that parses the various rows of data in the file. Let's say we downloaded the historical record of bitcoin prices. You can download this data from https://coinbase.com/api/doc/1.0/prices/historical.html. See Chapter 2, Acquiring Intelligence Data, for more information.
The data is in the CSV notation. Once we've read the string, we need to create a CSV reader around the data. As the data was just read into a big string variable, we don't need to use the filesystem. We can use in-memory processing to create a file-like object, as shown in the following code:
import io
import urllib.request

# query_history is the URL string built in Chapter 2.
with urllib.request.urlopen(query_history) as document:
    history_data = document.read().decode("utf-8")
reader = csv.reader(io.StringIO(history_data))

We've used the urllib.request.urlopen() function to make a GET request to the given URL. The response body will be in bytes. We decoded the characters from these bytes and saved them in a variable named history_data.
In order to make this data amenable to the csv.reader() function, we used the io.StringIO class to wrap the string. This creates a file-like object without actually wasting time to create a file on the disk somewhere.
We can now read individual rows from the reader object, as shown in the following code:
for row in reader:
    print(row)

This for loop will step through each row of the CSV file. The various columns of data will be separated; each row will be a list of the individual column values.
If we have tab-separated data, we'd modify the reader by providing additional details about the file format. We might, for example, use rdr= csv.reader(some_file, delimiter='\t') to specify that there are tab-separated values instead of comma-separated ones.
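Here's a self-contained sketch of reading tab-separated data; the rows are made up for the example:

```python
import csv
import io

# Two rows of tab-separated data, wrapped as a file-like object.
raw_data = "date\tprice\n2014-05-01\t457.20\n"
reader = csv.reader(io.StringIO(raw_data), delimiter='\t')

rows = list(reader)
print(rows)  # [['date', 'price'], ['2014-05-01', '457.20']]
```

Everything else about the reader works exactly as before; only the delimiter argument changes.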
An image is composed of picture elements called pixels. Each pixel is a dot. For computer displays, the individual dots are encoded using red-green-blue (RGB) colors. Each displayed pixel is a sum of the levels of red, green, and blue light. For printing, the colors might be switched to cyan-magenta-yellow-key (CMYK) colors.
An image file contains an encoding of the various pixels of the image. The image file may also contain metadata about the image. The metadata information is sometimes called tags; photographic metadata commonly follows the Exif standard.
An image file can use a variety of encodings for each pixel. A pure black and white image only needs 1 bit for each pixel. High-quality photographs may use one byte for each color, leading to 24 bits per pixel. In some cases, we might add a transparency channel or use even greater color depth. This can lead to four bytes per pixel.
The issue rapidly turns into a question of the amount of storage required. A picture that fills an iPhone display has 326 pixels per inch. The display has 1136 by 640 pixels. If each pixel uses 4 bytes of color information, then the image involves 3 MB of memory.
Consider a scanned image of 8 1/2" by 11" at 326 pixels per inch. The image is 2771 x 3586 pixels; at four bytes per pixel, that's nearly 40 MB. Some scanners are capable of producing images at 1200 pixels per inch; at that resolution, the same page would be 10,200 x 13,200 pixels, over 500 MB.
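The arithmetic is easy to check with a few lines of Python (using decimal megabytes):

```python
# Storage for an uncompressed image: width x height x bytes per pixel.
def image_size(width_px, height_px, bytes_per_pixel=4):
    return width_px * height_px * bytes_per_pixel

MB = 1_000_000  # decimal megabytes

# iPhone display: 1136 x 640 pixels, 4 bytes per pixel.
print(image_size(1136, 640) / MB)  # about 2.9 MB

# Scanned page: 8.5" x 11" at 326 pixels per inch.
print(image_size(int(8.5 * 326), int(11 * 326)) / MB)  # about 39.7 MB
```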
Different image files reflect different strategies to compress this immense amount of data without losing the quality of the image.
A naive compression algorithm can make the files somewhat smaller. TIFF files, for example, use fairly simple compression. The algorithms used by JPEG, however, are quite sophisticated and lead to relatively small file sizes while retaining much (but not all) of the original image. Because JPEG discards details to achieve good compression, it is a poor choice for steganography, where we'll be tweaking individual bits to conceal a message in an image.
We call JPEG compression lossy because some bits are lost; once lost, they can't be recovered. We call TIFF compression lossless because all the original bits can be recovered. As our hidden message will only be tweaking a few bits, JPEG compression can corrupt it.
When we work with images in Pillow, it will be similar to working with a JSON file. We'll open and load the image. We can then process the object in our program. When we're done, we'll save the modified image.
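As a minimal sketch of that open-process-save cycle: since we can't assume a photograph on disk, Image.new() stands in for Image.open(), and a BytesIO buffer stands in for the saved file. Pillow must be installed (pip install Pillow):

```python
import io
from PIL import Image  # Pillow; install with: pip install Pillow

# Create a small solid-color image in place of a real photograph.
image = Image.new("RGB", (64, 48), color=(0, 128, 255))

# Process the object: here, a simple 90-degree rotation.
rotated = image.rotate(90, expand=True)

# Save the modified image; a BytesIO buffer stands in for a real file.
buffer = io.BytesIO()
rotated.save(buffer, format="PNG")

print(rotated.size)  # (48, 64)
```

In real use, Image.open("some_photo.jpg") replaces Image.new(), and save() is given a filename; the intermediate processing steps are the same.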