We're going to acquire intelligence data from a variety of sources. We might interview people. We might steal files from a secret underground base. We might search the World Wide Web (WWW), and this is what we'll focus on in this chapter. Using our own cameras or recording devices is the subject of the next chapter.
Important espionage targets include natural resources, popular opinion, and strategic economic strengths. This kind of background information is useful in a number of ways. A great deal of the world's data is already on the Web, and the rest will get there eventually. Any modern search for intelligence starts with the Web.
We can use Python libraries such as http.client and urllib to get data from remote servers and transfer files to other servers. Once we've found remote files of interest, we'll need a number of Python libraries to parse those files and extract data from them.
In Chapter 1, Our Espionage Toolkit, we looked at how we can peek inside a ZIP archive. We'll look inside other kinds of files in this chapter. We'll focus on JSON files, because they're widely used for web services APIs.
Along the way, we'll cover a number of related topics.
The WWW and Internet are based on a series of agreements called Requests for Comments (RFCs). The RFCs define the standards and protocols to interconnect different networks, that is, the rules for internetworking. The WWW is defined by a subset of these RFCs that specifies the protocols, behaviors of hosts and agents (servers and clients), and file formats, among other details.
In a way, the Internet is a controlled chaos. Most software developers agree to follow the RFCs. Some don't. If their idea is really good, it can catch on, even though it doesn't precisely follow the standards. We often see this in the way some browsers don't work with some websites. This can cause confusion and questions. We'll often have to perform both espionage and plain old debugging to figure out what's available on a given website.
Python provides a variety of modules that implement the software defined in the Internet RFCs. We'll look at some of the common protocols to gather data through the Internet and the Python library modules that implement these protocols.
The essential idea behind the WWW is the Internet. The essential idea behind the Internet is the TCP/IP protocol stack. The IP part of this is the internetworking protocol. This defines how messages can be routed between networks. Layered on top of IP is the TCP protocol to connect two applications to each other. TCP connections are often made via a software abstraction called a socket. In addition to TCP, there's also UDP; it's not used as much for the kind of WWW data we're interested in.
In Python, we can use the low-level socket library to work with the TCP protocol, but we won't. A socket is a file-like object that supports open, close, input, and output operations. Our software will be much simpler if we work at a higher level of abstraction. The Python libraries that we'll use will leverage the socket concept under the hood.
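Just to see what those higher-level libraries are doing for us under the hood, here's a minimal sketch of a raw-socket HTTP request. The www.example.org host is purely illustrative, and we won't use this approach in the rest of the chapter:

import socket

# Open a TCP connection to a web server and speak HTTP by hand.
with socket.create_connection(("www.example.org", 80)) as sock:
    sock.sendall(b"GET / HTTP/1.1\r\nHost: www.example.org\r\nConnection: close\r\n\r\n")
    reply = b""
    while True:
        block = sock.recv(4096)
        if not block:
            break
        reply += block

# The reply bytes hold the status line, the headers, and the HTML body.
print(reply.decode("utf-8", errors="replace")[:200])

Everything the higher-level libraries do ultimately boils down to exchanges like this one.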
The Internet RFCs define a number of protocols that build on TCP/IP sockets. These are more useful definitions of interactions between host computers (servers) and user agents (clients). We'll look at two of these: Hypertext Transfer Protocol (HTTP) and File Transfer Protocol (FTP).
The essence of web traffic is HTTP. This is built on TCP/IP. HTTP defines two roles: host and user agent, also called server and client, respectively. We'll stick to server and client. HTTP defines a number of kinds of request types, including GET and POST.
A web browser is one kind of client software we can use. This software makes GET and POST requests, and displays the results from the web server. We can do this kind of client-side processing in Python using two library modules.
The http.client module allows us to make GET and POST requests as well as PUT and DELETE. We can read the response object. Sometimes, the response is an HTML page. Sometimes, it's a graphic image. There are other things too, but we're mostly interested in text and graphics.
Here's a picture of a mysterious device we've been trying to find. We need to download this image from http://upload.wikimedia.org/wikipedia/commons/7/72/IPhone_Internals.jpg to our computer so that we can see it and send it to our informant.

Here's a picture of the currency we're supposed to track down and pay with:

We need to download this image. Here is the link:
http://upload.wikimedia.org/wikipedia/en/c/c1/1drachmi_1973.jpg
Here's how we can use http.client to get these two image files:
import http.client
import contextlib

path_list = [
    "/wikipedia/commons/7/72/IPhone_Internals.jpg",
    "/wikipedia/en/c/c1/1drachmi_1973.jpg",
]
host = "upload.wikimedia.org"

with contextlib.closing(http.client.HTTPConnection(host)) as connection:
    for path in path_list:
        connection.request("GET", path)
        response = connection.getresponse()
        print("Status:", response.status)
        print("Headers:", response.getheaders())
        _, _, filename = path.rpartition("/")
        print("Writing:", filename)
        with open(filename, "wb") as image:
            image.write(response.read())

We're using http.client to handle the client side of the HTTP protocol. We're also using the contextlib module to politely disentangle our application from network resources when we're done using them.
We've assigned a list of paths to the path_list variable. This example introduces list objects without providing any background. We'll return to lists in the Organizing collections of data section later in the chapter. It's important that lists are surrounded by [] and the items are separated by ,. Yes, there's an extra , at the end. This is legal in Python.
We created an http.client.HTTPConnection object using the host computer name. This connection object is a little like a file; it entangles Python with operating system resources on our local computer plus a remote server. Unlike a file, an HTTPConnection object isn't a proper context manager. As we really like context managers to release our resources, we made use of the contextlib.closing() function to handle the context management details. The connection needs to be closed; the closing() function assures that this will happen by calling the connection's close() method.
For all of the paths in our path_list, we make an HTTP GET request. This is what browsers do to get the image files mentioned in an HTML page. We print a few things from each response. The status, if everything worked, will be 200. If the status is not 200, then something went wrong and we'll need to read up on the HTTP status code to see what happened.
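If we want the script to react to a bad status instead of just reporting it, a small check inside the for loop would do. As a sketch, the http.client.responses dictionary maps numeric codes to their standard reason phrases:

        if response.status != 200:
            reason = http.client.responses.get(response.status, "unknown")
            print("Problem with", path, ":", response.status, reason)
            continue  # don't try to save a file for this path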
An HTTP response includes headers that provide some additional details about the request and response. We've printed the headers because they can be helpful in debugging any problems we might have. One of the most useful headers is ('Content-Type', 'image/jpeg'). This confirms that we really did get an image.
We used _, _, filename = path.rpartition("/") to locate the right-most / character in the path. Recall that the partition() method locates the left-most instance. We're using the right-most one here. We assigned the directory information and separator to the variable _. Yes, _ is a legal variable name. It's easy to ignore, which makes it a handy shorthand for we don't care. We kept the filename in the filename variable.
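Here's a quick interactive look at the difference, using one of the paths from our list:

>>> path = "/wikipedia/en/c/c1/1drachmi_1973.jpg"
>>> path.rpartition("/")
('/wikipedia/en/c/c1', '/', '1drachmi_1973.jpg')
>>> path.partition("/")
('', '/', 'wikipedia/en/c/c1/1drachmi_1973.jpg')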
We create a nested context for the resulting image file. We can then read the body of the response—a collection of bytes—and write these bytes to the image file. In one quick motion, the file is ours.
The HTTP GET request is what underlies much of the WWW. Programs such as curl and wget are expansions of this example. They execute batches of GET requests to locate one or more pages of content. They can do quite a bit more, but this is the essence of extracting data from the WWW.
An HTTP GET request includes several headers in addition to the URL. In the previous example, we simply relied on the Python http.client library to supply a suitable set of default headers. There are several reasons why we might want to supply different or additional headers.
First, we might want to tweak the User-Agent header to change the kind of browser that we're claiming to be. We might also need to provide cookies for some kinds of interactions. For information on the user agent string, see http://en.wikipedia.org/wiki/User_agent_string#User_agent_identification.
This information may be used by the web server to determine if a mobile device or desktop device is being used. We can use something like this:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14
This makes our Python request appear to come from the Safari browser instead of a Python application. We can use something like this to appear to be a different browser on a desktop computer:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0
We can use something like this to appear to be an iPhone instead of a Python application:
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D201 Safari/9537.53
We make this change by adding headers to the request we're making. The change looks like this:
connection.request( "GET", path, headers= {
'User-Agent':
'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D201 Safari/9537.53',
})This will make the web server treat our Python application like it's on an iPhone. This might lead to a more compact page of data than might be provided to a full desktop computer that makes the same request.
The header information is a structure with the { key: value, } syntax. This is a dictionary. We'll return to dictionaries in the following Organizing collections of data section. It's important that dictionaries are surrounded by {}, the keys and values are separated by :, and each key-value pair is separated by ,. Yes, there's an extra , at the end. This is legal in Python.
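As a quick preview, here's what creating a small dictionary and looking up a value looks like interactively; the values are shortened for readability:

>>> headers = {'User-Agent': 'Mozilla/5.0 (iPhone; ...)', 'Accept': 'image/jpeg'}
>>> headers['User-Agent']
'Mozilla/5.0 (iPhone; ...)'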
There are many more HTTP headers we can provide. The User-Agent header is perhaps the most important one for gathering different kinds of intelligence data from web servers.
FTP specifies ways to transfer files between computers. There are two principal variants: the original FTP and the more secure version, FTPS. This more secure version uses SSL to assure that the lower-level sockets are fully encrypted. It's sometimes called FTP_TLS, FTP with Transport Layer Security.
The SSH standard includes a file-transfer protocol, SFTP. This is a part of SSH and is separate from the other FTP variants. It isn't supported by the ftplib module, because it's really a different protocol.
In some cases, FTP access is anonymous. No security credentials (such as usernames or passwords) are used. This is usually reserved for download-only content. Sometimes, anonymous access expects a placeholder username and password—the username should be anonymous, and typically, your e-mail address is used as a password. In other cases, we need to have proper credentials. We'll focus on publicly accessible FTP servers.
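For the secured variant, the ftplib module provides the FTP_TLS class. Here's a minimal sketch of how a credentialed, secured session might look; the host name and credentials are placeholders, not a real server:

import ftplib

# Placeholder host and credentials -- substitute a server you actually have access to.
with ftplib.FTP_TLS("ftp.example.com") as connection:
    connection.login(user="agent", passwd="secret")
    connection.prot_p()       # encrypt the data connection as well as the control connection
    print(connection.nlst())  # list the names in the top-level directory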
We're going to look for the CIA World Factbooks. We know that there are copies in Project Gutenberg. This leads us to use the ftp.ibiblio.org server as the target of our investigation. The base URL is ftp://ftp.ibiblio.org/pub/docs/books/gutenberg/.
FTP has its own language of commands used to examine remote (and local) filesystems, create and remove directories, as well as get and put files. Some of this language is exposed through the Python FTP module. Some of it is kept hidden under the hood.
We can see some top-level documents available on the Project Gutenberg server with a script like the following. Here's our initial step in discovering the data:
import ftplib

host = "ftp.ibiblio.org"
root = "/pub/docs/books/gutenberg/"

def directory_list(path):
    with ftplib.FTP(host, user="anonymous") as connection:
        print("Welcome", connection.getwelcome())
        for name, details in connection.mlsd(path):
            print(name, details['type'], details.get('size'))

directory_list(root)

We imported the FTP library. We'll need this to do anything using the FTP protocol. We assigned the host name to the host variable and the root path to the root variable as strings. We'll use these in several functions that we need to define.
We defined a directory_list() function that will display names, types, and sizes from a directory. This lets us explore the directories on the remote server. We'll use this function with different parameters after we've tracked down the directory with our candidate files.
The directory_list() function opens a context using an ftplib.FTP object. We don't need to use the contextlib.closing() function, because this context is well behaved. This object will manage the various sockets used to exchange data with the FTP server. One of the methods, getwelcome(), retrieves any welcome message. We'll see that this one is pretty short; sometimes, they're more elaborate.
We'll dump the top-level directory information that shows the various files, directories, and their sizes. The details['type'] syntax is how we pick a particular name out of the name-value pairs in a dictionary. The details.get('size') syntax does a similar thing. Getting an item with [] will raise an exception if the name is not found. Getting an item with the get() method supplies a default value instead of an exception. Unless specified otherwise, the default value is None.
We're making the claim that the details dictionary must have a type item. If it doesn't, the program will crash, because something's very wrong. We're also making the claim that the details dictionary might or might not have the size item. If the size isn't present, the None value will do instead.
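A quick interactive experiment shows the difference between the two kinds of access:

>>> details = {'type': 'file'}
>>> details['type']
'file'
>>> details.get('size')      # missing key: returns None, which the REPL doesn't echo
>>> details.get('size', 0)   # or we can supply our own default
0
>>> details['size']
Traceback (most recent call last):
  ...
KeyError: 'size'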
There are a number of files here. The README and GUTINDEX.ALL files look promising; let's examine them.
The FTP library relies on a technique called a callback function to support incremental processing. Downloading a 13 MB file takes some time. Having our computer just doze off while downloading is impolite. It's good to provide some ongoing status with respect to progress (or lack thereof).
We can define callback functions in a number of ways. If we're going to use class definitions, the callback function will simply be another method of the class. Class definitions get a bit beyond the scope of our book. They're quite simple, but we have to focus on espionage, not software design. Here's a general-purpose get() function:
import sys

def get(fullname, output=sys.stdout):
    download = 0
    expected = 0
    dots = 0
    def line_save(aLine):
        nonlocal download, expected, dots
        print(aLine, file=output)
        if output != sys.stdout:
            download += len(aLine)
            show = (20 * download) // expected
            if show > dots:
                print("-", end="", file=sys.stdout)
                sys.stdout.flush()
                dots = show
    with ftplib.FTP(host, user="anonymous") as connection:
        print("Welcome", connection.getwelcome())
        expected = connection.size(fullname)
        print("Getting", fullname, "to", output, "size", expected)
        connection.retrlines("RETR {0}".format(fullname), line_save)
        if output != sys.stdout:
            print()  # end the "dots"

The get() function contains a function definition buried inside it. The line_save() function is the callback function that's used by the retrlines() function of an FTP connection. Each line of data from the server will be passed to the line_save() function to process it.
Our line_save() function uses three nonlocal variables: download, expected, and dots. These variables are neither global nor are they local to the line_save() function. They're initialized before any lines are downloaded, and they are updated within the line_save() function on a line-by-line basis. As they are a saved state for the line_save() function, we need to notify Python not to create local variables when these are used in an assignment statement.
The function's primary job is to print the line to the file named in the output variable. Interestingly, the output variable is also nonlocal. As we never try to assign a new value to this variable, we don't need to notify Python about its use in an assignment statement. A function has read access to nonlocal variables; write access requires special arrangements via the global or nonlocal statements.
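Here's a tiny, self-contained illustration of the nonlocal statement, separate from the FTP example:

def counter():
    count = 0
    def increment():
        nonlocal count  # without this, the assignment below would create a new local variable
        count += 1
        return count
    return increment

tick = counter()
print(tick(), tick(), tick())  # 1 2 3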
If the output file is sys.stdout, we're displaying the file on the console. Writing status information is just confusing. If the output file is not sys.stdout, we're saving the file. Showing some status is helpful.
We compute how many dots (from 0 to 19) to show. If the number of dots has increased, we'll print another dash. Yes, we called the variable dots but decided to print dashes. Obscurity is never a good thing. You might want to take an independent mission and write your own version, which is clearer than this.
The get() function creates a context using an ftplib.FTP object. This object will manage the various sockets used to exchange data with the FTP server. We use the getwelcome() method to get the welcome message. We use the size() method to get the size of the file we're about to request. By setting the expected variable, we can assure that up to 20 dashes are displayed to show the state of the download.
The retrlines() method of the connection requires an FTP command and a callback function. It sends the command; each line of the response is sent to the callback function.
We can use this get() function to download files from the server. We'll start with two examples of extracting files from an FTP server:
# show the README on sys.stdout
get(root + "README")

# get GUTINDEX.ALL
with open("GUTINDEX.ALL", "w", encoding="UTF-8") as output:
    get(root + "GUTINDEX.ALL", output)

The first example is a small file. We'll display the README file, which might have useful information. It's usually small, and we can write it to stdout immediately. The second example will open a file processing context to save the large GUTINDEX.ALL file locally for further analysis. It's quite large, and we certainly don't want to display it immediately. We can search this index file for CIA World Factbooks. There are several Factbooks.
The introduction to the GUTINDEX.ALL file describes how document numbers turn into directory paths. One of the CIA World Factbooks, for example, is document number 35830. This becomes the directory path 3/5/8/3/35830/. The document will be in this directory.
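If we find ourselves doing this often, a small hypothetical helper can build the path for us: every digit except the last becomes a directory level, followed by the full document number.

def document_path(number):
    """Hypothetical helper: turn a Gutenberg document number into its directory path."""
    digits = str(number)
    return "/".join(digits[:-1]) + "/" + digits + "/"

print(document_path(35830))  # 3/5/8/3/35830/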
We can use our directory_list() function to see what else is there:
directory_list( root+"3/5/8/3/35830/" )
This will show us that there are several subdirectories and a ZIP file that appears to have images. We'll start with the text document. We can use our get() function to download the CIA Factbook in a script like the following:
with open("35830.txt", "w", encoding="UTF-8") as output:
get(root+"3/5/8/3/35830/"+"35830.txt", output)This gets us one of the CIA World Factbooks. We can easily track down the others. We can then analyze information from these downloaded documents.
The urllib package wraps HTTP, FTP, and local file access in a single, tidy package. In the most common situations, this package allows us to elide some of the processing details we saw in the previous examples.
The advantage of the general approach in urllib is that we can write smallish programs that can work with data from a wide variety of locations. We can rely on urllib to work with HTTP, FTP, or local files seamlessly. The disadvantage is that we can't do some more complex HTTP or FTP interactions. Here's an example of downloading two images with the urllib version of the HTTP get function:
import urllib.request

url_list = [
    "http://upload.wikimedia.org/wikipedia/commons/7/72/IPhone_Internals.jpg",
    "http://upload.wikimedia.org/wikipedia/en/2/26/Common_face_of_one_euro_coin.jpg",
]
for url in url_list:
    with urllib.request.urlopen(url) as response:
        print("Status:", response.status)
        _, _, filename = response.geturl().rpartition("/")
        print("Writing:", filename)
        with open(filename, "wb") as image:
            image.write(response.read())

We've defined two URLs. When using urllib, we can provide full URLs without having to distinguish between the host and the path we're trying to access.
We create a context using urllib.request.urlopen(). This context will contain all of the resources used for getting the file from the World Wide Web. The response object is called a file-like object in Python parlance. We can use it the way we'd use a file: it supports read() and readline() methods. It can be used in a for statement to iterate over lines of a text file.
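For example, we can loop over the lines of a small text or HTML resource directly. The URL here is purely illustrative, and each line arrives as bytes that we must decode ourselves:

import urllib.request

# Iterate over a response the way we'd iterate over a file.
with urllib.request.urlopen("http://www.example.org/") as response:
    for line in response:
        print(line.decode("utf-8").rstrip())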
We can use a simple urllib.request to get a file via FTP. We can simply change the URL to reflect the protocol we're using. Something like this works well:
import sys
import urllib.request

readme = "ftp://ftp.ibiblio.org/pub/docs/books/gutenberg/README"
with urllib.request.urlopen(readme) as response:
    sys.stdout.write(response.read().decode("ascii"))

This will open the source file and print it on sys.stdout. Note that we had to decode the bytes from ASCII to create proper Unicode characters for use by Python. We can print the other status and header information if we find it necessary.
We can also use a local file URL. The schema is file: instead of http: or ftp:. Generally, the hostname is omitted, thus leading to file URLs like this:
local= "file:///Users/slott/Documents/Writing/Secret Agent's Python/currency.html"
Using urllib leads to a few pleasant simplifications. We can treat resources located across the WWW with code that's similar to handling a local file. Remote resources are often slower than local files; we might want to give up waiting after a period of time. Also, there's the possibility of network disconnections. Our error handling needs to be more robust when working with remote data.
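Here's a sketch of a slightly more defensive download: we give up after a ten-second wait and report network problems instead of crashing. The URL and timeout value are illustrative choices, not requirements:

import urllib.request
import urllib.error

url = "http://upload.wikimedia.org/wikipedia/commons/7/72/IPhone_Internals.jpg"
try:
    # The timeout parameter limits how long we'll wait for the server.
    with urllib.request.urlopen(url, timeout=10) as response:
        data = response.read()
    print("Got", len(data), "bytes")
except urllib.error.URLError as problem:
    print("Could not retrieve", url, ":", problem)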