This chapter will first show you how to use Python to get information from the APIs that organizations now use to share data, and then highlight the tools that most Python-powered organizations use to support communication within their own infrastructure.
We already discussed Python’s support for pipes and queues across processes in “Multiprocessing”. Communicating between computers requires that the computers at both ends of the conversation use a defined set of protocols—the Internet adheres to the TCP/IP suite.1 You can implement UDP yourself over sockets; Python provides the ssl library for TLS/SSL wrappers over sockets, and asyncio to implement asynchronous transports for TCP, UDP, TLS/SSL, and subprocess pipes.
But most of us will be using the higher-level libraries that provide
clients implementing various application-level protocols:
ftplib, poplib, imaplib, nntplib, smtplib, telnetlib, and xmlrpc.
All of them provide classes for both regular and TLS/SSL wrapped clients
(and urllib exists for HTTP requests, but its documentation recommends the Requests library for most uses).
The first section in this chapter covers HTTP requests—how to get data from public APIs on the Web. Next is a brief aside about serialization in Python, and the third section describes popular tools used in enterprise-level networking. We’ll try to explicitly say when something is only available in Python 3. If you’re using Python 2 and can’t find a module or class we’re talking about, we recommend checking this list of changes between the Python 2 and Python 3 Standard Libraries.
The Hypertext Transfer Protocol (HTTP) is an application protocol for distributed, collaborative, hypermedia information systems and is the foundation of data communication for the World Wide Web. We’re focusing this entire section on how to get data from the Web using the Requests library.
Python’s standard urllib module provides most of the HTTP capabilities you
need, but at a low level, that requires quite a bit of work to perform
seemingly simple tasks (like getting data from an HTTPS server that requires
authentication). The documentation for the urllib.request module
actually says to use the Requests library instead.
Requests takes all of the work out of
Python HTTP requests—making your integration
with web services seamless. There’s no need to manually add query strings to
your URLs, or to form-encode your POST data. Keep-alive (persistent HTTP connections)
and HTTP connection pooling are available through the requests.sessions.Session class,
powered by urllib3,
which is embedded within Requests (meaning you don’t need to install it separately).
Get it using pip:
$ pip install requests
The Requests documentation goes into more detail than what we’ll cover next.
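As a minimal sketch of both conveniences, here the params argument builds the query string for you, and a Session reuses the underlying connection across requests (the spam parameter and the credentials are hypothetical, purely for illustration):

import requests

# Requests builds the query string from a dict—no manual URL encoding
response = requests.get('http://pypi.python.org/pypi/requests/json',
                        params={'spam': 'eggs'})  # hypothetical parameter

# A Session provides keep-alive and connection pooling via the
# embedded urllib3, reusing the TCP connection across requests
with requests.Session() as session:
    session.auth = ('user', 'pass')  # hypothetical credentials
    response = session.get('https://httpbin.org/get')
    print(response.status_code)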
Nearly everybody, from the US Census to the Dutch National Library, has an API that you can use to get the data they want to share; and some, like Twitter and Facebook, allow you (or the apps you use) to also modify that data. You may hear the term RESTful API. REST stands for representational state transfer—it is a paradigm that informed how HTTP 1.1 was designed, but is not a standard, protocol, or requirement. Still, most web service API providers follow the RESTful design principles. We’ll use some code to illustrate common terms:
import requests
result = requests.get('http://pypi.python.org/pypi/requests/json')

The method is part of the HTTP protocol. In a RESTful API, the API designer chooses what action the server will take, and tells you in their API documentation. Here is a list of all of the methods, but the ones commonly available in RESTful APIs are GET, POST, PUT, and DELETE.
Usually, these “HTTP verbs” do what their meaning implies—GET gets data, POST and PUT change or create it, and DELETE deletes it.

The base URI is the root of the API.

The path specifies the particular element the client wants data about.

And there may be an option for different media types.
That code actually performed an HTTP request to http://pypi.python.org/pypi/requests/json, which is the JSON backend for PyPI. If you look at it in your browser, you will see a large JSON string. In Requests, the return value of an HTTP request is a Response object:
>>> import requests
>>> response = requests.get('http://pypi.python.org/pypi/requests/json')
>>> type(response)
<class 'requests.models.Response'>
>>> response.ok
True
>>> response.text  # This gives all of the text of the response
>>> response.json()  # This converts the text response into a dictionary
PyPI gave us the text in JSON format. There isn’t a rule about the format to send data in, but many APIs use JSON or XML.
JavaScript Object Notation (JSON) is exactly what it says—the notation used to define objects in JavaScript. The Requests library has a JSON parser built into its Response object.
The json library can parse JSON from strings or files into a Python dictionary (or list, as appropriate). It can also convert Python dictionaries or lists into JSON strings. For example, the following string contains JSON data:
json_string = '{"first_name": "Guido", "last_name": "van Rossum"}'
It can be parsed like this:
import json
parsed_json = json.loads(json_string)
and can now be used as a normal dictionary:
print(parsed_json['first_name'])
"Guido"
You can also convert the following to JSON:
d = {
    'first_name': 'Guido',
    'last_name': 'van Rossum',
    'titles': ['BDFL', 'Developer'],
}

print(json.dumps(d))
'{"first_name": "Guido", "last_name": "van Rossum", "titles": ["BDFL", "Developer"]}'
There is an XML parser in the Standard Library (the parse() and fromstring() functions in xml.etree.ElementTree),
but this uses the Expat library and
creates an ElementTree object that preserves the structure of the XML, meaning
we have to iterate down it and look into its children to get content.
When all you want is to get the data, try
either untangle or xmltodict. You can get both using pip:
$ pip install untangle
$ pip install xmltodict
untangle takes an XML document and returns a Python object whose structure mirrors the nodes and attributes. For example, an XML file like this:
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <child name="child1"/>
</root>
can be loaded like this:
import untangle
obj = untangle.parse('path/to/file.xml')
and then you can get the child element’s name like this:
obj.root.child['name']  # is 'child1'
xmltodict converts the XML to a dictionary. For example, an XML file like this:
<mydocument has="an attribute">
    <and>
        <many>elements</many>
        <many>more elements</many>
    </and>
    <plus a="complex">element as well</plus>
</mydocument>
can be loaded into an OrderedDict instance (from the collections module in Python’s Standard Library) like this:
import xmltodict

with open('path/to/file.xml') as fd:
    doc = xmltodict.parse(fd.read())
and then you can access elements, attributes, and values like this:
doc['mydocument']['@has']  # is u'an attribute'
doc['mydocument']['and']['many']  # is [u'elements', u'more elements']
doc['mydocument']['plus']['@a']  # is u'complex'
doc['mydocument']['plus']['#text']  # is u'element as well'
With xmltodict, you can also roundtrip the dictionary back to XML with the
unparse() function. It has a streaming mode suitable for handling
files that don’t fit in memory, and it supports namespaces.
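For instance, here is a minimal sketch of both features (the file paths and the handle_record callback are hypothetical; the streaming call assumes the records you care about sit at depth 2 in the tree):

import xmltodict

with open('path/to/file.xml', 'rb') as fd:
    doc = xmltodict.parse(fd.read())

# Round-trip the dictionary back to an XML string
xml_string = xmltodict.unparse(doc)

# Streaming mode: process each record as it is parsed, without
# holding the whole document in memory
def handle_record(path, item):
    print(item)
    return True  # True means "keep streaming"

with open('path/to/big_file.xml', 'rb') as fd:
    xmltodict.parse(fd, item_depth=2, item_callback=handle_record)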
Websites don’t always provide their data in comfortable formats such as CSV or JSON, but HTML is also structured data—this is where web scraping comes in.
Web scraping is the practice of using a program to sift through a web page and gather the data you need, in a format most useful to you, while preserving the structure of the data.
More and more now, as sites offer APIs, they explicitly request you to not scrape their data—the API presents the data they are willing to share, and that’s it. Before getting started, check around the website you’re looking at for a Terms of Use statement, and be a good citizen of the Web.
lxml is a pretty extensive library written for parsing
XML and HTML documents very quickly, even handling some amount of incorrectly formatted markup in the process. Get it using pip:
$ pip install lxml
Use requests.get to retrieve the web page with our data,
parse it using the html module, and save the results in tree:
from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)

This is a real web page, and the data we show are real—you can visit the page in your browser.

We use page.content rather than page.text because
html.fromstring() implicitly expects bytes as input.
Now, tree contains the whole HTML file in a nice tree structure that
we can go over in two different ways: XPath
or CSS selectors (via lxml's CSSSelect). They are both standard
ways to specify a path through an HTML tree, defined and maintained
by the World Wide Web Consortium (W3C), and implemented as modules in lxml.
In this example, we will use XPath.
A good introduction is
W3Schools XPath tutorial.
There are also various tools for obtaining the XPath of elements from inside your web browser, such as Firebug for Firefox or the Chrome Inspector. If you’re using Chrome, you can right-click an element, choose “Inspect element”, highlight the code, right-click again and choose “Copy XPath”.
After a quick analysis, we see that in our page the data is contained in
two elements—one is a div with title buyer-name, and the other is a
span with the class item-price:
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>
Knowing this, we can create the correct XPath query and use lxml’s
xpath() method like this:
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
Let’s see what we got exactly:
>>> print('Buyers: ', buyers)
Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
>>>
>>> print('Prices: ', prices)
Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', '$13.99',
'$31.57', '$8.49', '$14.47', '$15.86', '$11.11', '$15.98', '$16.27',
'$7.50', '$50.85', '$14.26', '$5.68', '$15.00', '$114.07', '$10.09']
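If you prefer the CSS selector route mentioned earlier, the equivalent queries look like this—a sketch assuming the cssselect package is installed alongside lxml:

from lxml.cssselect import CSSSelector

# The same two queries, expressed as CSS selectors
sel_buyers = CSSSelector('div[title="buyer-name"]')
buyers = [div.text for div in sel_buyers(tree)]

sel_prices = CSSSelector('span.item-price')
prices = [span.text for span in sel_prices(tree)]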
Data serialization is the concept of converting structured data into a format that allows it to be shared or stored—retaining the information necessary to reconstruct the object in memory at the receiving end of the transmission (or upon read from storage). In some cases, the secondary intent of data serialization is to minimize the size of the serialized data, which then minimizes disk space or bandwidth requirements.
The sections that follow cover the Pickle format, which is specific to Python, some cross-language serialization tools, compression options in Python’s Standard Library, and Python’s buffer protocol, which can reduce the number of times an object is copied before transmission.
The native data serialization module for Python is called Pickle. Here’s an example:
import pickle

# Here's an example dict
grades = {'Alice': 89, 'Bob': 72, 'Charles': 87}

# Use dumps to convert the object to a serialized string
serial_grades = pickle.dumps(grades)

# Use loads to de-serialize an object
received_grades = pickle.loads(serial_grades)
Some things cannot be pickled—lambda functions, nested functions and classes, and ephemeral things like open files and pipes.
According to Python’s Pickle documentation, “The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.”
If you’re looking for a serialization module that has support in multiple languages, two popular options are Google’s Protobuf and Apache’s Avro.
Also, Python’s Standard Library includes xdrlib to pack and unpack Sun’s External Data Representation (XDR) format, which is independent of operating system and transport protocol. It’s much lower level than the preceding options and just concatenates packed bytes together, so both the client and server must know the type and order of packing. Here’s an example of what a server receiving data in XDR format could look like:
import socketserver
import xdrlib

class XdrHandler(socketserver.BaseRequestHandler):
    def handle(self):
        data = self.request.recv(4)
        unpacker = xdrlib.Unpacker(data)
        message_size = unpacker.unpack_uint()
        data = self.request.recv(message_size)
        unpacker.reset(data)
        print(unpacker.unpack_string())
        print(unpacker.unpack_float())
        self.request.sendall(b'ok')

server = socketserver.TCPServer(('localhost', 12345), XdrHandler)
server.serve_forever()

The data could be of variable length, so we added a packed unsigned integer (4 bytes) with the message size first.

We had to already know we were receiving an unsigned int.

Read the rest of the message on this line first,…

…and on the next line, reset the unpacker with the new data.

We must know a priori that we’ll receive one string and then one float.
Of course, if both sides were actually Python programs, you’d be using pickle. But if the server were something totally different, this would be the corresponding code for a client sending the data:
import socket
import xdrlib

p = xdrlib.Packer()
p.pack_string(b'Thanks for all the fish!')  # xdrlib packs bytes, not str
p.pack_float(42.00)
xdr_data = p.get_buffer()

message_length = len(xdr_data)
p.reset()
p.pack_uint(message_length)
len_plus_data = p.get_buffer() + xdr_data

with socket.socket() as s:
    s.connect(('localhost', 12345))
    s.sendall(len_plus_data)
    if s.recv(1024):
        print('success')
Python’s Standard Library also contains support for data compression and decompression using the zlib, gzip, bzip2, and lzma algorithms, and for the creation of ZIP- and tar-format archives. To gzip a pickle, for example:
import pickle
import gzip

data = "my very big object"

# To zip and pickle:
with gzip.open('spam.zip', 'wb') as my_zip:
    pickle.dump(data, my_zip)

# And to unzip and unpickle:
with gzip.open('spam.zip', 'rb') as my_zip:
    unpickled_data = pickle.load(my_zip)
Eli Bendersky, one of Python’s core developers, wrote a blog post about reducing the number of in-memory copies Python makes of the same data by using memory buffers. With his technique, you can even read from a file or socket into an existing buffer. For more information, see Python’s buffer protocol documentation and PEP 3118, which suggested enhancements that were implemented in Python 3 and backported to Python 2.6 and above.
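As a minimal sketch of the idea, reading into a preallocated buffer with readinto() avoids creating a new bytes object for every read, and a memoryview slices it without copying ('data.bin' is a hypothetical file):

buf = bytearray(4096)       # preallocate a reusable buffer
view = memoryview(buf)      # zero-copy window onto the buffer

with open('data.bin', 'rb') as f:
    n = f.readinto(buf)     # fill the existing buffer in place

chunk = view[:n]            # slice without copying the underlying bytes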
Distributed computer systems collectively accomplish a task (like game play, or an Internet chat room, or a Hadoop calculation) by passing information to each other. This section first lists our most popular libraries for common networking tasks, and then discusses cryptography, which comes hand in hand with this kind of communication.
In Python, communication for connected networks is usually handled with asynchronous tools or threads, to get around the single-thread limitation of the Global Interpreter Lock. All of the libraries in Table 9-1 solve the same problem—getting around the GIL—in different ways and with varying amounts of additional features.
| Library | License | Reasons to use |
|---|---|---|
| asyncio | PSF license | In the Standard Library as of Python 3.4; provides an event loop plus a formalized implementation of coroutines |
| gevent | MIT license | Lightweight and tightly coupled to libev for high performance; integrates asynchronous I/O and greenlets |
| Twisted | MIT license | Mature, with a loyal community; asynchronous implementations of many protocols built in |
| PyZMQ | LGPL (ZMQ) and BSD (PyZMQ) | Asynchronous sockets with queues and configurable I/O patterns; message queuing without a dedicated broker |
| pika | BSD license | Lightweight, pure-Python AMQP 0-9-1 client for RabbitMQ |
| Celery | BSD license | Featureful AMQP client; task queues over RabbitMQ or Redis, with result tracking and the Flower monitoring tool |
asyncio was introduced in Python 3.4 and includes ideas learned from the developer communities, like those maintaining Twisted and gevent. It’s a concurrency tool, and a frequent application of concurrency is for network servers. Python’s own documentation for asyncore (a predecessor to asyncio) states:
There are only two ways to have a program on a single processor do “more than one thing at a time.” Multi-threaded programming is the simplest and most popular way to do it, but there is another very different technique, that lets you have nearly all the advantages of multi-threading, without actually using multiple threads. It’s really only practical if your program is largely I/O bound. If your program is processor bound, then pre-emptive scheduled threads are probably what you really need. Network servers are rarely processor bound, however.
asyncio is still only in the Python Standard Library on a provisional basis—the API may change in backward-incompatible ways—so don’t get too attached.
Not all of it is new—asyncore (deprecated in Python 3.4) had an event loop,
asynchronous sockets2 and asynchronous file I/O, and asynchat (also
deprecated in Python 3.4) had asynchronous queues.3
The big thing asyncio adds is a formalized implementation of coroutines.
In Python, this is formally defined as both a coroutine function—a function definition beginning
with async def rather than just def (or one that uses the older syntax and is decorated
with @asyncio.coroutine)—and the object obtained by calling a coroutine function (which usually
represents some sort of computation or I/O operation).
The coroutine can yield the processor and thus
be able to participate in an asynchronous event loop, taking turns with
other coroutines.
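To make the idea concrete, here is a minimal sketch using the Python 3.5 async def syntax (the fetch name is hypothetical, and asyncio.sleep stands in for a real I/O wait): two coroutines take turns on one event loop, so both finish in about one second rather than two.

import asyncio

async def fetch(name, delay):
    # Yield control to the event loop while "waiting" on I/O
    await asyncio.sleep(delay)
    return '{} done'.format(name)

loop = asyncio.get_event_loop()
tasks = [fetch('first', 1), fetch('second', 1)]
print(loop.run_until_complete(asyncio.gather(*tasks)))
loop.close()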
The documentation has pages and pages of detailed examples to help the community, as it’s a new concept for the language. It’s clear, thorough, and very much worth checking out. In this interactive session, we just want to show the functions for the event loop and some of the classes available:
>>> import asyncio
>>>
>>> [l for l in asyncio.__all__ if 'loop' in l]
['get_event_loop_policy', 'set_event_loop_policy', 'get_event_loop',
 'set_event_loop', 'new_event_loop']
>>>
>>> [t for t in asyncio.__all__ if t.endswith('Transport')]
['BaseTransport', 'ReadTransport', 'WriteTransport', 'Transport',
 'DatagramTransport', 'SubprocessTransport']
>>>
>>> [p for p in asyncio.__all__ if p.endswith('Protocol')]
['BaseProtocol', 'Protocol', 'DatagramProtocol', 'SubprocessProtocol',
 'StreamReaderProtocol']
>>>
>>> [q for q in asyncio.__all__ if 'Queue' in q]
['Queue', 'PriorityQueue', 'LifoQueue', 'JoinableQueue', 'QueueFull',
 'QueueEmpty']
gevent is a coroutine-based Python networking library that uses greenlets to provide a high-level synchronous API on top of the C library libev event loop. Greenlets are based on the greenlet library—miniature green threads (or user-level threads, as opposed to threads controlled by the kernel) that the developer has the freedom to explicitly suspend, jumping between greenlets. For a great deep dive into gevent, check out Kavya Joshi’s seminar “A Tale of Concurrency Through Creativity in Python.”
People use gevent because it is lightweight and tightly coupled to its underlying
C library, libev, for high performance. If you like the idea of integrating
asynchronous I/O and greenlets, this is the library to use.
Get it using pip:
$ pip install gevent
Here’s an example from the greenlet documentation:
>>> import gevent
>>>
>>> from gevent import socket
>>> urls = ['www.google.com', 'www.example.com', 'www.python.org']
>>> jobs = [gevent.spawn(socket.gethostbyname, url) for url in urls]
>>> gevent.joinall(jobs, timeout=2)
>>> [job.value for job in jobs]
['74.125.79.106', '208.77.188.166', '82.94.164.162']
The documentation offers many more examples.
Twisted is an event-driven networking
engine. It can be used to build applications around many different networking
protocols, including HTTP servers and clients, applications using SMTP, POP3,
IMAP, or SSH protocols, instant messaging,
and much more.
Install it using pip:
$ pip install twisted
Twisted has been around since 2002 and has a loyal community.
It’s like the Emacs of coroutine libraries—with everything
built in—because all of these things have to be asynchronous to work
together.
Probably the most useful tools are an asynchronous
wrapper for database connections (in twisted.enterprise.adbapi),
a DNS server (in twisted.names), direct access to packets (in twisted.pair),
and additional protocols like AMP, GPS, and SOCKSv4 (in twisted.protocols).
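To give a flavor of the event-driven style, here is a minimal TCP echo server in the spirit of Twisted's introductory examples—a sketch, not production code:

from twisted.internet import protocol, reactor

class Echo(protocol.Protocol):
    def dataReceived(self, data):
        # Called by the reactor whenever bytes arrive; echo them back
        self.transport.write(data)

class EchoFactory(protocol.Factory):
    def buildProtocol(self, addr):
        return Echo()

reactor.listenTCP(8000, EchoFactory())  # port 8000 is arbitrary
reactor.run()  # hand control to the event loop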
Most of Twisted now works with Python 3—when you pip install in a Python 3 environment, you’ll get
everything that’s currently been ported. If you find something you want in the
API
that’s not yet in your Python 3 installation of Twisted, you’ll need to stay on Python 2.7.
For more information, consult Jessica McKellar and Abe Fettig’s Twisted (O’Reilly). In addition, this webpage shows over 42 Twisted examples, and this one shows its latest speed benchmarks.
PyZMQ is the Python binding for
ZeroMQ. You can get it using pip:
$ pip install pyzmq
ØMQ (also spelled ZeroMQ, 0MQ, or ZMQ) describes itself as a messaging library designed to have a familiar socket-style API, and aimed at use in scalable distributed or concurrent applications. Basically, it implements asynchronous sockets with queues attached and provides a custom list of socket “types” that determine how the I/O on each socket behaves. Here’s an example:
import zmq

context = zmq.Context()
server = context.socket(zmq.REP)
server.bind('tcp://127.0.0.1:5000')

while True:
    message = server.recv().decode('utf-8')
    print('Client said: {}'.format(message))
    server.send(bytes("I don't know.", 'utf-8'))

# ~~~~~ and in another file ~~~~~

import zmq

context = zmq.Context()
client = context.socket(zmq.REQ)
client.connect('tcp://127.0.0.1:5000')

client.send(bytes("What's for lunch?", 'utf-8'))
response = client.recv().decode('utf-8')
print('Server replied: {}'.format(response))

The socket type zmq.REP corresponds to their “request-response” paradigm.

Like with normal sockets, you bind the server to an IP and port.

The client type is zmq.REQ—that’s all; ZMQ defines a number of these
as constants: zmq.REQ, zmq.REP, zmq.PUB, zmq.SUB, zmq.PUSH, zmq.PULL, and zmq.PAIR.
They determine how the socket sends and receives.

As usual, the client connects to the server’s bound IP and port.
So, these look and quack like sockets, enhanced with queues and various I/O patterns. The point of the patterns is to provide the building blocks for a distributed network. The basic patterns for the socket types are:
zmq.REQ and zmq.REP
connect a set of clients to a set of services. This can be for a
remote procedure call pattern or a task distribution pattern.
zmq.PUB and zmq.SUB
connect a set of publishers to a set of subscribers.
This is a data distribution pattern—one node
is distributing data to other nodes, or this can
be chained to fan out into a distribution tree (a sketch follows this list).
zmq.PUSH and zmq.PULL
connect nodes in a fan-out/fan-in pattern that
can have multiple steps, and loops. This is a parallel
task distribution and collection pattern.
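As an illustration of the publish/subscribe pattern, here is a minimal sketch (the port and the 'weather' topic are arbitrary; PUB/SUB joins are asynchronous, so real code would give subscribers time to connect before publishing):

import zmq

context = zmq.Context()
publisher = context.socket(zmq.PUB)
publisher.bind('tcp://127.0.0.1:5001')
publisher.send(b'weather Partly cloudy')  # topic prefix, then payload

# ~~~~~ and in another file ~~~~~

import zmq

context = zmq.Context()
subscriber = context.socket(zmq.SUB)
subscriber.connect('tcp://127.0.0.1:5001')
subscriber.setsockopt(zmq.SUBSCRIBE, b'weather')  # filter by topic prefix
print(subscriber.recv())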
One great advantage of ZeroMQ over message-oriented middleware is that it can be used for message queuing without a dedicated message broker. PyZMQ’s documentation notes some enhancements they added, like tunneling via SSH. The rest of the documentation for the ZeroMQ API is better on the main ZeroMQ guide.
RabbitMQ is an open source message broker software that implements the Advanced Message Queuing Protocol (AMQP). A message broker is an intermediary program that receives messages from senders and sends them to receivers according to a protocol. Any client that also implements AMQP can communicate with RabbitMQ. To get RabbitMQ, go to the RabbitMQ download page, and follow the instructions for your operating system.
Client libraries that interface with the broker are available for all major programming languages. The top two for Python are pika and Celery—either can be installed with pip:
$ pip install pika
$ pip install celery
pika is a lightweight, pure-Python AMQP 0-9-1 client, preferred by RabbitMQ. RabbitMQ’s introductory tutorials for Python use pika. There’s also an entire page of examples to learn from. We recommend playing with pika when you first set up RabbitMQ, regardless of your final library choice, because it is straightforward without the extra features and so crystallizes the concepts.
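As a taste, here is a minimal producer in the style of RabbitMQ's first Python tutorial—a sketch assuming a broker running on localhost, with the queue name 'hello' chosen arbitrarily:

import pika

# Connect to a RabbitMQ broker assumed to be running locally
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

channel.queue_declare(queue='hello')  # idempotent—creates the queue if absent
channel.basic_publish(exchange='', routing_key='hello', body='Hello World!')

connection.close()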
Celery is a much more featureful AMQP client—it can use either RabbitMQ or Redis (a distributed in-memory data store) as a message broker, can track the tasks and results (and optionally store them in a user-selected backend), and has a web administration tool/task monitor, Flower. It is popular in the web development community, and there are integration packages for Django, Pyramid, Pylons, web2py, and Tornado (Flask doesn’t need one). Start with the Celery tutorial.
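In the style of that tutorial, a minimal Celery setup looks like this (the module name tasks and the broker URL are assumptions—substitute your own):

# tasks.py
from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')

@app.task
def add(x, y):
    return x + y

After starting a worker (celery -A tasks worker), callers queue work with add.delay(4, 4), and the worker picks it up asynchronously.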
In 2013, the Python Cryptographic Authority (PyCA) was formed. They are a group of developers all interested in providing high-quality cryptography libraries to the Python community.4 They provide tools to encrypt and decrypt messages given the appropriate keys, and cryptographic hash functions to irreversibly but repeatably obfuscate passwords or other secret data.
Except for pyCrypto and the Standard Library’s own modules, all of the libraries in Table 9-2 are maintained by the PyCA. Almost all are built on the C library OpenSSL, except when noted.
| Option | License | Reason to use |
|---|---|---|
| ssl and hashlib | Python Software Foundation license | In the Standard Library: TLS/SSL-wrapped sockets, plus hash and key-derivation functions for passwords and checksums |
| pyOpenSSL | Apache v2.0 license | Built against the newest OpenSSL; the usual choice when building a server |
| PyNaCl | Apache v2.0 license | CFFI bindings for libsodium (see note a), maintained under the PyCA umbrella; bundles libsodium so nothing else needs installing |
| libnacl | Apache license | ctypes-based interface to libsodium (see note a) |
| cryptography | Apache v2.0 license | Low-level cryptographic recipes and primitives; most users should prefer the higher-level pyOpenSSL |
| pyCrypto | Public Domain | For years the de facto cryptography library for Python; you’ll see it in older code |
| bcrypt | Apache v2.0 license (see note b) | Provides the bcrypt password-hashing algorithm; compatible with py-bcrypt |
a. libsodium is a fork of the Networking and Cryptography library (NaCl, pronounced “salt”); its philosophy is to curate specific algorithms that are performant and easy to use.
b. The library actually contains the C source code and builds it on installation using the C Foreign Function Interface (CFFI) we described earlier. bcrypt is based on the Blowfish encryption algorithm.
The following sections provide additional details about the libraries listed in Table 9-2.
The ssl module
in Python’s Standard Library provides a socket API (ssl.SSLSocket)
that behaves like a standard socket, but is wrapped by the SSL protocol,
plus ssl.SSLContext, which contains an SSL connection’s
configuration. The http module (httplib in Python 2) also uses
it for HTTPS support.
If you’re using Python 3.5, you also have
memory BIO support—so the socket writes I/O to a buffer instead of its destination, enabling
things like hooks for hex encoding/decoding before write/upon read.
Major security enhancements happened in Python 3.4—detailed in the release notes—to support newer transport protocols and hash algorithms. These issues were so important that they were backported to Python 2.7 as described in PEP 466 and PEP 476. You can learn all about them in Benjamin Peterson’s talk about the state of ssl in Python.
If you’re using Python 2.7, be sure you have at least 2.7.9, or that your version at least has incorporated PEP 476—so that by default HTTP clients will perform certificate verification when connecting using the https protocol. Or, just always use Requests because that has always been its default.
The Python team recommends using the SSL defaults if you have no special requirements for your security policy for client use. This example showing a secure mail client is from the section within the documentation for the ssl library, “Security considerations,” which you should read if you’re going to use the library:
>>> import ssl, smtplib
>>> smtp = smtplib.SMTP("mail.python.org", port=587)
>>> context = ssl.create_default_context()
>>> smtp.starttls(context=context)
(220, b'2.0.0 Ready to start TLS')
To confirm that a message didn’t get corrupted during transmission,
use the hmac module, which implements the Keyed-Hashing for Message Authentication (HMAC)
algorithm described in
RFC 2104.
It works with a message hashed with any of the algorithms in the set
hashlib.algorithms_available.
For more, see the Python
Module of the Week’s hmac example.
And if it’s available, hmac.compare_digest()
provides a constant-time comparison between digests to help
protect against timing attacks—where the attacker tries to
learn something from how long the digest comparison takes to run.
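Here is a minimal sketch of signing and verifying a message (the key and payload are made up for illustration):

import hashlib
import hmac

secret_key = b'shared-secret'          # hypothetical shared key
message = b'the transmitted payload'   # hypothetical message

# The sender attaches this digest to the message
digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()

# The receiver recomputes it and compares in constant time
expected = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
print(hmac.compare_digest(digest, expected))  # True if the message is intact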
Python’s hashlib module can be used to generate hashed passwords for secure storage or checksums to confirm data integrity during transmission. The Password-Based Key Derivation Function 2 (PBKDF2), recommended in NIST Special Publication 800-132, is currently considered one of the best options for password hashing. Here’s an example use of the function using a salt5 and 10,000 iterations of the Secure Hash Algorithm 256-bit hash (SHA-256) to generate a hashed password (the choices for different hash algorithms or iterations let the programmer balance robustness with a desired response speed):
import os
import hashlib

def hash_password(password, salt_len=16, iterations=10000, encoding='utf-8'):
    salt = os.urandom(salt_len)
    hashed_password = hashlib.pbkdf2_hmac(
        hash_name='sha256',
        password=bytes(password, encoding),
        salt=salt,
        iterations=iterations)
    return salt, iterations, hashed_password
The secrets library was proposed in PEP 506 and will be available starting with Python 3.6. It provides functions for generating secure tokens, suitable for applications such as password resets and hard-to-guess URLs. Its documentation contains examples and best-practice recommendations to manage a basic level of security.
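A minimal sketch of what that will look like (assuming Python 3.6 or later):

import secrets

# A URL-safe token, e.g., for a password-reset link
token = secrets.token_urlsafe(16)

# Random choices made with the OS's best entropy source
pin = ''.join(secrets.choice('0123456789') for _ in range(6))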
When Cryptography came out, pyOpenSSL updated its bindings to use Cryptography’s CFFI-based bindings for the OpenSSL library and joined the PyCA umbrella. pyOpenSSL is separate from the Python Standard Library on purpose so that it can release updates at the speed of the security community6—it’s built on the newest OpenSSL, and not, like Python is, built on the OpenSSL that comes with your operating system (unless you build it yourself against a newer version). Generally if you’re building a server, you’d want to use pyOpenSSL—see Twisted’s SSL documentation for an example of how they use pyOpenSSL.
Install it using pip:
$ pip install pyOpenSSL
and import it with the name
OpenSSL. This example
shows a couple of the functions available:
>>> import OpenSSL
>>>
>>> OpenSSL.crypto.get_elliptic_curve('Oakley-EC2N-3')
<Curve 'Oakley-EC2N-3'>
>>>
>>> OpenSSL.SSL.Context(OpenSSL.SSL.TLSv1_2_METHOD)
<OpenSSL.SSL.Context object at 0x10d778ef0>
The pyOpenSSL team maintains example code that includes certificate generation, a way to start using SSL over an already-connected socket, and a secure XMLRPC server.
The idea behind libsodium, the C library backend for both PyNaCl and libnacl, is to intentionally not provide users with many choices—just the best one for their situation. It does not support all of the TLS protocol; if you want that, use pyOpenSSL. If all you want is an encrypted connection with some other computer you’re in control of, with your own protocols of your choosing, and you don’t want to deal with OpenSSL, then use this.7
Pronounce PyNaCl as “py-salt” and libnacl as “lib-salt”—they’re both derived from the NaCl (salt) library.
We recommend PyNaCl over
libnacl because it’s
under the PyCA umbrella, and you don’t have to install libsodium separately.
The libraries are virtually the same—PyNaCl uses CFFI bindings
for the C libraries, and libnacl uses ctypes—so it really doesn’t matter that much.
Install PyNaCl using pip:
$ pip install PyNaCl
And follow the PyNaCl examples in its documentation.
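For instance, symmetric encryption with PyNaCl's SecretBox looks like this—a minimal sketch following the pattern in its documentation:

import nacl.secret
import nacl.utils

# A random 32-byte key; store it somewhere safe
key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
box = nacl.secret.SecretBox(key)

# A random nonce; it must never be reused with the same key
nonce = nacl.utils.random(nacl.secret.SecretBox.NONCE_SIZE)
encrypted = box.encrypt(b'A really secret message.', nonce)
plaintext = box.decrypt(encrypted)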
Cryptography provides cryptographic recipes and primitives. It supports Python 2.6–2.7, Python 3.3+, and PyPy. The PyCA recommends the higher-level interface in pyOpenSSL for most uses.
Cryptography is divided into two layers: recipes and hazardous materials
(hazmat). The recipes layer provides a simple API for proper symmetric
encryption, and the hazmat layer provides low-level cryptographic primitives.
Install it using pip:
$ pip install cryptography
This example uses a high-level symmetric encryption recipe—the only high-level function in this library:
from cryptography.fernet import Fernet

key = Fernet.generate_key()
cipher_suite = Fernet(key)
cipher_text = cipher_suite.encrypt(b"A really secret message.")
plain_text = cipher_suite.decrypt(cipher_text)
PyCrypto
provides secure hash functions and various encryption algorithms. It
supports Python version 2.1+ and Python 3+. Because the
C code is custom, the PyCA was wary of adopting it, but
it was also the de facto cryptography library for Python for years,
so you’ll see it in older code.
Install it using pip:
$ pip install pycrypto
And use it like this:
from Crypto.Cipher import AES

# CBC mode operates on 16-byte blocks, so pad the message to a
# multiple of 16 bytes (a real protocol would use a scheme like PKCS#7)
message = b'A really secret message.'
padded = message + b' ' * (16 - len(message) % 16)

# Encryption
encryption_suite = AES.new(b'This is a key123', AES.MODE_CBC, b'This is an IV456')
cipher_text = encryption_suite.encrypt(padded)

# Decryption: use the same key and initialization vector
decryption_suite = AES.new(b'This is a key123', AES.MODE_CBC, b'This is an IV456')
plain_text = decryption_suite.decrypt(cipher_text)
If you want to use the bcrypt algorithm for your
passwords, use this library. Previous users of
py-bcrypt should find it easy to transition, because it is compatible.
Install it using pip:
$ pip install bcrypt
It only has two functions: bcrypt.hashpw() and bcrypt.gensalt().
The latter lets you choose how many iterations to use—more iterations will make the algorithm slower
(it defaults to a reasonable number).
Here’s an example:
>>> import bcrypt
>>>
>>> password = bytes('password', 'utf-8')
>>> hashed_pw = bcrypt.hashpw(password, bcrypt.gensalt(14))
>>> hashed_pw
b'$2b$14$qAmVOCfEmHeC8Wd5BoF1W.7ny9M7CSZpOR5WPvdKFXDbkkX8rGJ.e'
We store the hashed password somewhere:
>>> import binascii
>>> hexed_hashed_pw = binascii.hexlify(hashed_pw)
>>> store_password(user_id=42, password=hexed_hashed_pw)
and when it’s time to check the password, use the hashed password
as the second argument to bcrypt.hashpw() like this:
>>> hexed_hashed_pw = retrieve_password(user_id=42)
>>> hashed_pw = binascii.unhexlify(hexed_hashed_pw)
>>>
>>> bcrypt.hashpw(password, hashed_pw)
b'$2b$14$qAmVOCfEmHeC8Wd5BoF1W.7ny9M7CSZpOR5WPvdKFXDbkkX8rGJ.e'
>>>
>>> bcrypt.hashpw(password, hashed_pw) == hashed_pw
True
1 The TCP/IP (or Internet Protocol) suite has four conceptual parts: Link layer protocols specify how to get information between a computer and the Internet. Within the computer, they’re the responsibility of network cards and the operating system, not of the Python program. Internet layer protocols (IPv4, IPv6, etc.) govern the delivery of packages of bits from a source to a destination—the standard options are in Python’s socket library. Transport layer protocols (TCP, UDP, etc.) specify how the two endpoints will communicate. The options are also in the socket library. Finally, application layer protocols (FTP, HTTP, etc.) specify what the data should look like to be used by an intended application (e.g., FTP is used for file transfer, and HTTP is used for hypertext transfer)—Python’s Standard Library provides separate modules implementing the most common protocols.
2 A socket is three things: an IP address including port, a transport protocol (like TCP / UDP), and an I/O channel (some sort of file-like object). The Python documentation includes a great intro to sockets.
3 The queue doesn’t require an IP address or protocol, as it’s on the same computer—you just write some data to it and another process can read it. It’s like the multiprocessing.Queue, but here the I/O is done asynchronously.
4 The birth of the cryptography library, and some of the backstory for the motivation behind this new effort, is described in Jake Edge’s blog post “The state of crypto in Python.” The cryptography library it describes is a lower-level library, intended to be imported by higher-level libraries like pyOpenSSL that most of us would use. Edge quotes Jarret Raim and Paul Kehrer’s talk about the State of Crypto in Python, saying their test suite has over 66,000 tests, run 77 times per build.
5 A salt is a random string that further obfuscates the hash; if everyone used the same algorithm, a nefarious actor could generate a lookup table of common passwords and their hashes, and use them to “decode” stolen password files. So, to thwart this, people append a random string (a “salt”) to the password—they just also have to store that random string for future use.
6 Anybody can join the PyCA’s cryptography-dev listserv to keep up with development and other news…and the OpenSSL listserv for OpenSSL news.
7 If you’re paranoid, want to be able to audit 100% of your crypto code, don’t care that it’s a tad slow, and aren’t so interested in having the most current algorithms and defaults, try TweetNaCl, which is a single file crypto library that fits in 100 tweets. Because PyNaCl bundles libsodium in its release, you can probably just drop in TweetNaCl and still run most everything (however, we didn’t try this option).