This chapter covers most of the issues related to dealing with files and filesystems in Python. A file is a stream of text or bytes that a program can read and/or write; a filesystem is a hierarchical repository of files on a computer system.
Because files are such a crucial concept in programming, even though this chapter is the largest one in the book, several other chapters also contain material that is relevant when you’re handling specific kinds of files. In particular, Chapter 11 deals with many kinds of files related to persistence and database functionality (JSON files in “The json Module”, pickle files in “The pickle and cPickle Modules”, shelve files in “The shelve Module”, DBM and DBM-like files in “The v3 dbm Package”, and SQLite database files in “SQLite”), Chapter 22 deals with files and other streams in HTML format, and Chapter 23 deals with files and other streams in XML format.
Files and streams come in many flavors: their contents can be arbitrary bytes or text (with various encodings, if the underlying storage or channel deals only with bytes, as most do); they may be suitable for reading, writing, or both; they may or may not be buffered; they may or may not allow “random access,” going back and forth in the file (a stream whose underlying channel is an Internet socket, for example, only allows sequential, “going forward” access—there’s no “going back and forth”).
Traditionally, old-style Python coalesced most of this diverse functionality into the built-in file object, working in different ways depending on how the built-in function open created it. These built-ins are still in v2, for backward compatibility.
In both v2 and v3, however, input/output (I/O) is more logically structured, within the standard library’s io module. In v3, the built-in function open is, in fact, simply an alias for the function io.open. In v2, the built-in open still works the old-fashioned way, creating and returning an old-fashioned built-in file object (a type that does not exist anymore in v3). However, you can from io import open to use, instead, the new and better structured io.open: the file-like objects it returns operate quite similarly to the old-fashioned built-in file objects in most simple cases, and you can call the new function in a similar way to the old-fashioned built-in function open, too. (Alternatively, of course, in both v2 and v3, you can practice the excellent principle “explicit is better than implicit” by doing import io and then explicitly using io.open.)
If you have to maintain old code using built-in file objects in complicated ways, and don’t want to port that code to the newer approach, use the online docs as a reference to the details of old-fashioned built-in file objects. This book (and, specifically, the start of this chapter) does not cover old-fashioned built-in file objects, just io.open and the various classes from the io module.
Immediately after that, this chapter covers the polymorphic concept of file-like objects (objects that are not actually files but behave to some extent like files) in “File-Like Objects and Polymorphism”.
The chapter next covers modules that deal with temporary files and file-like objects (tempfile in “The tempfile Module”, and io.StringIO and io.BytesIO in “In-Memory “Files”: io.StringIO and io.BytesIO”).
Next comes the coverage of modules that help you access the contents of text and binary files (fileinput in “The fileinput Module”, linecache in “The linecache Module”, and struct in “The struct Module”) and support compressed files and other data archives (gzip in “The gzip Module”, bz2 in “The bz2 Module”, tarfile in “The tarfile Module”, zipfile in “The zipfile Module”, and zlib in “The zlib Module”). v3 also supports LZMA compression, as used, for example, by the xz program: we don’t cover that issue in this book, but see the online docs and PyPI for a backport to v2.
In Python, the os module supplies many of the functions that operate on the filesystem, so this chapter continues by introducing the os module in “The os Module”. The chapter then covers, in “Filesystem Operations”, operations on the filesystem (comparing, copying, and deleting directories and files; working with file paths; and accessing low-level file descriptors) offered by os (in “File and Directory Functions of the os Module”), os.path (in “The os.path Module”), and other modules (dircache under listdir in Table 10-3, stat in “The stat Module”, filecmp in “The filecmp Module”, fnmatch in “The fnmatch Module”, glob in “The glob Module”, and shutil in “The shutil Module”). We do not cover the module pathlib, supplying an object-oriented approach to filesystem paths, since, as of this writing, it has been included in the standard library only on a provisional basis, meaning it can undergo backward-incompatible changes, up to and including removal of the module; if you nevertheless want to try it out, see the online docs, and PyPI for a v2 backport.
While most modern programs rely on a graphical user interface (GUI), often via a browser or a smartphone app, text-based, nongraphical “command-line” user interfaces are still useful, since they’re simple, fast to program, and lightweight. This chapter concludes with material about text input and output in Python in “Text Input and Output”, richer text I/O in “Richer-Text I/O”, interactive command-line sessions in “Interactive Command Sessions”, and, finally, a subject generally known as internationalization (often abbreviated i18n). Building software that processes text understandable to different users, across languages and cultures, is described in “Internationalization”.
As mentioned in “Organization of This Chapter”, io is a standard library module in Python and provides the most common ways for your Python programs to read or write files. Use io.open to make a Python “file” object to read and/or write data to a file as seen by the underlying operating system. Depending on what parameters you pass to io.open, that object can be an instance of io.TextIOWrapper if textual, or, if binary, io.BufferedReader, io.BufferedWriter, or io.BufferedRandom, depending on whether it’s read-only, write-only, or read-write. We refer to these as “file” objects, in quotes, to distinguish them from the old-fashioned built-in file type still present in v2.
In v3, the built-in function open is a synonym for io.open. In v2, use from io import open to get the same effect. We use io.open explicitly (assuming a previous import io has executed, of course), for clarity and to avoid ambiguity.
This section covers such “file” objects, as well as the important issue of making and using temporary files (on disk, or even in memory).
Python reacts to any I/O error related to a “file” object by raising an instance of built-in exception class IOError (in v3, that’s a synonym for OSError, but many useful subclasses exist, and are covered in “OSError and subclasses (v3 only)”). Errors that cause this exception include open failing to open a file, calls to a method on a “file” to which that method doesn’t apply (e.g., calling write on a read-only “file,” or calling seek on a nonseekable file)—which could also cause ValueError or AttributeError—and I/O errors diagnosed by a “file” object’s methods.
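As a minimal sketch of the error behavior just described (v3 semantics assumed; the file name data.txt and the variable names are hypothetical, chosen only for illustration), the following code provokes two typical failures and records the exceptions they raise:

```python
import errno
import io
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()                 # scratch directory for the demo
path = os.path.join(workdir, 'data.txt')     # hypothetical file name

# 1. Opening a file that does not exist raises IOError (OSError in v3).
try:
    io.open(path, 'r')
except IOError as e:
    open_error = e

# 2. Calling write on a read-only "file" raises io.UnsupportedOperation,
#    which in v3 is a subclass of both OSError and ValueError.
with io.open(path, 'w') as f:
    f.write(u'hello\n')
with io.open(path, 'r') as f:
    try:
        f.write(u'more')
    except (IOError, ValueError) as e:
        write_error = e

shutil.rmtree(workdir)                       # clean up the scratch directory
```

Catching the broad IOError (or OSError) is usually enough; check e.errno, or catch a more specific subclass in v3, only when your code needs to react differently to different failures.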
The io module also provides the underlying web of classes, both abstract and concrete, that, by inheritance and by composition (also known as wrapping), make up the “file” objects (instances of classes mentioned in the first paragraph of this section) that your program generally uses. We do not cover these advanced topics in this book. If you have access to unusual channels for data, or nonfilesystem data storage, and want to provide a “file” interface to those channels or storage, you can ease your task, by appropriate subclassing and wrapping, using other classes in the module io. For such advanced tasks, consult the online docs.
To create a Python “file” object, call io.open with the following syntax:
open(file, mode='r', buffering=-1, encoding=None, errors='strict', newline=None, closefd=True, opener=os.open)
file can be a string, in which case it’s any path to a file as seen by the underlying OS, or it can be an integer, in which case it’s an OS-level file descriptor as returned by os.open (or, in v3 only, whatever function you pass as the opener argument—opener is not supported in v2). When file is a string, open opens the file thus named (possibly creating it, depending on mode—despite its name, open is not just for opening existing files: it can also create new ones); when file is an integer, the underlying OS file must already be open (via os.open or whatever).
The object that open returns is a context manager: use with io.open(...) as f:, not f = io.open(...), to ensure the “file” f gets closed as soon as the with statement’s body is done.
open creates and returns an instance f of the appropriate class of the module io, depending on mode and buffering—we refer to all such instances as “file” objects; they all are reasonably polymorphic with respect to each other.
mode is a string indicating how the file is to be opened (or created). mode can be:
'r': The file must already exist, and it is opened in read-only mode.
'w': The file is opened in write-only mode. The file is truncated to zero length and overwritten if it already exists, or created if it does not exist.
'a': The file is opened in write-only mode. The file is kept intact if it already exists, and the data you write is appended to what’s already in the file. The file is created if it does not exist. Calling f.seek on the file changes the result of the method f.tell, but does not change the write position in the file.
'r+': The file must already exist and is opened for both reading and writing, so all methods of f can be called.
'w+': The file is opened for both reading and writing, so all methods of f can be called. The file is truncated and overwritten if it already exists, or created if it does not exist.
'a+': The file is opened for both reading and writing, so all methods of f can be called. The file is kept intact if it already exists, and the data you write is appended to what’s already in the file. The file is created if it does not exist. Calling f.seek on the file, depending on the underlying operating system, may have no effect when the next I/O operation on f writes data, but does work normally when the next I/O operation on f reads data.
The mode string may have any of the values just explained, followed by a b or t. b means a binary file, while t means a text one. When mode has neither b nor t, the default is text (i.e., 'r' is like 'rt', 'w' is like 'wt', and so on).
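The mode values above can be tried out with a short sketch like the following, which creates, appends to, and rereads a file inside a scratch directory (the file name log.txt and the variable names are assumptions made for this example only):

```python
import io
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()               # scratch directory for the demo
path = os.path.join(workdir, 'log.txt')    # hypothetical file name

with io.open(path, 'w') as f:      # 'w' creates the file (or truncates it)
    f.write(u'first\n')
with io.open(path, 'a') as f:      # 'a' keeps the contents and appends
    f.write(u'second\n')
with io.open(path, 'r') as f:      # 'r' is the same as 'rt': text, read-only
    text = f.read()

closed_after_with = f.closed       # the with statement closed the "file"
shutil.rmtree(workdir)             # clean up the scratch directory
```

Each with statement closes its “file” as soon as its body ends, so the next io.open call sees everything the previous one wrote.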
Binary files let you read and/or write strings of type bytes; text ones let you read and/or write Unicode text strings (str in v3, unicode in v2). For text files, when the underlying channel or storage system deals in bytes (as most do), encoding (the name of an encoding known to Python) and errors (an error-handler name such as 'strict', 'replace', and so on, as covered under decode in Table 8-6) matter, as they specify how to translate between text and bytes, and what to do on encoding and decoding errors.
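To illustrate how encoding and errors affect a text-mode “file”, here is a small sketch (the file name and the sample string are hypothetical) that writes text as UTF-8, inspects the raw bytes in binary mode, and then rereads the same bytes with a deliberately wrong encoding and the 'replace' error handler:

```python
import io
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'greeting.txt')   # hypothetical file name

# Text mode: we write Unicode text; io encodes it to bytes as UTF-8.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'caf\xe9')

# Binary mode: we get back the bytes exactly as stored.
with io.open(path, 'rb') as f:
    raw = f.read()

# Wrong encoding plus errors='replace': each undecodable byte becomes U+FFFD
# instead of raising UnicodeDecodeError (as errors='strict' would).
with io.open(path, 'r', encoding='ascii', errors='replace') as f:
    lossy = f.read()

shutil.rmtree(workdir)
```

Passing encoding explicitly, rather than relying on the platform default, keeps the bytes on disk the same across platforms.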
buffering is an integer that denotes the buffering you’re requesting for the file. When buffering is less than 0, a default is used. Normally, this default is line buffering for files that correspond to interactive consoles, and a buffer of io.DEFAULT_BUFFER_SIZE bytes for other files. When buffering is 0, the file (which must be binary mode) is unbuffered; the effect is as if the file’s buffer were flushed every time you write anything to the file. When buffering equals 1, the file (which must be text mode) is line-buffered, which means the file’s buffer is flushed every time you write \n to the file. When buffering is greater than 1, the file uses a buffer of about buffering bytes, rounded up to some reasonable amount.
A “file” object f is inherently sequential (a stream of bytes or text). When you read, you get bytes or text in the sequential order in which they’re present. When you write, the bytes or text you write are added in the order in which you write them.
To allow nonsequential access (also known as “random access”), a “file” object whose underlying storage allows this keeps track of its current position (the position in the underlying file where the next read or write operation starts transferring data). f.seekable() returns True when f supports nonsequential access.
When you open a file, the initial position is at the start of the file. Any call to f.write on a “file” object f opened with a mode of 'a' or 'a+' always sets f’s position to the end of the file before writing data to f. When you write or read n bytes to/from “file” object f, f’s position advances by n. You can query the current position by calling f.tell and change the position by calling f.seek, both covered in the next section.
f.tell and f.seek also work on a text-mode f, but in this case the offset you pass to f.seek must be 0 (to position f at the start or end, depending on f.seek’s second parameter), or the opaque result previously returned by a call to f.tell, to position f back to a position you had thus “bookmarked” before.
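The following sketch shows position tracking on a binary “file” opened for reading and writing (the file name data.bin and the sample bytes are assumptions for this example):

```python
import io
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'data.bin')   # hypothetical file name

with io.open(path, 'wb') as f:
    f.write(b'0123456789')

with io.open(path, 'r+b') as f:
    can_seek = f.seekable()       # disk files support random access
    start = f.tell()              # the initial position is 0
    f.seek(4)                     # absolute offset from the start
    chunk = f.read(3)             # reads b'456', advancing the position by 3
    after = f.tell()
    f.seek(-2, os.SEEK_END)       # two bytes before the end of the file
    f.write(b'XY')                # overwrites the last two bytes

with io.open(path, 'rb') as f:
    final = f.read()

shutil.rmtree(workdir)
```

On a text-mode “file”, the same calls are restricted as just explained: pass f.seek either 0 or a value previously returned by f.tell.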
A “file” object f supplies the attributes and methods documented in this section.
|
close |
Closes the file. You can call no other method on |
|
closed |
|
|
encoding |
|
|
flush |
Requests that |
|
isatty |
|
|
fileno |
Returns an integer, the file descriptor of |
|
mode |
|
|
name |
|
|
read |
In v2, or in v3 when |
|
readline |
Reads and returns one line from |
|
readlines |
Reads and returns a list of all lines in |
|
seek |
Sets When When |
|
tell |
Returns |
|
truncate |
Truncates |
|
write |
Writes the bytes of string |
|
writelines |
Like:
It does not matter whether the strings in iterable |
A “file” object f, open for text-mode reading, is also an iterator whose items are the file’s lines. Thus, the loop:
for line in f:
iterates on each line of the file. Due to buffering issues, interrupting such a loop prematurely (e.g., with break), or calling next(f) instead of f.readline(), leaves the file’s position set to an arbitrary value. If you want to switch from using f as an iterator to calling other reading methods on f, be sure to set the file’s position to a known value by appropriately calling f.seek. On the plus side, a loop directly on f has very good performance, since these specifications allow the loop to use internal buffering to minimize I/O without taking up excessive amounts of memory even for huge files.
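As a small illustration of looping directly on a text-mode “file” (the file name and its contents are made up for this sketch), the following code collects the length of each line without ever holding the whole file in memory:

```python
import io
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'words.txt')   # hypothetical file name

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'alpha\nbeta\ngamma\n')

lengths = []
with io.open(path, 'r', encoding='utf-8') as f:
    for line in f:                          # each item keeps its trailing '\n'
        lengths.append(len(line.rstrip(u'\n')))

shutil.rmtree(workdir)
```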
An object x is file-like when it behaves polymorphically to a “file” object as returned by io.open, meaning that we can use x “as if” x were a “file.” Code using such an object (known as client code of the object) usually gets the object as an argument, or by calling a factory function that returns the object as the result. For example, if the only method that client code calls on x is x.read(), without arguments, then all x needs to supply in order to be file-like for that code is a method read that is callable without arguments and returns a string. Other client code may need x to implement a larger subset of file methods. File-like objects and polymorphism are not absolute concepts: they are relative to demands placed on an object by some specific client code.
Polymorphism is a powerful aspect of object-oriented programming, and file-like objects are a good example of polymorphism. A client-code module that writes to or reads from files can automatically be reused for data residing elsewhere, as long as the module does not break polymorphism by the dubious practice of type checking. When we discussed built-ins type and isinstance in Table 7-1, we mentioned that type checking is often best avoided, as it blocks the normal polymorphism that Python otherwise supplies. Most often, to support polymorphism in your client code, all you have to do is avoid type checking.
You can implement a file-like object by coding your own class (as covered in Chapter 4) and defining the specific methods needed by client code, such as read. A file-like object fl need not implement all the attributes and methods of a true “file” object f. If you can determine which methods the client code calls on fl, you can choose to implement only that subset. For example, when fl is only going to be written, fl doesn’t need “reading” methods, such as read, readline, and readlines.
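As a sketch of this idea, here is a hypothetical write-only file-like class (UppercaseWriter and report are names invented for this example, not part of any library). It supplies only write and writelines, which is all the sample client code calls:

```python
class UppercaseWriter(object):
    """Hypothetical write-only file-like object that stores uppercased text."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text.upper())
        return len(text)          # like "file" objects from io, return a count

    def writelines(self, lines):
        for line in lines:
            self.write(line)


def report(out):
    """Client code: needs only out.write and out.writelines."""
    out.write(u'status: ok\n')
    out.writelines([u'a\n', u'b\n'])


sink = UppercaseWriter()
report(sink)                      # works because report performs no type checks
result = u''.join(sink.chunks)
```

The same report function works unchanged with a “file” returned by io.open, since it relies only on the two methods it calls.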
If the main reason you want a file-like object instead of a real file object is to keep the data in memory, use the io module’s classes StringIO and BytesIO, covered in “In-Memory “Files”: io.StringIO and io.BytesIO”. These classes supply “file” objects that hold data in memory and largely behave polymorphically to other “file” objects.
The tempfile module lets you create temporary files and directories in the most secure manner afforded by your platform. Temporary files are often a good solution when you’re dealing with an amount of data that might not comfortably fit in memory, or when your program must write data that another process later uses.
The order of the parameters for the functions in this module is a bit confusing: to make your code more readable, always call these functions with named-argument syntax. The tempfile module exposes the functions and classes outlined in Table 10-1.
|
mkdtemp |
Securely creates a new temporary directory that is readable, writable, and searchable only by the current user, and returns the absolute path to the temporary directory. The optional arguments
|
|
mkstemp |
Securely creates a new temporary file, readable and writable only by the current user, not executable, not inherited by subprocesses; returns a pair Ensuring that the temporary file is removed when you’re done using it is up to you:
|
|
SpooledTemporaryFile |
Just like |
|
TemporaryFile |
Creates a temporary file with |
|
NamedTemporaryFile |
Like |
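The following sketch exercises three of these functions with named-argument syntax, as recommended above (the prefix, suffix, and variable names are illustrative assumptions). It shows that TemporaryFile cleans up after itself, while the results of mkdtemp and mkstemp remain until you remove them:

```python
import os
import shutil
import tempfile

# TemporaryFile: the file disappears as soon as it is closed.
with tempfile.TemporaryFile(mode='w+b') as tmp:
    tmp.write(b'scratch data')
    tmp.seek(0)
    roundtrip = tmp.read()

# mkdtemp and mkstemp: removing what they create is up to you.
workdir = tempfile.mkdtemp(prefix='demo_')
fd, path = tempfile.mkstemp(suffix='.txt', dir=workdir)
try:
    with os.fdopen(fd, 'w') as f:        # wrap the OS-level file descriptor
        f.write('kept until we remove it\n')
    existed = os.path.exists(path)
finally:
    shutil.rmtree(workdir)               # removes the directory and the file
gone = not os.path.exists(path)
```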
“File” objects supply the minimal functionality needed for file I/O. Some auxiliary Python library modules, however, offer convenient supplementary functionality, making I/O even easier and handier in several important cases.
The fileinput module lets you loop over all the lines in a list of text files. Performance is good, comparable to the performance of direct iteration on each file, since buffering is used to minimize I/O. You can therefore use module fileinput for line-oriented file input whenever you find the module’s rich functionality convenient, with no worry about performance. The input function is the key function of module fileinput; the module also supplies a FileInput class whose methods support the same functionality as the module’s functions. The module contents are listed here:
|
close |
Closes the whole sequence so that iteration stops and no file remains open. |
|
FileInput |
Creates and returns an instance |
|
filelineno |
Returns the number of lines read so far from the file now being read. For example, returns |
|
filename |
Returns the name of the file being read, or |
|
input |
Returns the sequence of lines in the files, suitable for use in a The sequence object that When
You can optionally pass an |
|
isfirstline |
Returns |
|
isstdin |
Returns |
|
lineno |
Returns the total number of lines read since the call to |
|
nextfile |
Closes the file being read: the next line to read is the first one of the next file. |
Here’s a typical example of using fileinput for a “multifile search and replace,” changing one string into another throughout the text files whose names were passed as command-line arguments to the script:
import fileinput
for line in fileinput.input(inplace=True):
    print(line.replace('foo', 'bar'), end='')
In such cases it’s important to have the end='' argument to print, since each line has its line-end character \n at the end, and you need to ensure that print doesn’t add another (or else each file would end up “double-spaced”).
The linecache module lets you read a given line (specified by number) from a file with a given name, keeping an internal cache so that, when you read several lines from a file, it’s faster than opening and examining the file each time. The linecache module exposes the following functions:
|
checkcache |
Ensures that the module’s cache holds no stale data and reflects what’s on the filesystem. Call |
|
clearcache |
Drops the module’s cache so that the memory can be reused for other purposes. Call |
|
getline |
Reads and returns the |
|
getlines |
Reads and returns all lines from the text file named |
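A minimal sketch of linecache in use follows (the file name poem.txt and its contents are assumptions made for this example):

```python
import io
import linecache
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'poem.txt')   # hypothetical file name

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'one\ntwo\nthree\n')

second = linecache.getline(path, 2)     # line numbers start from 1
beyond = linecache.getline(path, 99)    # out of range: empty string, no error
everything = linecache.getlines(path)   # all lines, served from the cache

linecache.clearcache()                  # release the cached lines
shutil.rmtree(workdir)
```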
The struct module lets you pack binary data into a bytestring, and unpack the bytes of such a bytestring back into the data they represent. Such operations are useful for many kinds of low-level programming. Most often, you use struct to interpret data records from binary files that have some specified format, or to prepare records to write to such binary files. The module’s name comes from C’s keyword struct, which is usable for related purposes. On any error, functions of the module struct raise exceptions that are instances of the exception class struct.error, the only class the module supplies.
The struct module relies on struct format strings following a specific syntax. The first character of a format string gives byte order, size, and alignment of packed data:
@: Native byte order, native data sizes, and native alignment for the current platform; this is the default if the first character is none of the characters listed here (note that format P in Table 10-2 is available only for this kind of struct format string). Look at string sys.byteorder when you need to check your system’s byte order ('little' or 'big').
=: Native byte order for the current platform, but standard size and alignment.
<: Little-endian byte order (like Intel platforms); standard size and alignment.
>, !: Big-endian byte order (network standard); standard size and alignment.
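| Character | C type | Python type | Standard size |
|---|---|---|---|
| B | unsigned char | int | 1 byte |
| b | signed char | int | 1 byte |
| c | char | bytes (length 1) | 1 byte |
| d | double | float | 8 bytes |
| f | float | float | 4 bytes |
| H | unsigned short | int | 2 bytes |
| h | signed short | int | 2 bytes |
| I | unsigned int | int | 4 bytes |
| i | signed int | int | 4 bytes |
| L | unsigned long | int | 4 bytes |
| l | signed long | int | 4 bytes |
| P | void* | int | N/A |
| p | char[] | bytes | N/A |
| s | char[] | bytes | N/A |
| x | padding byte | no value | 1 byte |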
Standard sizes are indicated in Table 10-2. Standard alignment means no forced alignment, with explicit padding bytes used if needed. Native sizes and alignment are whatever the platform’s C compiler uses. Native byte order can put the most significant byte at either the lowest (big-endian) or highest (little-endian) address, depending on the platform.
After the optional first character, a format string is made up of one or more format characters, each optionally preceded by a count (an integer represented by decimal digits). (The format characters are shown in Table 10-2.) For most format characters, the count means repetition (e.g., '3h' is exactly the same as 'hhh'). When the format character is s or p—that is, a bytestring—the count is not a repetition: it’s the total number of bytes in the string. Whitespace can be freely used between formats, but not between a count and its format character.
Format s means a fixed-length bytestring as long as its count (the Python string is truncated, or padded with copies of the null byte b'\0', if needed). The format p means a “Pascal-like” bytestring: the first byte is the number of significant bytes that follow, and the actual contents start from the second byte. The count is the total number of bytes, including the length byte.
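The format-string rules above can be checked with a short sketch that uses the struct functions pack, unpack, and calcsize (the record layout and the values are invented for this example):

```python
import struct

# Little-endian, standard sizes, no alignment padding:
# one unsigned short, a 4-byte bytestring, one 4-byte float.
record_format = '<H4sf'

packed = struct.pack(record_format, 513, b'ab', 1.5)
size = struct.calcsize(record_format)          # 2 + 4 + 4 bytes

number, name, value = struct.unpack(record_format, packed)
# name comes back padded with null bytes to the full count of 4

# For most format characters a count means repetition: '3h' equals 'hhh'.
same_size = struct.calcsize('<3h') == struct.calcsize('<hhh')
```

Note that the count in '4s' is a length, not a repetition: the two-byte value b'ab' is padded with null bytes to fill the four-byte field.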
The struct module supplies the following functions:
You can implement file-like objects by writing Python classes that supply the methods you need. If all you want is for data to reside in memory, rather than on a file as seen by the operating system, use the class StringIO or BytesIO of the io module. The difference between them is that instances of StringIO are text-mode “files,” so reads and writes consume or produce Unicode strings, while instances of BytesIO are binary “files,” so reads and writes consume or produce bytestrings.
When you instantiate either class you can optionally pass a string argument, respectively Unicode or bytes, to use as the initial content of the “file.” An instance f of either class, in addition to “file” methods, supplies one extra method:
|
getvalue |
Returns the current data contents of |
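As a brief sketch of both classes (the sample contents are arbitrary), note that an initial value does not move the position to the end, so the example seeks there before appending:

```python
import io

# Text-mode in-memory "file": consumes and produces Unicode strings.
text_file = io.StringIO(u'initial ')
text_file.seek(0, 2)              # whence=2: go to the end before appending
text_file.write(u'text')
text_value = text_file.getvalue()

# Binary in-memory "file": consumes and produces bytestrings.
bytes_file = io.BytesIO()
bytes_file.write(b'\x00\x01')
bytes_file.write(b'\x02')
bytes_value = bytes_file.getvalue()
```

getvalue returns the whole contents regardless of the current position, so there is no need to seek back to the start first.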
Storage space and transmission bandwidth are increasingly cheap and abundant, but in many cases you can save such resources, at the expense of some extra computational effort, by using compression. Computational power grows cheaper and more abundant even faster than some other resources, such as bandwidth, so compression’s popularity keeps growing. Python makes it easy for your programs to support compression, since the Python standard library contains several modules dedicated to compression.
Since Python offers so many ways to deal with compression, some guidance may be helpful. Files containing data compressed with the zlib module are not automatically interchangeable with other programs, except for those files built with the zipfile module, which respects the standard format of ZIP file archives. You can write custom programs, with any language able to use InfoZip’s free zlib compression library, to read files produced by Python programs using the zlib module. However, if you need to interchange compressed data with programs coded in other languages but have a choice of compression methods, we suggest you use the modules bz2 (best), gzip, or zipfile instead. The zlib module, however, may be useful when you want to compress some parts of datafiles that are in some proprietary format of your own and need not be interchanged with any other program except those that make up your application.
In v3 only, you can also use the newer module lzma for even (marginally) better compression and compatibility with the newer xz utility. We do not cover lzma in this book; see the online docs, and, for use in v2, the v2 backport.
The gzip module lets you read and write files compatible with those handled by the powerful GNU compression programs gzip and gunzip. The GNU programs support many compression formats, but the module gzip supports only the gzip format, often denoted by appending the extension .gz to a filename. The gzip module supplies the GzipFile class and an open factory function:
|
GzipFile |
Creates and returns a file-like object
The file-like object |
|
open |
Like |
Say that you have some function f(x) that writes Unicode text to a text file object x passed in as an argument, by calling x.write and/or x.writelines. To make f write text to a gzip-compressed file instead:
import gzip, io
with io.open('x.txt.gz', 'wb') as underlying:
    with gzip.GzipFile(fileobj=underlying, mode='wb') as wrapper:
        f(io.TextIOWrapper(wrapper, 'utf8'))
This example opens the underlying binary file x.txt.gz and explicitly wraps it with gzip.GzipFile; thus, we need two nested with statements. This separation is not strictly necessary: we could pass the filename directly to gzip.GzipFile (or gzip.open); in v3 only, with gzip.open, we could even ask for mode='wt' and have the TextIOWrapper transparently provided for us. However, the example is coded to be maximally explicit, and portable between v2 and v3.
Reading back a compressed text file—for example, to display it on standard output—uses a pretty similar structure of code:
import gzip, io
with io.open('x.txt.gz', 'rb') as underlying:
    with gzip.GzipFile(fileobj=underlying, mode='rb') as wrapper:
        for line in wrapper:
            print(line.decode('utf8'), end='')
Here, we can’t just use an io.TextIOWrapper, since, in v2, it would not be iterable by line, given the characteristics of the underlying (decompressing) wrapper. However, the explicit decode of each line works fine in both v2 and v3.
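For comparison, here is a sketch of the more compact v3-only alternative mentioned earlier, in which gzip.open with a text mode supplies the text wrapper for us (this does not run in v2; the scratch directory and variable names are assumptions for the example):

```python
import gzip
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'x.txt.gz')

# v3 only: mode 'wt' wraps the compressed stream in a text layer.
with gzip.open(path, 'wt', encoding='utf8') as out:
    out.write(u'compressed line\n')

# v3 only: mode 'rt' decodes while decompressing; iteration is by line.
with gzip.open(path, 'rt', encoding='utf8') as inp:
    lines = list(inp)

shutil.rmtree(workdir)
```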
The bz2 module lets you read and write files compatible with those handled by the compression programs bzip2 and bunzip2, which often achieve even better compression than gzip and gunzip. Module bz2 supplies the BZ2File class, for transparent file compression and decompression, and functions compress and decompress to compress and decompress data strings in memory. It also provides objects to compress and decompress data incrementally, enabling you to work with data streams that are too large to comfortably fit in memory at once. For the latter, advanced functionality, consult the Python standard library’s online docs.
For richer functionality in v2, consider the third-party module bz2file, with more complete features, matching v3’s standard library module’s ones as listed here:
|
BZ2File |
Creates and returns a file-like object
|
|
compress |
Compresses string |
|
decompress |
Decompresses the compressed data string |
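A minimal sketch of the in-memory functions follows (the sample data is arbitrary and deliberately repetitive, so that it compresses well):

```python
import bz2

original = b'spam and eggs ' * 100

compressed = bz2.compress(original, 9)    # 9 requests the highest compression
restored = bz2.decompress(compressed)

saved = len(original) - len(compressed)   # bytes saved by compressing
```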
The tarfile module lets you read and write TAR files (archive files compatible with those handled by popular archiving programs such as tar), optionally with either gzip or bzip2 compression (and, in v3 only, lzma too). In v3 only, python -m tarfile offers a useful command-line interface to the module’s functionality: run it without further arguments to get a brief help message.
When handling invalid TAR files, functions of tarfile raise instances of tarfile.TarError. The tarfile module supplies the following classes and functions:
|
is_tarfile |
Returns |
|
TarInfo |
The methods
To check the type of
|
|
open |
Creates and returns a
In the mode strings specifying compression, you can use a vertical bar A |
|
add |
Adds to archive |
|
addfile |
Adds to archive |
|
close |
Closes archive |
|
extract |
Extracts the archive member identified by |
|
extractfile |
Extracts the archive member identified by |
|
getmember |
Returns a |
|
getmembers |
Returns a list of |
|
getnames |
Returns a list of strings, the names of each member in archive |
|
gettarinfo |
Returns a |
|
list |
Outputs a directory of the archive |
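The following sketch builds a small gzip-compressed TAR archive and reads it back, using only the functions and methods listed above (the file names hello.txt and demo.tar.gz are hypothetical):

```python
import io
import os
import shutil
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'hello.txt')      # hypothetical member file
archive = os.path.join(workdir, 'demo.tar.gz')   # hypothetical archive name

with io.open(source, 'wb') as f:
    f.write(b'hello tar\n')

# 'w:gz' creates a new archive with gzip compression.
with tarfile.open(archive, 'w:gz') as t:
    t.add(source, arcname='hello.txt')   # store under a relative name

is_tar = tarfile.is_tarfile(archive)

# 'r:gz' reads it back; extractfile returns a file-like object of bytes.
with tarfile.open(archive, 'r:gz') as t:
    names = t.getnames()
    content = t.extractfile('hello.txt').read()

shutil.rmtree(workdir)
```

Passing arcname keeps absolute paths from the local filesystem out of the archive.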
The zipfile module can read and write ZIP files (i.e., archive files compatible with those handled by popular compression programs such as zip and unzip, pkzip and pkunzip, WinZip, and so on). python -m zipfile offers a useful command-line interface to the module’s functionality: run it without further arguments to get a brief help message.
Detailed information about ZIP files is at the pkware and info-zip web pages. You need to study that detailed information to perform advanced ZIP file handling with zipfile. If you do not specifically need to interoperate with other programs using the ZIP file standard, the modules gzip and bz2 are often better ways to deal with file compression.
The zipfile module can’t handle multidisk ZIP files, and cannot create encrypted archives (however, it can decrypt them, albeit rather slowly). The module also cannot handle archive members using compression types besides the usual ones, known as stored (a file copied to the archive without compression) and deflated (a file compressed using the ZIP format’s default algorithm). (In v3 only, zipfile also handles compression types bzip2 and lzma, but beware: not all tools, including v2’s zipfile, can handle those, so if you use them you’re sacrificing some portability to get better compression.) For errors related to invalid .zip files, functions of zipfile raise exceptions that are instances of the exception class zipfile.error.
The zipfile module supplies the following classes and functions:
|
is_zipfile |
Returns |
|
ZipInfo |
The methods
|
|
ZipFile |
Opens a ZIP file named by string When
When A In addition,
|
A ZipFile instance z supplies the following methods:
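| close() | Closes archive file z. Make sure to call z.close() (or use z in a with statement) when you are done; otherwise, essential records might not be written to the archive. |
| extract(member, path=None, pwd=None) | Extracts an archive member to disk, to the directory path (by default, the current working directory); member is the member's name or a ZipInfo instance. Returns the path to the file it has created (or overwritten if it already existed), or to the directory it has created (or left alone if it already existed). |
| extractall(path=None, members=None, pwd=None) | Extracts archive members to disk (by default, all of them), to directory path (by default, the current working directory); members, when given, must be a subset of the list returned by z.namelist(). |
| getinfo(name) | Returns a ZipInfo instance that supplies information about the archive member named by string name; raises KeyError when there is no such member. |
| infolist() | Returns a list of ZipInfo instances, one for each member in archive z, in the same order as the entries in the archive. |
| namelist() | Returns a list of strings, the name of each member in archive z, in the same order as the entries in the archive. |
| open(name, mode='r', pwd=None) | Extracts and returns the archive member identified by name (a member name string or a ZipInfo instance) as a file-like object. |
| printdir() | Outputs a textual directory of the archive z to sys.stdout. |
| read(name, pwd=None) | Extracts the archive member identified by name (a member name string or a ZipInfo instance) and returns the bytestring of its contents. |
| setpassword(pwd) | Sets string pwd as the default password to use to decrypt encrypted members. |
| testzip() | Reads and checks the files in archive z. Returns the name of the first archive member that is damaged, or None when the archive is intact. |
| write(filename, arcname=None, compress_type=None) | Writes the file named by string filename to archive z, with archive member name arcname (by default, the same as filename). compress_type, when given, overrides the archive's default compression for this member. z must be opened for 'w' or 'a'. |
| writestr(zinfo_or_arcname, data) | Adds a member to archive z from data held in memory: the first argument is either the archive member name or a ZipInfo instance, and data is the member's contents. z must be opened for 'w' or 'a'. |

Besides being faster and more concise, z.writestr avoids the need for temporary files on disk when the data you want to archive is already in memory. Here’s how you can print a list of all files contained in the ZIP file archive created by the previous example, followed by each file’s name and contents: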
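The examples this passage refers to did not survive in this text, so what follows is only a minimal sketch of the same idea, not the original listing. The archive name z.zip, the member names, and the helper build_and_list are hypothetical; the archive is created in a temporary directory so that nothing is left on disk.

```python
import os
import tempfile
import zipfile

def build_and_list(archive_path):
    """Create a small archive with writestr, then read every member back."""
    # writestr takes the member name and the data directly from memory
    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as z:
        z.writestr('file1.txt', 'hi')
        z.writestr('file2.txt', 'there')
    contents = {}
    with zipfile.ZipFile(archive_path) as z:
        for name in z.namelist():
            contents[name] = z.read(name)  # read returns a bytestring
    return contents

with tempfile.TemporaryDirectory() as tmp:
    result = build_and_list(os.path.join(tmp, 'z.zip'))

for name, data in result.items():
    print(name, data)
```

Calling z.printdir() inside the second with block would also emit the textual directory of the archive to standard output.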
The zlib module lets Python programs use the free InfoZip zlib compression library version 1.1.4 or later. zlib is used by the modules gzip and zipfile, but is also available directly for any special compression needs. The most commonly used functions supplied by zlib are:
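| compress(s, level) | Compresses string s and returns the string of compressed data. level, when given, is an integer from 0 to 9: 1 is fastest and compresses least, 9 is slowest and compresses most, and 0 means no compression. |
| decompress(s) | Decompresses the compressed bytestring s and returns the bytestring of uncompressed data. |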
The zlib module also supplies functions to compute Cyclic-Redundancy Check (CRC), which allows detection of damage in compressed data, as well as objects to compress and decompress data incrementally to allow working with data streams too large to fit in memory at once. For such advanced functionality, consult the Python library’s online docs.
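As a minimal sketch of the two functions above plus the CRC support just mentioned (the sample data is made up for illustration), a round trip might look like this:

```python
import zlib

original = b'spam and eggs ' * 100

packed = zlib.compress(original, 9)       # level 9: best compression
restored = zlib.decompress(packed)

# crc32 gives a checksum usable to detect accidental damage;
# masking keeps the value unsigned on every Python version
checksum = zlib.crc32(original) & 0xffffffff

print(len(original), len(packed), restored == original)
```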
os is an umbrella module presenting a reasonably uniform cross-platform view of the capabilities of various operating systems. It supplies low-level ways to create and handle files and directories, and to create, manage, and destroy processes. This section covers filesystem-related functions of os; “Running Other Programs with the os Module” covers process-related functions.
The os module supplies a name attribute, a string that identifies the kind of platform on which Python is being run. Common values for name are 'posix' (all kinds of Unix-like platforms, including Linux and macOS), 'nt' (all kinds of Windows platforms, 32-bit and 64-bit alike), and 'java' (Jython). You can exploit some unique capabilities of a platform through functions supplied by os. However, this book deals with cross-platform programming, not with platform-specific functionality, so we do not cover parts of os that exist only on one platform, nor platform-specific modules. Functionality covered in this book is available at least on 'posix' and 'nt' platforms. We do, though, cover some of the differences among the ways in which a given functionality is provided on various platforms.
When a request to the operating system fails, os raises an exception, an instance of OSError. os also exposes built-in exception class OSError with the synonym os.error. Instances of OSError expose three useful attributes:
errno: The numeric error code of the operating system error
strerror: A string that summarily describes the error
filename: The name of the file on which the operation failed (file-related functions only)
In v3 only, OSError has many subclasses that specify more precisely what problem was encountered, as covered in “OSError and subclasses (v3 only)”.
os functions can also raise other standard exceptions, such as TypeError or ValueError, when the cause of the error is that you have called them with invalid argument types or values, so that the underlying operating system functionality has not even been attempted.
The errno module supplies dozens of symbolic names for error code numbers. To handle possible system errors selectively, based on error codes, use errno to enhance your program’s portability and readability. For example, here’s how you might handle “file not found” errors, while propagating all other kinds of errors (when you want your code to work as well in v2 as in v3):
try:
    os.some_os_function_or_other()
except OSError as err:
    import errno
    # check for "file not found" errors, re-raise other cases
    if err.errno != errno.ENOENT:
        raise
    # proceed with the specific case you can handle
    print('Warning: file', err.filename, 'not found—continuing')
If you’re coding for v3 only, however, you can make an equivalent snippet much simpler and clearer, by catching just the applicable OSError subclass:
try:
    os.some_os_function_or_other()
except FileNotFoundError as err:
    print('Warning: file', err.filename, 'not found—continuing')
errno also supplies a dictionary named errorcode: the keys are error code numbers, and the corresponding values are the error names, which are strings such as 'ENOENT'. Displaying errno.errorcode[err.errno], as part of your diagnosis of some OSError instance err, can often make the diagnosis clearer and more understandable to readers who specialize in the specific platform.
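As a small illustration of errno.errorcode in a diagnosis, here is a sketch in which the helper try_remove and the file name are hypothetical; the path is placed inside a fresh temporary directory, so it is guaranteed not to exist:

```python
import errno
import os
import tempfile

def try_remove(path):
    """Return None on success, else (symbolic error name, filename)."""
    try:
        os.remove(path)
    except OSError as err:
        # map the numeric code to its symbolic name for the diagnosis
        return errno.errorcode.get(err.errno, 'unknown'), err.filename
    return None

with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, 'no-such-file')
    outcome = try_remove(missing)

print(outcome)
```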
Using the os module, you can manipulate the filesystem in a variety of ways: creating, copying, and deleting files and directories; comparing files; and examining filesystem information about files and directories. This section documents the attributes and methods of the os module that you use for these purposes, and covers some related modules that operate on the filesystem.
A file or directory is identified by a string, known as its path, whose syntax depends on the platform. On both Unix-like and Windows platforms, Python accepts Unix syntax for paths, with a slash (/) as the directory separator. On non-Unix-like platforms, Python also accepts platform-specific path syntax. On Windows, in particular, you may use a backslash (\) as the separator. However, you then need to double up each backslash as \\ in string literals, or use raw-string syntax as covered in “Literals”, and you needlessly lose portability. Unix path syntax is handier and usable everywhere, so we strongly recommend that you always use it. In the rest of this chapter, we assume Unix path syntax in both explanations and examples.
The os module supplies attributes that provide details about path strings on the current platform. You should typically use the higher-level path manipulation operations covered in “The os.path Module” rather than lower-level string operations based on these attributes. However, the attributes may be useful at times.
curdir: The string that denotes the current directory ('.' on Unix and Windows)
defpath: The default search path for programs, used if the environment lacks a PATH environment variable
linesep: The string that terminates text lines ('\n' on Unix; '\r\n' on Windows)
extsep: The string that separates the extension part of a file’s name from the rest of the name ('.' on Unix and Windows)
pardir: The string that denotes the parent directory ('..' on Unix and Windows)
pathsep: The separator between paths in lists of paths, such as those used for the environment variable PATH (':' on Unix; ';' on Windows)
sep: The separator of path components ('/' on Unix; '\\' on Windows)
Unix-like platforms associate nine bits with each file or directory: three each for the file’s owner, its group, and anybody else (AKA “the world”), indicating whether the file or directory can be read, written, and executed by the given subject. These nine bits are known as the file’s permission bits, and are part of the file’s mode (a bit string that includes other bits that describe the file). These bits are often displayed in octal notation, since that groups three bits per digit. For example, mode 0o664 indicates a file that can be read and written by its owner and group, and read—but not written—by anybody else. When any process on a Unix-like system creates a file or directory, the operating system applies to the specified mode a bit mask known as the process’s umask, which can remove some of the permission bits.
Non-Unix-like platforms handle file and directory permissions in very different ways. However, the os functions that deal with file permissions accept a mode argument according to the Unix-like approach described in the previous paragraph. Each platform maps the nine permission bits in a way appropriate for it. For example, on versions of Windows that distinguish only between read-only and read/write files and do not distinguish file ownership, a file’s permission bits show up as either 0o666 (read/write) or 0o444 (read-only). On such a platform, when creating a file, the implementation looks only at bit 0o200, making the file read/write when that bit is 1, read-only when 0.
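The arithmetic behind permission bits and the umask can be sketched with plain integers and the symbolic bit names from the stat module. The umask value 0o022 here is just an assumed, typical example, not something read from the running process:

```python
import stat

mode = 0o664  # rw-rw-r--

owner_can_write = bool(mode & stat.S_IWUSR)
group_can_write = bool(mode & stat.S_IWGRP)
others_can_write = bool(mode & stat.S_IWOTH)

# The umask removes bits from the mode a process requests on creation
requested = 0o666
umask = 0o022          # assumed value, for illustration only
effective = requested & ~umask

# stat.filemode renders a full mode the way ls -l does
text = stat.filemode(stat.S_IFREG | mode)

print(oct(effective), text)
```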
The os module supplies several functions to query and set file and directory status. In all versions and platforms, the argument path to any of these functions can be a string giving the path of the file or directory involved. In v3 only, on some Unix platforms, some of the functions also support as argument path a file descriptor (AKA fd), an int denoting a file (as returned, for example, by os.open). In this case, the module attribute os.supports_fd is the set of functions of the os module that do support a file descriptor as argument path (the module attribute is missing in v2, and in v3 on platforms lacking such support).
In v3 only, on some Unix platforms, some functions support the optional, keyword-only argument follow_symlinks, defaulting to True. When true, and always in v2, if path indicates a symbolic link, the function follows it to reach an actual file or directory; when false, the function operates on the symbolic link itself. The module attribute os.supports_follow_symlinks, if present, is the set of functions of the os module that do support this argument.
In v3 only, on some Unix platforms, some functions support the optional, keyword-only argument dir_fd, defaulting to None. When present, path (if relative) is taken as being relative to the directory open at that file descriptor; when missing, and always in v2, path (if relative) is taken as relative to the current working directory. The module attribute os.supports_dir_fd, if present, is the set of functions of the os module that do support this argument.
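| access(path, mode) | Returns True when the file path has all of the permissions encoded in integer mode; otherwise, False. mode can be os.F_OK to test for file existence, or one or more of os.R_OK, os.W_OK, and os.X_OK (joined with the bitwise-OR operator) to test permissions to read, write, and execute the file. access uses the process’s real (not effective) user and group identifiers. In v3 only, access accepts the optional keyword-only argument effective_ids, defaulting to False; when true, access uses effective rather than real user and group identifiers. |
| chdir(path) | Sets the current working directory of the process to path. |
| chmod(path, mode) | Changes the permissions of the file path, as encoded in integer mode. The stat module supplies symbolic names for the individual permission bits. |
| getcwd() | Returns a str, the path of the current working directory. In v3, getcwdb returns the same value as bytes; in v2, getcwdu returns the same value as unicode. |
| link(src, dst) | Creates a hard link named dst, pointing to src. |
| listdir(path) | Returns a list whose items are the names of all files and subdirectories in the directory path. The list is in arbitrary order and does not include the special directory names '.' (current directory) and '..' (parent directory). The v2-only dircache module also supplies a function named listdir, which works like os.listdir but returns a sorted list and caches it. |
| makedirs(path, mode=0o777), mkdir(path, mode=0o777) | makedirs creates all directories that are part of path and do not yet exist. mkdir creates only the rightmost directory of path and raises OSError if any of the previous directories in path does not exist. Both functions use mode as the permission bits of the directories they create, and both raise OSError when creation fails. |
| remove(path), unlink(path) | Removes the file named path (see rmdir to remove a directory). unlink is a synonym of remove. |
| removedirs(path) | Loops from right to left over the directories that are part of path, removing each one. The loop ends when a removal attempt raises an exception, generally because a directory is not empty. removedirs does not propagate the exception, as long as it has removed at least one directory. |
| rename(source, dest) | Renames (i.e., moves) the file or directory named source to dest. If dest already exists, rename may either replace dest or raise an exception, depending on the platform; in v3 only, os.replace always replaces dest. |
| renames(source, dest) | Like rename, except that renames tries to create all intermediate directories needed for dest. After renaming, renames tries to remove empty directories from the path source, using removedirs. |
| rmdir(path) | Removes the empty directory named path (raises OSError if the removal fails, and, in particular, if the directory is not empty). |
| scandir(path) | (v3 only) Returns an iterator over os.DirEntry instances, one for each item in the directory path. A DirEntry instance supplies the attributes name and path and the methods is_dir, is_file, is_symlink, inode, and stat; depending on the platform, this can be faster than calling listdir and then stat on each item. |
| stat(path) | Returns a value x of type stat_result, which provides items of information about the file or subdirectory path. Access the items by attribute name: st_mode (protection and other mode bits), st_ino (inode number), st_dev (device ID), st_nlink (number of hard links), st_uid (user ID of owner), st_gid (group ID of owner), st_size (size in bytes), st_atime (time of last access), st_mtime (time of last modification), and st_ctime (time of last status change). For example, to print the size in bytes of file path, you can use print(os.stat(path).st_size). Time values are in seconds since the epoch, as covered in Chapter 12. |
| tempnam(dir=None, prefix=None), tmpnam() | (v2 only) Returns an absolute path usable as the name of a new temporary file. Note: tempnam and tmpnam are weaknesses in your program’s security. Avoid these functions and use the standard library module tempfile instead. |
| utime(path, times) | Sets the accessed and modified times of file or directory path. If times is None, utime uses the current time. Otherwise, times must be a pair of numbers (in seconds since the epoch) in the order (accessed, modified). |
| walk(top, topdown=True, onerror=None, followlinks=False) | A generator yielding an item for each directory in the tree whose root is directory top. When topdown is True, the default, walk visits directories from the tree’s root downward; when topdown is False, walk visits directories from the tree’s leaves upward. When onerror is None, walk catches and ignores any OSError raised during the tree walk; otherwise, onerror must be a function that walk calls with the exception instance as its only argument. Each item walk yields is a tuple of three subitems: dirpath, a string that is the directory’s path; dirnames, a list of names of subdirectories that are immediate children of the directory (special directories '.' and '..' are not included); and filenames, a list of names of files that are directly in the directory. If topdown is True, you can alter list dirnames in place, removing some items and/or reordering others, to affect the tree walk of the subtree rooted at dirpath. By default, walk does not walk down symbolic links that resolve to directories; to get such extra walking, pass followlinks as true. |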
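A short sketch may help show how walk and stat combine, including pruning the walk by altering dirnames in place. The helper tree_size and the directory layout are invented for this illustration:

```python
import os
import tempfile

def tree_size(root, skip=('.git',)):
    """Sum the sizes of all files under root, skipping named subdirectories."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # In-place edit: walk (top-down) will not descend into removed names
        dirnames[:] = [d for d in dirnames if d not in skip]
        for name in filenames:
            total += os.stat(os.path.join(dirpath, name)).st_size
    return total

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'pkg', 'sub'))
    os.makedirs(os.path.join(tmp, '.git'))
    samples = (('pkg/a.txt', b'12345'),
               ('pkg/sub/b.txt', b'123'),
               ('.git/big', b'x' * 1000))
    for rel, data in samples:
        with open(os.path.join(tmp, *rel.split('/')), 'wb') as f:
            f.write(data)
    size = tree_size(tmp)

print(size)
```

Rebinding dirnames to a new list (rather than assigning to the slice dirnames[:]) would not prune anything, since walk keeps using its own list object.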
The os.path module supplies functions to analyze and transform path strings. To use this module, you can import os.path; however, even if you just import os, you can also access the os.path module and all of its attributes. The most commonly useful functions from the module are listed here:
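| abspath(path) | Returns a normalized absolute path string equivalent to path, just like os.path.normpath(os.path.join(os.getcwd(), path)). For example, os.path.abspath(os.curdir) is the same as os.getcwd(). |
| basename(path) | Returns the base name part of path, just like os.path.split(path)[1]. For example, os.path.basename('b/c/d.e') returns 'd.e'. |
| commonprefix(list) | Accepts a list of strings and returns the longest string that is a prefix of all items in the list. Unlike all other functions in os.path, commonprefix works on arbitrary strings, not just on paths. For example, os.path.commonprefix(['foobar', 'foolish']) returns 'foo'. In v3 only, function os.path.commonpath works similarly, but returns only common prefix paths, not arbitrary string prefixes. |
| dirname(path) | Returns the directory part of path, just like os.path.split(path)[0]. For example, os.path.dirname('b/c/d.e') returns 'b/c'. |
| exists(path), lexists(path) | Returns True when path names an existing file or directory; otherwise, False. lexists also returns True when path names a symbolic link that points to a nonexisting file or directory (a broken symlink), while exists returns False in such cases. |
| expandvars(path), expanduser(path) | expandvars returns a copy of string path, where each substring of the form $name or ${name} is replaced with the value of environment variable name. For example, when the environment variable HOME is '/u/alex', os.path.expandvars('$HOME/foo/') emits /u/alex/foo/. expanduser expands a leading ~ to the path of the home directory of the current user. |
| getatime(path), getmtime(path), getctime(path), getsize(path) | Each of these functions returns an attribute from the result of os.stat(path): respectively, st_atime, st_mtime, st_ctime, and st_size. |
| isabs(path) | Returns True when path is absolute (on Unix, when it starts with a slash; on Windows, also when it starts with a drive designator followed by a separator); otherwise, False. |
| isfile(path) | Returns True when path names an existing regular file (following symbolic links); otherwise, False. |
| isdir(path) | Returns True when path names an existing directory (following symbolic links); otherwise, False. |
| islink(path) | Returns True when path names a symbolic link; otherwise, False. |
| ismount(path) | Returns True when path names a mount point; otherwise, False. |
| join(path, *paths) | Returns a string that joins the argument strings with the appropriate path separator for the current platform. For example, on Unix, exactly one slash character / separates adjacent path components. If any argument is an absolute path, join ignores previous components. For example, os.path.join('a/b', 'c/d', 'e/f') returns 'a/b/c/d/e/f', while os.path.join('a/b', '/c/d', 'e/f') returns '/c/d/e/f'. The second call to os.path.join ignores its first argument 'a/b', since its second argument '/c/d' is an absolute path. |
| normcase(path) | Returns a copy of path with case normalized for the current platform. On case-sensitive filesystems (as is typical in Unix-like systems), path is returned unchanged. On Windows, all letters are converted to lowercase and each / is converted to a \. |
| normpath(path) | Returns a normalized pathname equivalent to path, removing redundant separators and path-navigation aspects. For example, on Unix, normpath returns 'a/b' when path is any of 'a//b', 'a/./b', or 'a/c/../b'. normpath makes path separators appropriate for the current platform. |
| realpath(path) | Returns the actual path of the specified file or directory, resolving symlinks along the way. |
| relpath(path, start=os.curdir) | Returns a relative path to the specified file or directory, relative to directory start (by default, the process’s current working directory). |
| samefile(path1, path2) | Returns True if both arguments refer to the same file or directory. |
| sameopenfile(f1, f2) | Returns True if both file descriptor arguments refer to the same file or directory. |
| samestat(stat1, stat2) | Returns True if both arguments, instances of os.stat_result (typically, results of os.stat calls), refer to the same file or directory. |
| split(path) | Returns a pair of strings (dir, base) such that join(dir, base) equals path. base is the last component and never contains a path separator. If path ends in a separator, base is ''. dir is the leading part of path, up to the last separator, excluding trailing separators. For example, os.path.split('a/b/c/d') returns ('a/b/c', 'd'). |
| splitdrive(path) | Returns a pair of strings (drv, pth) such that drv+pth equals path. drv is either a drive specification or ''; it is always '' on platforms without drive specifications, such as Unix-like systems. |
| splitext(path) | Returns a pair (root, ext) such that root+ext equals path. ext is either '' or starts with a '.' and has no other '.' or path separator. For example, os.path.splitext('a.a/b.c.d') returns the pair ('a.a/b.c', '.d'). |
| walk(path, func, arg) | (v2 only) Calls func(arg, dirpath, namelist) for each directory in the tree whose root is the directory path, starting with path itself. This function is hard to use and obsolete; use, instead, the generator os.walk, covered in Table 10-4, on both v2 and v3. |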
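To try several of these functions with Unix path syntax on any platform, one option is the standard library module posixpath, which is what os.path refers to on Unix-like systems. This sketch uses it only so that the results do not depend on where the code runs:

```python
import posixpath  # the Unix flavor of os.path, importable everywhere

joined = posixpath.join('a/b', 'c/d', 'e/f')
restarted = posixpath.join('a/b', '/c/d', 'e/f')  # absolute part wins
head_tail = posixpath.split('a/b/c/d')
root_ext = posixpath.splitext('a.a/b.c.d')
cleaned = posixpath.normpath('a/c/../b')
prefix = posixpath.commonprefix(['foobar', 'foolish'])

for value in (joined, restarted, head_tail, root_ext, cleaned, prefix):
    print(value)
```

In ordinary code you would call the same functions through os.path, which picks the right flavor for the current platform.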
The function os.stat (covered in Table 10-4) returns instances of stat_result, whose item indices, attribute names, and meaning are also covered there. The stat module supplies attributes with names like those of stat_result’s attributes, turned into uppercase, and corresponding values that are the corresponding item indices.
The more interesting contents of the stat module are functions to examine the st_mode attribute of a stat_result instance and determine the kind of file. os.path also supplies functions for such tasks, which operate directly on the file’s path. The functions supplied by stat shown in the following list are faster when you perform several tests on the same file: they require only one os.stat call at the start of a series of tests, while the functions in os.path implicitly ask the operating system for the same information at each test. Each function returns True when mode denotes a file of the given kind; otherwise, False.
S_ISDIR(mode): Is the file a directory?
S_ISCHR(mode): Is the file a special device-file of the character kind?
S_ISBLK(mode): Is the file a special device-file of the block kind?
S_ISREG(mode): Is the file a normal file (not a directory, special device-file, and so on)?
S_ISFIFO(mode): Is the file a FIFO (also known as a “named pipe”)?
S_ISLNK(mode): Is the file a symbolic link?
S_ISSOCK(mode): Is the file a Unix-domain socket?
Several of these functions are meaningful only on Unix-like systems, since other platforms do not keep special files such as devices and sockets in the namespace for regular files, as Unix-like systems do.
The stat module supplies two more functions that extract relevant parts of a file’s mode (x.st_mode, for some result x of function os.stat):
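| S_IFMT(mode) | Returns those bits of mode that describe the kind of file (i.e., the bits that are examined by the functions S_ISDIR, S_ISREG, etc.). |
| S_IMODE(mode) | Returns those bits of mode that can be set by the function os.chmod (i.e., the permission bits and, on Unix-like platforms, a few other special bits such as the set-user-id flag). |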
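The point about performing several tests after a single os.stat call can be sketched as follows; a temporary directory stands in for whatever path you need to examine:

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    st = os.stat(tmp)                 # one system call...
    is_dir = stat.S_ISDIR(st.st_mode)   # ...then any number of tests
    is_regular = stat.S_ISREG(st.st_mode)
    kind_bits = stat.S_IFMT(st.st_mode)
    permission_bits = stat.S_IMODE(st.st_mode)

print(is_dir, is_regular, oct(permission_bits))
```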
The filecmp module supplies the following functions to compare files and directories:
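| cmp(f1, f2, shallow=True) | Compares the files named by path strings f1 and f2. If the files are deemed to be equal, cmp returns True; otherwise, False. If shallow is true, files are deemed to be equal if their stat signatures (kind, size, and modification time) are equal. When shallow is false, or when the signatures differ, cmp compares the contents of the files. |
| cmpfiles(dir1, dir2, common, shallow=True) | Loops on the sequence common. Each item of common is a string that names a file present in both directories dir1 and dir2. cmpfiles returns a tuple whose items are three lists of strings: (equal, diff, errs). equal is the list of names of files that are equal in both directories, diff is the list of names of files that differ between directories, and errs is the list of names of files that could not be compared (because they do not exist in both directories, or there is no permission to read them). The argument shallow is the same as for cmp. |
| dircmp(dir1, dir2, ignore=None, hide=None) | Creates a new directory-comparison instance object, comparing directories named dir1 and dir2, ignoring names listed in ignore, and hiding names listed in hide. A dircmp instance d supplies three methods: d.report() outputs to sys.stdout a comparison between dir1 and dir2; d.report_partial_closure() also compares their common immediate subdirectories; d.report_full_closure() also compares all common subdirectories, recursively. In addition, d supplies several attributes, covered in the following list. |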
A dircmp instance d supplies several attributes, computed “just in time” (i.e., only if and when needed, thanks to a __getattr__ special method) so that using a dircmp instance suffers no unnecessary overhead:
d.common: Files and subdirectories that are in both dir1 and dir2
d.common_dirs: Subdirectories that are in both dir1 and dir2
d.common_files: Files that are in both dir1 and dir2
d.common_funny: Names that are in both dir1 and dir2 for which os.stat reports an error or returns different kinds for the versions in the two directories
d.diff_files: Files that are in both dir1 and dir2 but with different contents
d.funny_files: Files that are in both dir1 and dir2 but could not be compared
d.left_list: Files and subdirectories that are in dir1
d.left_only: Files and subdirectories that are in dir1 and not in dir2
d.right_list: Files and subdirectories that are in dir2
d.right_only: Files and subdirectories that are in dir2 and not in dir1
d.same_files: Files that are in both dir1 and dir2 with the same contents
d.subdirs: A dictionary whose keys are the strings in common_dirs; the corresponding values are instances of dircmp for each subdirectory
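A small sketch of dircmp at work, using two throwaway directories whose names and contents are invented for the illustration (the helper write_text is hypothetical too):

```python
import filecmp
import os
import tempfile

def write_text(path, text):
    with open(path, 'w') as f:
        f.write(text)

with tempfile.TemporaryDirectory() as tmp:
    dir1 = os.path.join(tmp, 'dir1')
    dir2 = os.path.join(tmp, 'dir2')
    os.mkdir(dir1)
    os.mkdir(dir2)
    write_text(os.path.join(dir1, 'same.txt'), 'spam')
    write_text(os.path.join(dir2, 'same.txt'), 'spam')
    write_text(os.path.join(dir1, 'diff.txt'), 'short')
    write_text(os.path.join(dir2, 'diff.txt'), 'rather longer')
    write_text(os.path.join(dir1, 'only1.txt'), 'x')

    d = filecmp.dircmp(dir1, dir2)
    # Attributes are computed on first access, so read them
    # while the directories still exist
    left_only = sorted(d.left_only)
    same_files = sorted(d.same_files)
    diff_files = sorted(d.diff_files)
    deep_equal = filecmp.cmp(os.path.join(dir1, 'same.txt'),
                             os.path.join(dir2, 'same.txt'),
                             shallow=False)

print(left_only, same_files, diff_files, deep_equal)
```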
The fnmatch module (an abbreviation for filename match) matches filename strings with patterns that resemble the ones used by Unix shells:
*: Matches any sequence of characters
?: Matches any single character
[chars]: Matches any one of the characters in chars
[!chars]: Matches any one character not among those in chars
fnmatch does not follow other conventions of Unix shells’ pattern matching, such as treating a slash / or a leading dot . specially. It also does not allow escaping special characters: rather, to literally match a special character, enclose it in brackets. For example, to match a filename that’s a single closed bracket, use the pattern '[]]'.
The fnmatch module supplies the following functions:
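| filter(names, pattern) | Returns the list of items of names (a sequence of strings) that match pattern. |
| fnmatch(filename, pattern) | Returns True when string filename matches pattern; otherwise, False. The match is case-sensitive when the platform is (for example, any Unix-like system), and otherwise (for example, on Windows) case-insensitive. |
| fnmatchcase(filename, pattern) | Returns True when string filename matches pattern; otherwise, False. The match is always case-sensitive, on any platform. |
| translate(pattern) | Returns the regular expression pattern (as covered in “Pattern-String Syntax”) equivalent to the fnmatch pattern pattern. This can be useful, for example, to perform matches that are always case-insensitive on any platform. |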
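As a sketch of the matching rules and of the translate trick for case-insensitive matching (the file names are made up), fnmatchcase is used here rather than fnmatch so that the results do not depend on the platform:

```python
import fnmatch
import re

names = ['a.py', 'b.PY', 'notes.txt', ']']

# Always case-sensitive, on any platform
exact = [n for n in names if fnmatch.fnmatchcase(n, '*.py')]

# Always case-insensitive: compile the translated pattern yourself
rx = re.compile(fnmatch.translate('*.py'), re.IGNORECASE)
any_case = [n for n in names if rx.match(n)]

# A literal closed bracket must be enclosed in brackets
bracket = fnmatch.fnmatchcase(']', '[]]')

print(exact, any_case, bracket)
```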
The glob module lists (in arbitrary order) the path names of files that match a path pattern using the same rules as fnmatch; in addition, it does treat a leading dot . specially, like Unix shells do.
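| glob(pathname) | Returns the list of path names of files that match pattern pathname. In v3 only, you can also optionally pass the named argument recursive=True to have the path component ** recursively match zero or more levels of subdirectories. |
| iglob(pathname) | Like glob, but returns an iterator yielding one relevant path name at a time. |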
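The following sketch, with an invented directory layout, shows both the special treatment of a leading dot and the v3-only recursive matching. Results are sorted because glob itself promises no particular order:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel in ('a.txt', '.hidden.txt', os.path.join('sub', 'b.txt')):
        open(os.path.join(tmp, rel), 'w').close()

    # A leading dot is not matched by *, as in Unix shells
    top = sorted(os.path.basename(p)
                 for p in glob.glob(os.path.join(tmp, '*.txt')))

    # v3 only: ** matches zero or more levels of subdirectories
    deep = sorted(os.path.relpath(p, tmp)
                  for p in glob.glob(os.path.join(tmp, '**', '*.txt'),
                                     recursive=True))

print(top, deep)
```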
The shutil module (an abbreviation for shell utilities) supplies the following functions to copy and move files, and to remove an entire directory tree. In v3 only, on some Unix platforms, most of the functions support optional, keyword-only argument follow_symlinks, defaulting to True. When true, and always in v2, if a path indicates a symbolic link, the function follows it to reach an actual file or directory; when false, the function operates on the symbolic link itself.
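| copy(src, dst) | Copies the contents of the file src, creating or overwriting the file dst. If dst is a directory, the target is a file with the same base name as src, but located in dst. copy also copies permission bits, but not last-access and modification times. In v3 only, it returns the path to the destination file it has copied to (in v2, less usefully, copy returns None). |
| copy2(src, dst) | Like copy, but also copies times of last access and modification. |
| copyfile(src, dst) | Copies just the contents (not permission bits, nor last-access and modification times) of the file src, creating or overwriting the file dst. |
| copyfileobj(fsrc, fdst, bufsize) | Copies all bytes from the “file” object fsrc, which must be open for reading, to the “file” object fdst, which must be open for writing. Copies no more than bufsize bytes at a time if bufsize is greater than 0. |
| copymode(src, dst) | Copies permission bits of the file or directory src to the file or directory dst. Both src and dst must exist. Does not change dst’s contents, nor its status as being a file or a directory. |
| copystat(src, dst) | Copies permission bits and times of last access and modification of the file or directory src to the file or directory dst. Both src and dst must exist. Does not change dst’s contents, nor its status as being a file or a directory. |
| copytree(src, dst, symlinks=False, ignore=None) | Copies the directory tree rooted at src into the destination directory named by dst. dst must not already exist: copytree creates it, along with any missing parent directories. copytree copies each file with the function copy2. When symlinks is true, copytree creates symbolic links in the new tree when it finds symbolic links in the source tree. When symlinks is false, copytree follows each symbolic link it finds and copies the linked-to file with the link’s name. When ignore is not None, it must be a callable accepting two arguments (a directory path and a list of the directory’s immediate children) and returning the list of those children to be ignored in the copy. If present, ignore is usually the result of a call to shutil.ignore_patterns; for example, shutil.copytree(s, d, ignore=shutil.ignore_patterns('.*', '*.bak')) copies the tree rooted at directory s into a new tree rooted at directory d, ignoring any file or directory whose name starts with a '.' and any whose name ends with '.bak'. |
| ignore_patterns(*patterns) | Returns a callable picking out files and subdirectories matching patterns, like those used in the fnmatch module. The result is suitable for passing as the ignore argument to the function copytree. |
| move(src, dst) | Moves the file or directory src to dst. First tries os.rename. Then, if that fails (because src and dst are on separate filesystems, or because dst already exists), copies src to dst (copy2 for a file, copytree for a directory), then removes src (os.unlink for a file, rmtree for a directory). |
| rmtree(path, ignore_errors=False, onerror=None) | Removes directory tree rooted at path. When ignore_errors is true, rmtree ignores errors. When ignore_errors is false and onerror is None, errors raise exceptions. When onerror is not None, it must be callable with three parameters: func, path, and excp. func is the function raising the exception (such as os.remove or os.rmdir), path is the path passed to func, and excp is the tuple of information that sys.exc_info() returns. When onerror raises an exception, rmtree terminates, and the exception propagates. |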
Beyond offering functions that are directly useful, the source file shutil.py in the standard Python library is an excellent example of how to use many os functions.
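To see copytree, ignore_patterns, and rmtree working together, here is a minimal sketch on a throwaway tree; the directory and file names are assumptions made up for the example:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'src')
    dst = os.path.join(tmp, 'dst')   # must not exist before copytree
    os.mkdir(src)
    for name in ('keep.txt', 'old.bak', '.secret'):
        with open(os.path.join(src, name), 'w') as f:
            f.write(name)

    # Skip names starting with a dot and names ending in .bak
    shutil.copytree(src, dst,
                    ignore=shutil.ignore_patterns('.*', '*.bak'))
    copied = sorted(os.listdir(dst))

    shutil.rmtree(dst)
    dst_exists = os.path.exists(dst)

print(copied, dst_exists)
```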
Among its many other functions, the os module supplies several to handle file descriptors, integers the operating system uses as opaque handles to refer to open files. Python “file” objects (covered in “The io Module”) are usually better for I/O tasks, but sometimes working at file-descriptor level lets you perform some operation faster, or (sacrificing portability) in ways not directly available with io.open. “File” objects and file descriptors are not interchangeable.
To get the file descriptor n of a Python “file” object f, call n=f.fileno(). To wrap a new Python “file” object f around an open file descriptor fd, call f=os.fdopen(fd), or pass fd as the first argument of io.open. On Unix-like and Windows platforms, some file descriptors are pre-allocated when a process starts: 0 is the file descriptor for the process’s standard input, 1 for the process’s standard output, and 2 for the process’s standard error.
os provides many functions dealing with file descriptors; the most often used ones are listed in Table 10-5.
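| close(fd) | Closes file descriptor fd. |
| closerange(fd_low, fd_high) | Closes all file descriptors from fd_low, included, to fd_high, excluded, ignoring any errors that may occur. |
| dup(fd) | Returns a file descriptor that duplicates file descriptor fd. |
| dup2(fd, fd2) | Duplicates file descriptor fd to file descriptor fd2. When file descriptor fd2 is already open, dup2 first closes fd2. |
| fdopen(fd, *a, **k) | Like io.open, except that fd must be an int that is an open file descriptor. |
| fstat(fd) | Returns a stat_result instance x, with information about the file open on file descriptor fd. Table 10-4 covers x’s contents. |
| lseek(fd, pos, how) | Sets the current position of file descriptor fd to the signed integer byte offset pos and returns the resulting byte offset from the start of the file. how indicates the reference point: os.SEEK_SET for the start of the file, os.SEEK_CUR for the current position, os.SEEK_END for the end of the file. In particular, lseek(fd, 0, os.SEEK_CUR) returns the current position’s byte offset from the start of the file without affecting the current position. Normal disk files support seeking; calling lseek on a file that does not support seeking (e.g., a file open for output to a terminal) raises an exception. |
| open(file, flags, mode=0o777) | Returns a file descriptor, opening or creating a file named by string file. If open creates the file, it uses mode as the file’s permission bits. flags is an int, normally the bitwise OR of one or more attributes of os, such as O_RDONLY, O_WRONLY, or O_RDWR (read-only, write-only, or read/write: mutually exclusive), O_APPEND (appends to any previous file contents), O_CREAT (creates the file if it does not exist), O_EXCL (raises an exception if the file already exists), and O_TRUNC (truncates any previous file contents). |
| pipe() | Creates a pipe and returns a pair of file descriptors (r, w), respectively open for reading and writing. |
| read(fd, n) | Reads up to n bytes from file descriptor fd and returns them as a bytestring. Reads and returns fewer than n bytes when fewer are currently available for reading. In particular, returns the empty bytestring when no more bytes are available, typically because the file is finished. |
| write(fd, s) | Writes all bytes from bytestring s to file descriptor fd and returns the number of bytes written. |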
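A brief sketch of working at file-descriptor level follows: first a pipe, then a disk file opened with os.open. The file name digits.bin is hypothetical, and getattr(os, 'O_BINARY', 0) is used only so the same flags work on Windows, where that attribute exists, and elsewhere, where it does not:

```python
import os
import tempfile

# A pipe: whatever is written to w can be read back from r
r, w = os.pipe()
os.write(w, b'hello')
os.close(w)                      # closing the write end signals end of data
first = os.read(r, 100)
at_eof = os.read(r, 100)         # empty bytestring: nothing more to read
os.close(r)

# A disk file: create, write, seek, and read through one descriptor
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'digits.bin')
    flags = os.O_RDWR | os.O_CREAT | getattr(os, 'O_BINARY', 0)
    fd = os.open(path, flags, 0o600)
    try:
        written = os.write(fd, b'0123456789')
        position = os.lseek(fd, 3, os.SEEK_SET)
        chunk = os.read(fd, 4)
    finally:
        os.close(fd)

print(first, at_eof, written, position, chunk)
```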
Python presents non-GUI text input and output channels to your programs as “file” objects, so you can use the methods of “file” objects (covered in “Attributes and Methods of “file” Objects”) to operate on these channels.
The sys module (covered in “The sys Module”) has the attributes stdout and stderr, writeable “file” objects. Unless you are using shell redirection or pipes, these streams connect to the “terminal” running your script. Nowadays, actual terminals are very rare: a so-called “terminal” is generally a screen window that supports text I/O (e.g., a command prompt console on Windows or an xterm window on Unix).
The distinction between sys.stdout and sys.stderr is a matter of convention. sys.stdout, known as standard output, is where your program emits results. sys.stderr, known as standard error, is where error messages go. Separating results from error messages helps you use shell redirection effectively. Python respects this convention, using sys.stderr for errors and warnings.
Programs that output results to standard output often need to write to sys.stdout. Python’s print function (covered in Table 7-2) can be a rich, convenient alternative to sys.stdout.write. (In v2, start your module with from __future__ import print_function to make print a function—otherwise, for backward compatibility, it’s a less-convenient statement.)
print is fine for the informal output used during development to help you debug your code. For production output, you may need more control of formatting than print affords. You may need to control spacing, field widths, number of decimals for floating-point, and so on. If so, prepare the output as a string with string-formatting method format (covered in “String Formatting”), then output the string, usually with the write method of the appropriate “file” object. (You can pass formatted strings to print, but print may add spaces and newlines; the write method adds nothing at all, so it’s easier for you to control what exactly gets output.)
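For instance, a report with fixed field widths and two decimals might be sketched like this. The function report and its data are hypothetical, and an io.StringIO instance stands in for sys.stdout or any other “file” object with a write method:

```python
import io

def report(out, rows):
    """Write one aligned line per (name, value) pair to out."""
    for name, value in rows:
        # name left-aligned in 10 columns, value right-aligned in 8
        out.write('{:<10}{:>8.2f}\n'.format(name, value))

buf = io.StringIO()
report(buf, [('pi', 3.14159), ('e', 2.71828)])
text = buf.getvalue()
print(text, end='')
```

Passing sys.stdout as out sends the same lines to standard output; write adds nothing beyond the newline that the format string itself contains.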
To direct the output from print calls to a certain file, as an alternative to repeated use of file=destination on each print, you can temporarily change the value of sys.stdout. The following example is a general-purpose redirection function usable for such a temporary change; in the presence of asynchronous operations, make sure to also add a lock in order to avoid any contention:
def redirect(func, *args, **kwds):
    """redirect(func, ...) -> (string result, func's return value)

    func must be a callable and may emit results to standard output.
    redirect captures those results as a string and returns a pair,
    with the output string as the first item and func's return value
    as the second one.
    """
    import sys, io
    save_out = sys.stdout
    sys.stdout = io.StringIO()
    try:
        retval = func(*args, **kwds)
        return sys.stdout.getvalue(), retval
    finally:
        sys.stdout.close()
        sys.stdout = save_out
To output a few text values to a file object f that isn’t the current sys.stdout, avoid such manipulations. For such simple purposes, just calling f.write is often best, and print(file=f,...) is, even more often, a handy alternative.
The sys module provides the stdin attribute, which is a readable “file” object. When you need a line of text from the user, you can call built-in function input (covered in Table 7-2; in v2, it’s named raw_input), optionally with a string argument to use as a prompt.
When the input you need is not a string (for example, when you need a number), use input to obtain a string from the user, then other built-ins, such as int, float, or ast.literal_eval (covered below), to turn the string into the number you need.
You could, in theory, also use eval (normally preceded by compile, for better control of error diagnostics), so as to let the user input any expression, as long as you totally trust the user. A nasty user can exploit eval to breach security and cause damage (a well-meaning but careless user can also unfortunately cause just about as much damage). There is no effective defense—just avoid eval (and exec!) on any input from sources you do not fully trust.
One advanced alternative we do recommend is to use the function literal_eval from the standard library module ast (as covered in the online docs). ast.literal_eval(astring) returns a valid Python value for the given literal astring when it can, or else raises a SyntaxError or ValueError; it never has any side effect. However, to ensure complete safety, astring in this case cannot use any operator, nor any nonkeyword identifier. For example:
import ast
print(ast.literal_eval('23'))     # prints 23
print(ast.literal_eval('[2,3]'))  # prints [2, 3]
print(ast.literal_eval('2+3'))    # raises ValueError
print(ast.literal_eval('2+'))     # raises SyntaxError
Very occasionally, you may want the user to input a line of text in such a way that somebody looking at the screen cannot see what the user is typing. This may occur when you’re asking the user for a password. The getpass module provides the following functions:
The tools covered so far supply the minimal subset of text I/O functionality on all platforms. Most platforms offer richer-text I/O, such as responding to single keypresses (not just entire lines) and showing text in any spot on the terminal.
Python extensions and core Python modules let you access platform-specific functionality. Unfortunately, various platforms expose this functionality in very different ways. To develop cross-platform Python programs with rich-text I/O functionality, you may need to wrap different modules uniformly, importing platform-specific modules conditionally (usually with the try/except idiom covered in “try/except”).
The readline module wraps the GNU Readline Library. Readline lets the user edit text lines during interactive input, and recall previous lines for editing and reentry. Readline comes pre-installed on many Unix-like platforms, and it’s available online. On Windows, you can install and use the third-party module pyreadline.
When readline is available, Python uses it for all line-oriented input, such as input. The interactive Python interpreter always tries to load readline to enable line editing and recall for interactive sessions. Some readline functions control advanced functionality, particularly history, for recalling lines entered in previous sessions, and completion, for context-sensitive completion of the word being entered. (See the GNU Readline docs for details on configuration commands.) You can access the module’s functionality using the following functions:
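| add_history(s) | Adds string s as a line at the end of the history buffer. |
| clear_history() | Clears the history buffer. |
| get_completer() | Returns the current completer function (as last set by set_completer), or None if no completer function is set. |
| get_history_length() | Returns the number of lines of history to be saved to the history file. When the result is less than 0, all lines in the history are to be saved. |
| parse_and_bind(readline_cmd) | Gives Readline a configuration command. To let the user hit Tab to request completion, call parse_and_bind('tab: complete'). A good completion function is in the module rlcompleter: in the interactive interpreter (or in the startup file executed at the start of interactive sessions), import readline and rlcompleter, then call readline.parse_and_bind('tab: complete'). For the rest of this interactive session, you can hit Tab during line editing and get completion for global names and object attributes. |
| read_history_file(filename) | Loads history lines from the text file at path filename. |
| read_init_file(filename=None) | Makes Readline load a text file: each line is a configuration command. When filename is None, Readline loads the same file as last time. |
| set_completer(f=None) | Sets the completion function. When f is None or omitted, Readline disables completion. Otherwise, when the user types a partial word start and then hits Tab, Readline calls f(start, i), with i initially 0. f returns the ith possible word that begins with start, or None when there are no more. Readline loops, calling f with i set to 0, 1, 2, and so on, until f returns None. |
| set_history_length(x) | Sets the number of lines of history that are to be saved to the history file. When x is less than 0, all lines in the history are to be saved. |
| write_history_file(filename) | Saves history lines to the text file whose name or path is filename. |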
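The calling protocol that set_completer expects can be sketched without involving the terminal at all. The names make_completer and install, and the word list, are hypothetical; install is defined but deliberately not called here, since readline may be missing on some platforms:

```python
def make_completer(words):
    """Build a function following the f(start, i) completion protocol."""
    candidates = sorted(words)

    def completer(start, i):
        matches = [w for w in candidates if w.startswith(start)]
        # Return the ith match, or None when there are no more
        return matches[i] if i < len(matches) else None

    return completer

def install(completer):
    """Hook completer into Readline, if the module is available."""
    try:
        import readline
    except ImportError:
        return False
    readline.set_completer(completer)
    readline.parse_and_bind('tab: complete')
    return True

complete = make_completer(['print', 'stop', 'status'])
print(complete('st', 0), complete('st', 1), complete('st', 2))
```

In an interactive program you would call install(complete) once at startup; after that, hitting Tab during line input asks the completer for candidates.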
“Terminals” today are usually text windows on a graphical screen. You may also, in theory, use a true terminal, or (perhaps a tad less theoretical, but, these days, not by much) the console (main screen) of a personal computer in text mode. All such “terminals” in use today offer advanced text I/O functionality, accessed in platform-dependent ways. The curses package works on Unix-like platforms; for a cross-platform (Windows, Unix, Mac) solution, you may use third-party package UniCurses. The msvcrt module, on the contrary, exists only on Windows.
The classic Unix approach to advanced terminal I/O is named curses, for obscure historical reasons.1 The Python package curses affords reasonably simple use, but still lets you exert detailed control if required. We cover a small subset of curses, just enough to let you write programs with rich-text I/O functionality. (See Eric Raymond’s tutorial Curses Programming with Python for more). Whenever we mention “the screen” in this section, we mean the screen of the terminal (usually, these days, that’s the text window of a terminal-emulator program).
The simplest and most effective way to use curses is through the curses.wrapper function:
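| wrapper(func, *args) | When you call wrapper(func, *args), it performs curses initialization, calls func(stdscr, *args), performs curses finalization (setting the terminal back to normal behavior), and finally returns func’s result (or re-raises whatever exception func may have propagated). The key point is that wrapper sets the terminal back to normal behavior whether func terminates normally or by propagating an exception. |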
curses models text and background colors as character attributes. Colors available on the terminal are numbered from 0 to curses.COLORS - 1. The function color_content takes a color number n as its argument and returns a tuple (r, g, b) of integers between 0 and 1000 giving the amount of each primary color in n. The function color_pair takes the number n of a color pair (a foreground and background combination, as defined with curses.init_pair) as its argument and returns an attribute code that you can pass to various methods of a curses.Window object in order to display text in those colors.
curses lets you create multiple instances of type curses.Window, each corresponding to a rectangle on the screen. You can also create exotic variants, such as instances of Panel, polymorphic with Window but not tied to a fixed screen rectangle. You do not need such advanced functionality in simple curses programs: just use the Window object stdscr that curses.wrapper gives you. Call w.refresh() to ensure that changes made to any Window instance w, including stdscr, show up on screen. curses can buffer the changes until you call refresh.
An instance w of Window supplies, among many others, the following frequently used methods:
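| addstr([y, x, ]s[, attr]) | Puts the characters in the string s, with the attribute attr, on w at the given coordinates (x, y), overwriting any previous contents. All curses functions and methods accept coordinate arguments in reverse order, with y (the row number) before x (the column number). If you omit y and x, addstr uses w’s current cursor coordinates. If you omit attr, addstr uses w’s current default attribute. In any case, when done adding the string, addstr sets w’s current cursor coordinates to the end of the string it has added. |
| clrtobot(), clrtoeol() | clrtoeol writes blanks from w’s current cursor coordinates to the end of the line. clrtobot, in addition, also blanks all lines lower down on the screen. |
| delch([y, x]) | Deletes one character from w at coordinates (x, y). If you omit the y and x arguments, delch uses w’s current cursor coordinates. In any case, delch does not change w’s current cursor coordinates. All the following characters in line y, if any, shift left by one. |
| deleteln() | Deletes from w the entire line at w’s current cursor coordinates, and scrolls up by one line all lines lower down on the screen. |
| erase() | Writes spaces to the entire terminal screen. |
| getch() | Returns an integer c corresponding to a user keystroke. A value between 0 and 255 represents an ordinary character, while a value greater than 255 represents a special key. curses supplies names for special keys, so you can test c for equality with readable constants such as curses.KEY_HOME or curses.KEY_LEFT. If you have set window w to no-delay mode by calling w.nodelay(True), w.getch returns -1 when no keystroke is ready. By default, w.getch waits until the user hits a key. |
| getyx() | Returns w’s current cursor coordinates as a tuple (y, x). |
| insstr([y, x, ]s[, attr]) | Inserts the characters in string s, with attribute attr, on w at coordinates (x, y), shifting the rest of the line rightward. Any characters shifted beyond the end of the line are lost. If you omit y and x, insstr uses w’s current cursor coordinates. If you omit attr, insstr uses w’s current default attribute. |
| move(y, x) | Moves w’s cursor to the given coordinates (x, y). |
| nodelay(flag) | Sets w to no-delay mode when flag is true; resets w back to normal mode when flag is false. No-delay mode affects the method w.getch. |
| refresh() | Updates window w on screen with all changes the program has effected on w. |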
The curses.textpad module supplies the Textbox class, which lets you support advanced input and text editing.
The msvcrt module, available only on Windows, supplies functions that let Python programs access a few proprietary extras supplied by the Microsoft Visual C++’s runtime library msvcrt.dll. Some msvcrt functions let you read user input character by character rather than reading a full line at a time, as listed here:
The cmd module offers a simple way to handle interactive sessions of commands. Each command is a line of text. The first word of each command is a verb defining the requested action. The rest of the line is passed as an argument to the method that implements the verb’s action.
The cmd module supplies the class Cmd to use as a base class, and you define your own subclass of cmd.Cmd. Your subclass supplies methods with names starting with do_ and help_, and may optionally override some of Cmd’s methods. When the user enters a command line such as verb and the rest, as long as your subclass defines a method named do_verb, Cmd.onecmd calls:
self.do_verb('and the rest')
Similarly, as long as your subclass defines a method named help_verb, Cmd.do_help calls the method when the command line starts with 'help verb' or '?verb'. Cmd shows suitable error messages if the user tries to use, or asks for help about, a verb for which the needed method is not defined.
Your subclass of cmd.Cmd, if it defines its own __init__ special method, must call the base class’s __init__, whose signature is as follows:
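| __init__(completekey='tab', stdin=None, stdout=None) | Initializes the instance self with specified or default values for completekey (the name of the key to use for command completion with the readline module; pass None to disable command completion), stdin (the file object to get input from; sys.stdin when None), and stdout (the file object to emit output to; sys.stdout when None). If your subclass does not define __init__, it just inherits the one from the base class cmd.Cmd covered here. In this case, to instantiate your subclass, call it with the optional parameters completekey, stdin, and stdout. |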
An instance c of a subclass of the class Cmd supplies the following methods (many of these methods are “hooks” meant to be optionally overridden by the subclass):
|
cmdloop |
Performs an interactive session of line-oriented commands.
|
|
default |
|
|
do_help |
|
|
emptyline |
|
|
onecmd |
|
|
postcmd |
|
|
postloop |
|
|
precmd |
|
|
preloop |
|
An instance c of a subclass of the class Cmd supplies the following attributes:
identchars: A string whose characters are all those that can be part of a verb; by default, c.identchars contains letters, digits, and an underscore (_).
intro: The message that cmdloop outputs first, when called with no argument.
lastcmd: The last nonblank command line seen by onecmd.
prompt: The string that cmdloop uses to prompt the user for interactive input. You almost always bind c.prompt explicitly, or override prompt as a class attribute of your subclass; the default Cmd.prompt is just '(Cmd) '.
use_rawinput: When false (default is true), cmdloop prompts and inputs via calls to methods of sys.stdout and sys.stdin, rather than via input.
Other attributes of Cmd instances, which we do not cover here, let you exert fine-grained control on many formatting details of help messages.
The following example shows how to use cmd.Cmd to supply the verbs print (to output the rest of the line) and stop (to end the loop):
    import cmd

    class X(cmd.Cmd):
        def do_print(self, rest):
            print(rest)

        def help_print(self):
            print('print (any string): outputs (any string)')

        def do_stop(self, rest):
            return True

        def help_stop(self):
            print('stop: terminates the command loop')

    if __name__ == '__main__':
        X().cmdloop()
A session using this example might proceed as follows:
    C:\>python examples/chapter10/cmdex.py
    (Cmd) help

    Documented commands (type help <topic>):
    ========================================
    print  stop

    Undocumented commands:
    ======================
    help

    (Cmd) help print
    print (any string): outputs (any string)
    (Cmd) print hi there
    hi there
    (Cmd) stop
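As a further illustration of the hook methods listed earlier, here is a minimal v3 sketch. The class Calc, its verbs add and quit, and the attribute total are hypothetical names invented for this example. It overrides emptyline, default, and precmd, and drives the instance through onecmd with output captured in an io.StringIO, so it can run without an interactive terminal.

```python
import cmd
import io

class Calc(cmd.Cmd):
    """Hypothetical example: a tiny accumulator with verbs add and quit."""
    prompt = 'calc> '

    def __init__(self, stdout=None):
        # A subclass that defines __init__ must call the base class's __init__.
        cmd.Cmd.__init__(self, stdout=stdout)
        self.total = 0

    def do_add(self, rest):
        self.total += int(rest)
        self.stdout.write('%d\n' % self.total)

    def do_quit(self, rest):
        return True  # a true result tells cmdloop to stop

    def emptyline(self):
        pass  # ignore empty lines instead of repeating lastcmd

    def default(self, line):
        # Called for any verb with no matching do_ method.
        self.stdout.write('unknown command: %s\n' % line)

    def precmd(self, line):
        # Normalize the whole line (verb and argument) before dispatch.
        return line.strip().lower()

out = io.StringIO()
c = Calc(stdout=out)
for line in ['ADD 2', 'add 40', '', 'frobnicate']:
    c.onecmd(c.precmd(line))
```

Calling c.cmdloop() instead of the loop above would run the same verbs interactively, with precmd applied automatically to each input line.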
Most programs present some information to users as text. Such text should be understandable and acceptable to the user. For example, in some countries and cultures, the date “March 7” can be concisely expressed as “3/7.” Elsewhere, “3/7” indicates “July 3,” and the string that means “March 7” is “7/3.” In Python, such cultural conventions are handled with the help of the standard module locale.
Similarly, a greeting can be expressed in one natural language by the string “Benvenuti,” while in another language the string to use is “Welcome.” In Python, such translations are handled with the help of standard module gettext.
Both kinds of issues are commonly called internationalization (often abbreviated i18n, as there are 18 letters between i and n in the full spelling)—a misnomer, since the same issues apply to different languages or cultures within a single nation.
Python’s support for cultural conventions imitates that of C, slightly simplified. A program operates in an environment of cultural conventions known as a locale. The locale setting permeates the program and is typically set at program startup. The locale is not thread-specific, and the locale module is not thread-safe. In a multithreaded program, set the program’s locale before starting secondary threads.
If your application needs to handle multiple locales at the same time in a single process—whether that’s in threads or asynchronously—locale is not the answer, due to its process-wide nature. Consider alternatives such as PyICU, mentioned in “More Internationalization Resources”.
If a program does not call locale.setlocale, the locale is a neutral one known as the C locale. The C locale is named from this architecture’s origins in the C language; it’s similar, but not identical, to the U.S. English locale. Alternatively, a program can find out and accept the user’s default locale. In this case, the locale module interacts with the operating system (via the environment or in other system-dependent ways) to find the user’s preferred locale. Finally, a program can set a specific locale, presumably determining which locale to set on the basis of user interaction or via persistent configuration settings.
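The three possibilities just described can be sketched as follows. The helper name adopt_user_locale is hypothetical; the sketch assumes that falling back to the C locale is acceptable when the platform rejects the user's preferred locale.

```python
import locale

def adopt_user_locale():
    """Try to switch to the user's default locale; fall back to the C locale."""
    try:
        # An empty string asks the platform for the user's preferred locale.
        return locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        return locale.setlocale(locale.LC_ALL, 'C')

chosen = adopt_user_locale()  # name of the locale now in effect

# Calling setlocale with only a category queries the current setting.
current_numeric = locale.setlocale(locale.LC_NUMERIC)

# Set a specific locale explicitly; here, the neutral C locale.
neutral = locale.setlocale(locale.LC_ALL, 'C')
```

In a multithreaded program, calls like these belong at startup, before any secondary thread begins.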
Locale setting is normally performed across the board for all relevant categories of cultural conventions. This wide-spectrum setting is denoted by the constant attribute LC_ALL of module locale. However, the cultural conventions handled by locale are grouped into categories, and, in some cases, a program can choose to mix and match categories to build up a synthetic composite locale. The categories are identified by the following constant attributes of the locale module:
LC_COLLATE: String sorting; affects functions strcoll and strxfrm in locale
LC_CTYPE: Character types; affects aspects of the module string (and string methods) that have to do with lowercase and uppercase letters
LC_MESSAGES: Messages; may affect messages displayed by the operating system—for example, the function os.strerror and module gettext
LC_MONETARY: Formatting of currency values; affects function locale.localeconv
LC_NUMERIC: Formatting of numbers; affects the functions atoi, atof, format, localeconv, and str in locale
LC_TIME: Formatting of times and dates; affects the function time.strftime
The settings of some categories (denoted by LC_CTYPE, LC_TIME, and LC_MESSAGES) affect behavior in other modules (string, time, os, and gettext, as indicated). Other categories (denoted by LC_COLLATE, LC_MONETARY, and LC_NUMERIC) affect only some functions of locale itself.
The locale module supplies the following functions to query, change, and manipulate locales, as well as functions that implement the cultural conventions of locale categories LC_COLLATE, LC_MONETARY, and LC_NUMERIC:
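A few of these functions can be exercised as in the following v3 sketch, which sets individual categories rather than LC_ALL. It pins LC_COLLATE and LC_NUMERIC to the C locale only so that the results are predictable; the sample words and numbers are arbitrary.

```python
import locale

# Pin the categories used below to the C locale so results are predictable.
locale.setlocale(locale.LC_COLLATE, 'C')
locale.setlocale(locale.LC_NUMERIC, 'C')

words = ['pear', 'Apple', 'fig']
# strxfrm maps each string to a key that sorts according to LC_COLLATE.
sorted_words = sorted(words, key=locale.strxfrm)

# localeconv returns a dict describing numeric and monetary conventions.
conv = locale.localeconv()
decimal_point = conv['decimal_point']

# format_string applies LC_NUMERIC conventions (grouping, decimal point).
text = locale.format_string('%.2f', 1234567.891, grouping=True)

# atof parses a string that follows LC_NUMERIC conventions into a float.
value = locale.atof('3.5')
```

Under a locale other than C, the same calls may order the words differently and insert that locale's grouping separators into text.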
A key issue in internationalization is the ability to use text in different natural languages, a task known as localization. Python supports localization via the module gettext, inspired by GNU gettext. The gettext module is optionally able to use the latter’s infrastructure and APIs, but also offers a simpler, higher-level approach, so you don’t need to install or study GNU gettext to use Python’s gettext effectively.
For full coverage of gettext from a different perspective, see the online docs.
gettext does not deal with automatic translation between natural languages. Rather, it helps you extract, organize, and access the text messages that your program uses. Pass each string literal subject to translation, also known as a message, to a function named _ (underscore) rather than using it directly. gettext normally installs a function named _ in the __builtin__ module (named builtins in v3). To ensure that your program runs with or without gettext, conditionally define a do-nothing function, named _, that just returns its argument unchanged. Then you can safely use _('message') wherever you would normally use a literal 'message' that should be translated if feasible. The following example shows how to start a module for conditional use of gettext:
    try:
        _
    except NameError:
        def _(s): return s

    def greet():
        print(_('Hello world'))
If some other module has installed gettext before you run this example code, the function greet outputs a properly localized greeting. Otherwise, greet outputs the string 'Hello world' unchanged.
Edit your source, decorating message literals with function _. Then use any of various tools to extract messages into a text file (normally named messages.pot) and distribute the file to the people who translate messages into the various natural languages your application must support. Python supplies a script pygettext.py (in directory Tools/i18n in the Python source distribution) to perform message extraction on your Python sources.
Each translator edits messages.pot to produce a text file of translated messages with extension .po. Compile the .po files into binary files with extension .mo, suitable for fast searching, using any of various tools. Python supplies script Tools/i18n/msgfmt.py for this purpose. Finally, install each .mo file with a suitable name in a suitable directory.
Conventions about which directories and names are suitable differ among platforms and applications. gettext’s default is subdirectory share/locale/<lang>/LC_MESSAGES/ of directory sys.prefix, where <lang> is the language’s code (two letters). Each file is named <name>.mo, where <name> is the name of your application or package.
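The directory layout just described can be sketched as follows. The helper default_mo_path is hypothetical and only mirrors the convention stated in the text; gettext.find is the standard function that searches for a catalog file, and here it is pointed at an empty temporary directory so that it finds nothing.

```python
import gettext
import os
import sys
import tempfile

def default_mo_path(name, lang, prefix=sys.prefix):
    """Build the conventional catalog path (illustrative helper only)."""
    return os.path.join(prefix, 'share', 'locale', lang,
                        'LC_MESSAGES', name + '.mo')

path = default_mo_path('your_application_name', 'it')

# gettext.find returns the path of the first matching .mo file, or None.
missing = gettext.find('your_application_name',
                       localedir=tempfile.mkdtemp(), languages=['it'])
```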
Once you have prepared and installed your .mo files, you normally execute, at the time your application starts up, some code such as the following:
    import os, gettext

    os.environ.setdefault('LANG', 'en')  # application-default language
    gettext.install('your_application_name')
This ensures that calls such as _('message') return the appropriate translated strings. You can choose different ways to access gettext functionality in your program—for example, if you also need to localize C-coded extensions, or to switch between languages during a run. Another important consideration is whether you’re localizing a whole application, or just a package that is distributed separately.
gettext supplies many functions; the most often used ones are:
|
install |
Installs in Python’s built-in namespace a function named |
|
translation |
Searches for a .mo file similarly to function
|
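When you prefer not to install _ in the built-in namespace (for example, in a package distributed separately), one possible approach is to call translation yourself, as in this minimal sketch. The helper load_translator is a hypothetical name, and the empty temporary directory stands in for a catalog directory with no .mo files, to show what happens when no translation is available.

```python
import gettext
import tempfile

def load_translator(domain, localedir, languages):
    """Return a translation function; with fallback=True, a missing
    catalog yields a translator that returns messages unchanged."""
    t = gettext.translation(domain, localedir=localedir,
                            languages=languages, fallback=True)
    return t.gettext

empty_dir = tempfile.mkdtemp()  # contains no .mo files
_ = load_translator('your_application_name', empty_dir, ['it'])
greeting = _('Hello world')

# Without fallback=True, a missing catalog raises an exception (OSError in v3).
try:
    gettext.translation('your_application_name',
                        localedir=empty_dir, languages=['it'])
    strict_failed = False
except OSError:
    strict_failed = True
```

Binding _ at module level this way keeps the translation local to your own module, at the cost of repeating the setup in each module that needs it.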
Internationalization is a very large topic. For a general introduction, see Wikipedia. One of the best packages of code and information for internationalization is ICU, which also embeds the Unicode Consortium’s excellent Common Locale Data Repository (CLDR) database of locale conventions, and code to access the CLDR. To use ICU in Python, install the third-party package PyICU.
1 “Curses” does describe well the typical utterances of programmers faced with this rich, complicated approach.