Processing Markup and Data Formats with Regular Expressions

This final chapter focuses on common tasks that come up when working with an assortment of common markup languages and data formats: HTML, XHTML, XML, CSV, and INI. Although we’ll assume at least basic familiarity with these technologies, a brief description of each is included next to make sure we’re on the same page before digging in. The descriptions concentrate on the basic syntax rules needed to correctly search through the data structures of each format. Other details will be introduced as we encounter relevant issues.

Although it’s not always apparent on the surface, some of these formats can be surprisingly complex to process and manipulate accurately, at least using regular expressions. When programming, it’s usually best to use dedicated parsers and APIs instead of regular expressions for many of the tasks in this chapter, especially if accuracy is paramount (e.g., if your processing might have security implications). However, we don’t subscribe to a dogmatic view that XML-style markup should never be processed with regular expressions. There are cases when regular expressions are a great tool for the job, such as when making one-time edits in a text editor, scraping data from a limited set of HTML files, fixing broken XML files, or dealing with file formats that look like XML but aren’t quite. There are some issues to be aware of, but reading through this chapter will ensure that you don’t stumble into them blindly.

For help with implementing parsers that use regular expressions to tokenize custom data formats, see Construct a Parser.

Basic Rules for Formats Covered in This Chapter

Following are the basic syntax rules for HTML, XHTML, XML, CSV, and INI files. Keep in mind that some of the difficulties we’ll encounter throughout this chapter involve how we should handle cases that deviate from the following rules in expected or unexpected ways.

Hypertext Markup Language (HTML)

HTML is used to describe the structure, semantics, and appearance of billions of web pages and other documents. Although processing HTML using regular expressions is a popular task, you should know up front that the language is poorly suited to the rigidity and precision of regular expressions. This is especially true of the bastardized HTML that is common on many web pages, thanks in part to the extreme tolerance for poorly constructed HTML that web browsers are known for. In this chapter we’ll concentrate on the rules needed to process the key components of well-formed HTML: elements (and the attributes they contain), character references, comments, and document type declarations. This book covers HTML 4.01 (finalized in 1999) and the latest HTML5 draft as of mid-2012.

The basic HTML building blocks are called elements. Elements are written using tags, which are surrounded by angle brackets. Elements usually have both a start tag (e.g., <html>) and end tag (</html>). An element’s start tag may contain attributes, which are described later. Between the tags is the element’s content, which can be composed of text and other elements or left empty. Elements may be nested, but cannot overlap (e.g., <div><div></div></div> is OK, but not <div><span></div></span>). For some elements (such as <p>, which marks a paragraph), the end tag is optional. A few elements (including <br>, which terminates a line) cannot contain content, and never use an end tag. However, an empty element may still contain attributes. Empty elements may optionally end with />, as in <br/>. HTML element names start with a letter from A–Z. All valid elements use only letters and numbers in their names. Element names are case-insensitive.
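As a rough illustration of the naming rule just described, the following Python sketch finds start and end tags whose names begin with a letter and continue with letters or digits. The function name tag_names and the sample markup are assumptions made for this example. The pattern skips attributes naively, so it will misfire on a > inside a quoted attribute value.

```python
import re

# Name rule from the text: a letter first, then letters or digits.
# [^>]* skips over attributes naively; it breaks on ">" inside quoted values.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>")

def tag_names(html):
    """Return (kind, lowercased name) pairs for each tag found, in order."""
    result = []
    for m in TAG.finditer(html):
        kind = "end" if m.group(1) else "start"
        # Element names are case-insensitive, so normalize to lowercase.
        result.append((kind, m.group(2).lower()))
    return result

print(tag_names('<DIV class="x"><h1>Hi</h1><br/></div>'))
```

Lowercasing the captured name is one simple way to honor case-insensitivity when comparing element names later.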

<script> and <style> elements warrant special consideration because they let you embed scripting language code and stylesheets in your document. These elements end after the first occurrence of </style> or </script>, even if it appears within a comment or string inside the style or scripting language.
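The "first occurrence" rule maps naturally onto a lazy quantifier. The Python sketch below is only an approximation of how a browser finds the end of these elements. The names SCRIPT_OR_STYLE and embedded_blocks are invented for this example, and the start-tag handling ignores > inside quoted attribute values.

```python
import re

# Lazy .*? stops at the first matching end tag, mirroring the rule above.
# re.S lets "." cross line breaks; re.I handles case-insensitive names.
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>(.*?)</\1\s*>", re.I | re.S
)

def embedded_blocks(html):
    """Return (element name, raw content) for each <script>/<style> element."""
    return [(m.group(1).lower(), m.group(2))
            for m in SCRIPT_OR_STYLE.finditer(html)]

# The end tag inside the JavaScript string still terminates the element.
sample = '<script>var s = "</script>"; done();</script>'
print(embedded_blocks(sample))
```

The sample shows why script code that needs the literal text of an end tag has to break it up or escape it in some way.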

Attributes appear within an element’s start tag after the element name, and are separated by one or more whitespace characters. Most attributes are written as name-value pairs. The following example shows an <a> (anchor) element with two attributes and the content “Click me!”:

<a href="http://www.regexcookbook.com"
    title = 'Regex Cookbook'>Click me!</a>

As shown here, an attribute’s name and value are separated by an equals sign and optional whitespace. The value is enclosed with single or double quotes. To use the enclosing quote type within the value, you must use a character reference (described next). The enclosing quote characters are not required if the value does not contain any of the characters double quote, single quote, grave accent, equals, less than, greater than, or whitespace (written in regex, that’s ‹^[^"'`=<>\s]+$›). A few attributes (such as the selected and checked attributes used with some form elements) affect the element that contains them simply by their presence, and do not require a value. In these cases, the equals sign that separates an attribute’s name and value is also omitted. Alternatively, these “minimized” attributes may reuse their name as their value (e.g., selected="selected"). Attribute names start with a letter from A–Z. All valid attributes use only letters, hyphens, and colons in their names. Attributes may appear in any order, and their names are case-insensitive.
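To make the quoting rule concrete, here is a minimal Python sketch that writes an attribute and adds double quotes only when the value needs them. The function name format_attribute is an assumption for illustration. The sketch also assumes the value is raw text containing no character references of its own, since it escapes every ampersand.

```python
import re

# Character class from the regex quoted in the text: no quotes, grave accent,
# equals sign, angle brackets, or whitespace anywhere in the value.
UNQUOTED_OK = re.compile(r"""[^"'`=<>\s]+""")

def format_attribute(name, value):
    """Render name=value, adding double quotes only when they are required."""
    # fullmatch stands in for ^...$ because Python's $ also matches just
    # before a trailing newline, which would let "abc\n" slip through.
    if UNQUOTED_OK.fullmatch(value):
        return f"{name}={value}"
    # Escape ampersands first, then the enclosing quote character.
    escaped = value.replace("&", "&amp;").replace('"', "&quot;")
    return f'{name}="{escaped}"'

print(format_attribute("width", "100"))
print(format_attribute("title", "Regex Cookbook"))
print(format_attribute("alt", 'say "hi"'))
```

Always quoting is the simpler policy in practice. The conditional branch is here only to show where the rule's boundary lies.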

HTML5 defines more than 2,000 named character references^[15] and more than a million numeric character references (collectively, we’ll call these character references). Numeric character references refer to a character by its Unicode code point, and use the format &#nnnn; or &#xhhhh;, where nnnn is one or more decimal digits from 0–9 and hhhh is one or more hexadecimal digits 0–9 and A–F (case-insensitive). Named character references are written as &entityname; (case-sensitive, unlike most other aspects of HTML), and are especially helpful when entering literal characters that are sensitive in some contexts, such as angle brackets (&lt; and &gt;), double quotes (&quot;), and ampersands (&amp;).

Also common is the &nbsp; entity (no-break space, position 0xA0), which is particularly useful since all occurrences of this character are rendered, even when they appear in sequence. Spaces, tabs, and line breaks are all normally rendered as a single space character, even if many of them are entered in a row. The ampersand character (&) cannot be used outside of character references.
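The two numeric formats can be matched with a single pattern. The following Python sketch decodes them and leaves named references alone. The function name decode_numeric_refs is invented for this example, and accepting an uppercase X as the hexadecimal marker is an assumption that goes slightly beyond the format shown above.

```python
import re

# Decimal (&#nnnn;) or hexadecimal (&#xhhhh;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def decode_numeric_refs(text):
    """Replace numeric character references with the characters they name."""
    def replace(m):
        code = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        # Leave references beyond the Unicode range untouched.
        return chr(code) if code <= 0x10FFFF else m.group(0)
    return NUMERIC_REF.sub(replace, text)

print(decode_numeric_refs("&#60;p&#x3E; &amp; more"))
```

For real documents, Python's standard html.unescape function handles named references as well and is usually the better choice.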

HTML comments have the following syntax:

<!-- this is a comment -->
<!-- so is this, but this comment
    spans more than one line -->

Content within comments has no special meaning, and is hidden from view by most user agents. For compatibility with ancient (pre-1995) browsers, some people surround the content of <script> and <style> elements with an HTML comment. Modern browsers ignore these comments and process the script or style content normally.
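A comment can be matched as everything from <!-- up to the first -->. The Python sketch below shows one simple way to remove comments; the function name strip_comments is an assumption for illustration. It makes no attempt to protect comment-like text inside <script> or <style> elements, so it should not be run blindly on pages that use the old hiding trick just described.

```python
import re

# A comment runs from "<!--" to the first "-->"; re.S lets it span lines.
COMMENT = re.compile(r"<!--.*?-->", re.S)

def strip_comments(html):
    """Remove every HTML comment from the given markup."""
    return COMMENT.sub("", html)

print(strip_comments("<p>keep<!-- drop\nthis --> this</p>"))
```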

HTML documents often start with a document type declaration (informally, a DOCTYPE), which identifies the permitted and prohibited content for the document. The DOCTYPE looks a bit similar to an HTML element, as shown in the following line used with documents wishing to conform to the HTML 4.01 strict definition:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">

Here is the standard HTML5 DOCTYPE:

<!DOCTYPE html>

Finally, HTML5 allows CDATA sections, but only within embedded MathML and SVG content. CDATA sections were brought over from XML, and are used to escape blocks of text. They begin with the string <![CDATA[ and end with the first occurrence of ]]>.
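Because a CDATA section ends at the first ]]>, a lazy quantifier again does most of the work. This Python sketch extracts the raw text of each section. The function name cdata_contents is invented for the example, and the pattern does not check that the section actually sits inside MathML or SVG content.

```python
import re

# Everything between "<![CDATA[" and the first "]]>", line breaks included.
CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.S)

def cdata_contents(markup):
    """Return the raw text of every CDATA section in the markup."""
    return CDATA.findall(markup)

print(cdata_contents("<svg><![CDATA[a < b && c]]></svg>"))
```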

So that’s the physical structure of an HTML document in a nutshell.^[16] Be aware that real-world HTML is often rife with deviations from these rules, and that most browsers are happy to accommodate the deviations. Beyond these basics, each element has restrictions on the content and attributes that may appear within it in order for an HTML document to be considered valid. Such content rules are beyond the scope of this book, but O’Reilly’s HTML & XHTML: The Definitive Guide by Chuck Musciano and Bill Kennedy is a good source if you need more information.

Tip

Because the syntax of HTML is very similar to XHTML and XML (both described next), many regular expressions in this chapter are written to support all three markup languages.

Extensible Hypertext Markup Language (XHTML)

XHTML was designed as the successor to HTML 4.01, and migrated HTML from its SGML heritage to an XML foundation. However, development of HTML continued separately. XHTML5 is now being developed as part of the HTML5 specification, and will be the XML serialization of HTML5 rather than introducing new features of its own. This book covers XHTML 1.0, 1.1, and 5.^[17] Although XHTML syntax is largely backward-compatible with HTML, there are a few key differences from the HTML structure we’ve just described:

XHTML documents may start with an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>.
Nonempty elements must have a closing tag. Empty elements must either use a closing tag or end with />.
Element and attribute names are case-sensitive and use lowercase.
Due to the use of XML namespace prefixes, both element and attribute names may include a colon, in addition to the characters found in HTML names.
Unquoted attribute values are not allowed. Attribute values must be enclosed in single or double quotes.
Attributes must have an accompanying value.

There are a number of other differences between HTML and XHTML that mostly affect edge cases and error handling, but generally they do not affect the regexes in this chapter. For more on the differences between HTML and XHTML, see http://www.w3.org/TR/xhtml1/#diffs and http://wiki.whatwg.org/wiki/HTML_vs._XHTML.

Tip

Because the syntax of XHTML is a subset of HTML (as of HTML5) and is formed from XML, many regular expressions in this chapter are written to support all three of these markup languages. Recipes that refer to “(X)HTML” handle HTML and XHTML equally. You usually cannot depend on a document using only HTML or XHTML conventions, since mix-ups are common and web browsers generally don’t mind.

Extensible Markup Language (XML)

XML is a general-purpose language designed primarily for sharing structured data. It is used as the foundation to create a wide array of markup languages, including XHTML, which we’ve just discussed. This book covers XML versions 1.0 and 1.1. A full description of XML features and grammar is beyond the scope of this book, but for our purposes, there are only a few key differences from the HTML syntax we’ve already described:

XML documents may start with an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>, and may contain other, similarly formatted processing instructions. For example, <?xml-stylesheet type="text/xsl" href="transform.xslt"?> specifies that the XSL transformation file transform.xslt should be applied to the document.

The DOCTYPE may include internal markup declarations within square brackets. For example:

<!DOCTYPE example [
  <!ENTITY copy "&#169;">
  <!ENTITY copyright-notice "Copyright &copy; 2012, O'Reilly">
]>

Nonempty elements must have a closing tag. Empty elements must either use a closing tag or end with />.
XML names (which govern the rules for element, attribute, and character reference names) are case-sensitive, and may use a large group of Unicode characters. The allowed characters include A–Z, a–z, colon, and underscore, as well as 0–9, hyphen, and period after the first character. See Recipe 9.4 for more details.
Unquoted attribute values are not allowed. Attribute values must be enclosed in single or double quotes.
Attributes must have an accompanying value.

There are many other rules that must be adhered to when authoring well-formed XML documents, or if you want to write your own conforming XML parser. However, the rules we’ve just described (appended to the structure we’ve already outlined for HTML documents) are generally enough for simple regex searches.
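For quick searches, the name rule in the list above is often approximated using only the ASCII characters it mentions. The Python sketch below does exactly that. The function name is_ascii_xml_name is an assumption for this example, and the pattern deliberately ignores the large group of non-ASCII characters that real XML names may contain.

```python
import re

# ASCII-only approximation: a letter, underscore, or colon first, then any
# of those plus digits, hyphens, and periods.
XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")

def is_ascii_xml_name(name):
    """Check a candidate name against the ASCII subset of the XML name rule."""
    return XML_NAME.fullmatch(name) is not None

for candidate in ["copyright-notice", "xml:lang", "1abc", "-a"]:
    print(candidate, is_ascii_xml_name(candidate))
```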

Tip

Because the syntax of XML is very similar to HTML and forms the basis of XHTML, many regular expressions in this chapter are written to support all three markup languages. Recipes that refer to “XML-style” markup handle XML, XHTML, and HTML equally.

Comma-Separated Values (CSV)

CSV is an old but still very common file format used for spreadsheet-like data. The CSV format is supported by most spreadsheets and database management systems, and is especially popular for exchanging data between applications. Although there is no official CSV specification, an attempt at a common definition was published in October 2005 as RFC 4180 and registered with IANA as MIME type “text/csv.” Before this RFC was published, the CSV conventions used by Microsoft Excel had been established as more or less a de facto standard. Because the RFC specifies rules that are very similar to those used by Excel, this doesn’t present much of a problem. This chapter covers the CSV formats specified by RFC 4180 and used by Microsoft Excel 2003 and later.

As the name suggests, CSV files contain a list of values, known as record items or fields, that are separated by commas. Each row, or record, starts on a new line. The last field in a record is not followed by a comma. The last record in a file may or may not be followed by a line break. Throughout the entire file, each record should have the same number of fields.

The value of each CSV field may be unadorned or enclosed with double quotes. Fields may also be entirely empty. Any field that contains commas, double quotes, or line breaks must be enclosed in double quotes. A double quote appearing inside a field is escaped by preceding it with another double quote.
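The quoting and escaping rules are short enough to express directly. The following Python sketch writes a record by hand and then reads it back with the standard csv module as a sanity check. The function names quote_field and format_record are assumptions made for this example.

```python
import csv
import io

def quote_field(value):
    """Quote a field only if it contains a comma, double quote, or line break."""
    if any(ch in value for ch in ',"\r\n'):
        # Escape each embedded double quote by doubling it.
        return '"' + value.replace('"', '""') + '"'
    return value

def format_record(fields):
    """Join the fields of one record with commas."""
    return ",".join(quote_field(f) for f in fields)

line = format_record(["aaa", "b b", '"c" cc'])
print(line)  # aaa,b b,"""c"" cc"

# Round-trip through the standard csv module as a sanity check.
assert next(csv.reader(io.StringIO(line))) == ["aaa", "b b", '"c" cc']
```

When writing real files, csv.writer applies the same rules and also takes care of line endings.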

The first record in a CSV file is sometimes used as a header with the names of each column. This cannot be programmatically determined from the content of a CSV file alone, so some applications prompt the user to decide how the first row should be handled.

RFC 4180 specifies that leading and trailing spaces in a field are part of the value. Some older versions of Excel ignored these spaces, but Excel 2003 and later follow the RFC on this point. The RFC does not specify error handling for unescaped double quotes or pretty much anything else. Excel’s handling can be a bit unpredictable in edge cases, so it’s important to ensure that double quotes are escaped, fields containing double quotes are themselves enclosed with double quotes, and quoted fields do not contain leading or trailing spaces outside of the quotes.

The following CSV example demonstrates many of the rules we’ve just discussed. It contains two records with three fields each:

aaa,b b,"""c"" cc"
1,,"333, three,
still more threes"

Table 9-1 shows how the CSV content just shown would be displayed in a table.

Table 9-1. Example CSV output

`aaa`	`b b`	`"c" cc`
`1`	(empty)	333, three, still more threes

^[15]Many characters have more than one corresponding named character reference in HTML5. For instance, the symbol ≈ has six: &asymp;, &ap;, &approx;, &thkap;, &thickapprox;, and &TildeTilde;.
^[16]HTML 4.01 defines some esoteric SGML features, including processing instructions (using a different syntax than XML) and shorthand markup, but recommends against their use. In this chapter, we act as if these features don’t exist, because browsers do the same. If you wish, you can read about their syntax in Appendix B of the HTML 4.01 specification, in sections B.3.5–7. HTML5 explicitly removes support for these features, which browsers don’t use anyway.
^[17]If you’re wondering about the missing version numbers, XHTML 2.0 was in development by the W3C for several years before being scrapped in favor of a refocus on HTML5. XHTML version numbers 3–4 were skipped outright.

Table of Contents for Regular Expressions Cookbook, 2nd Edition
