Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition by Steven Levithan Published by O'Reilly Media, Inc., 2012

Replace matches with their corresponding literal characters

Use the regular expression just shown, together with the code in Recipe 3.16. The code examples listed there show how to perform a search-and-replace with replacement text generated in code.

When writing your replacement callback function, use backreferences to determine the appropriate replacement character. If group 1 captured a value, backreference 1 holds a numeric character reference in decimal notation, possibly with leading zeros. If group 2 captured a value, backreference 2 holds a numeric character reference in hexadecimal notation, possibly with leading zeros. If group 3 captured a value, backreference 3 holds an entity name. Use a lookup object, dictionary, hash, or whatever data structure is most convenient to map entity names to their corresponding characters by value or character code. You can then quickly identify which character to use as your replacement text.

The next section uses JavaScript to demonstrate how this all ties together.

Example JavaScript solution

// Accepts the match ($0) and backreferences; returns replacement text
function callback($0, $1, $2, $3) {
    var charCode;
    // Name lookup object that maps to decimal character codes
    // Equivalent hexadecimal numbers are listed in comments
    var names = {
        quot: 34, // 0x22
        amp:  38, // 0x26
        apos: 39, // 0x27
        lt:   60, // 0x3C
        gt:   62  // 0x3E
    };

    // Decimal character reference
    if ($1) {
        charCode = parseInt($1, 10);
    // Hexadecimal character reference
    } else if ($2) {
        charCode = parseInt($2, 16);
    // Named entity with a lookup mapping; hasOwnProperty() is used instead
    // of the "in" operator so that inherited property names such as
    // "toString" are not mistaken for entity names
    } else if ($3 && names.hasOwnProperty($3)) {
        charCode = names[$3];
    // Invalid or unknown entity name
    } else {
        return $0; // Return the match unaltered
    }

    // Return a literal character
    return String.fromCharCode(charCode);
}

// Replace all entities with literal text
subject = subject.replace(
    /&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));/g,
    callback);

Discussion

The regular expression and example code we’ve shown in this recipe are intended for decoding snippets of XML-style text, rather than entire XML documents. The regex here can be useful when converting XML or (X)HTML content to plain text, but keep in mind that no restrictions are placed on where named or numbered entities can occur within the subject text. For instance, there is no special handling for skipping entities in XML CDATA blocks or HTML script blocks.
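If CDATA sections do need to be left alone, one possible approach is to match them first in the same regex and return them unchanged from the callback. The following is only a sketch under that assumption: the function name decodeOutsideCdata is hypothetical, and the CDATA alternative is an addition that is not part of the recipe's regex. It does nothing about HTML script blocks.

```javascript
// Sketch: decode entities but leave XML CDATA sections untouched.
// decodeOutsideCdata is a hypothetical helper name, not part of the recipe.
function decodeOutsideCdata(subject) {
    var names = { quot: 34, amp: 38, apos: 39, lt: 60, gt: 62 };
    // The first alternative consumes a whole CDATA section without capturing
    // anything, so all three groups are undefined when it matches.
    var regex = /<!\[CDATA\[[\s\S]*?\]\]>|&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));/g;
    return subject.replace(regex, function (match, dec, hex, name) {
        if (dec) {
            return String.fromCharCode(parseInt(dec, 10));
        }
        if (hex) {
            return String.fromCharCode(parseInt(hex, 16));
        }
        if (name && names.hasOwnProperty(name)) {
            return String.fromCharCode(names[name]);
        }
        // CDATA section or unknown entity name: return the match unaltered
        return match;
    });
}
```

Because the regex engine scans from left to right, any entity-like text inside a CDATA section is consumed as part of the CDATA match and never reaches the entity alternatives.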

The JavaScript example code converts both decimal and hexadecimal numeric references to their corresponding literal characters, and additionally converts the five named entities that are defined in the XML standard: &quot; ("), &amp; (&), &apos; ('), &lt; (<), and &gt; (>). HTML includes many more named entities that aren’t covered here.^[22] If you follow the approach used in the example code, however, it should be straightforward to add as many more entity names as you need.
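As an illustration of adding entity names, the sketch below extends the lookup object with a handful of HTML names. The selection of names and the function name decodeWithHtmlNames are assumptions made for this example; a real application would add whichever names it needs.

```javascript
// Sketch: the recipe's lookup object extended with a few HTML entity names.
// decodeWithHtmlNames is a hypothetical helper name.
var htmlNames = {
    // The five XML entities
    quot: 34, amp: 38, apos: 39, lt: 60, gt: 62,
    // A small, arbitrary selection of HTML entities
    nbsp:   160,  // 0xA0
    copy:   169,  // 0xA9
    reg:    174,  // 0xAE
    eacute: 233,  // 0xE9
    euro:   8364  // 0x20AC
};

function decodeWithHtmlNames(subject) {
    return subject.replace(
        /&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));/g,
        function (match, dec, hex, name) {
            if (dec) {
                return String.fromCharCode(parseInt(dec, 10));
            }
            if (hex) {
                return String.fromCharCode(parseInt(hex, 16));
            }
            if (name && htmlNames.hasOwnProperty(name)) {
                return String.fromCharCode(htmlNames[name]);
            }
            return match; // Unknown name: leave the text as it is
        });
}
```

Entity names are case sensitive, so each name must be entered in the lookup object exactly as it is written in the markup.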

The JavaScript example code converts the following subject string:

"&lt; &bogus; dec &#65;&#0065; &amp;lt; hex &#x41;&#x041; &gt;"

To this:

"< &bogus; dec AA &lt; hex AA >"

JavaScript strings are made up of 16-bit code units, and String.fromCharCode() treats each argument as a single code unit. The provided code therefore works correctly only with numeric character references up to &#xFFFF; hexadecimal and &#65535; decimal. This shouldn’t be a problem in most cases, since characters beyond this range are rare. Numeric character references with numbers above this range are invalid in the first edition of the XML 1.0 standard.
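If references above this range do matter, one option is to build the UTF-16 surrogate pair by hand. The helper below is a sketch that is not part of the recipe; the name codePointToString is hypothetical, and it assumes the caller wants a RangeError for values beyond U+10FFFF.

```javascript
// Sketch: convert a Unicode code point to a JavaScript string, using a
// surrogate pair for code points above U+FFFF.
// codePointToString is a hypothetical helper name.
function codePointToString(codePoint) {
    if (codePoint > 0x10FFFF) {
        throw new RangeError("Not a Unicode code point: " + codePoint);
    }
    if (codePoint <= 0xFFFF) {
        return String.fromCharCode(codePoint);
    }
    var offset = codePoint - 0x10000;
    return String.fromCharCode(
        0xD800 + (offset >> 10),   // High (lead) surrogate
        0xDC00 + (offset & 0x3FF)  // Low (trail) surrogate
    );
}
```

Calling this helper in place of String.fromCharCode() in the callback would let a reference such as &#x1F600; produce a two-code-unit string. Engines that implement ECMAScript 2015 or later also provide String.fromCodePoint(), which performs the same conversion.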
Tip
Some programming languages and XML APIs have built-in functions to perform XML or HTML entity decoding. For instance, in PHP 4.3 and later you can use the function html_entity_decode(). It might still be helpful to implement your own method since such functions vary in which entity names they recognize. In some cases, such as with Ruby’s CGI::unescapeHTML(), even fewer than the standard five XML named entities are recognized.

9.6. Decode XML Entities

Problem

Solution

Regular expression

Replace matches with their corresponding literal characters

Example JavaScript solution

Discussion

Tip

See Also
