Table of Contents for
JavaScript: The Good Parts

Version ebook / Retour

Cover image for bash Cookbook, 2nd Edition JavaScript: The Good Parts by Douglas Crockford Published by O'Reilly Media, Inc., 2008
  1. Cover
  2. JavaScript: The Good Parts
  3. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  4. A Note Regarding Supplemental Files
  5. Preface
  6. Using Code Examples
  7. Safari® Books Online
  8. How to Contact Us
  9. Acknowledgments
  10. 1. Good Parts
  11. Analyzing JavaScript
  12. A Simple Testing Ground
  13. 2. Grammar
  14. Names
  15. Numbers
  16. Strings
  17. Statements
  18. Expressions
  19. Literals
  20. Functions
  21. 3. Objects
  22. Retrieval
  23. Update
  24. Reference
  25. Prototype
  26. Reflection
  27. Enumeration
  28. Delete
  29. Global Abatement
  30. 4. Functions
  31. Function Literal
  32. Invocation
  33. Arguments
  34. Return
  35. Exceptions
  36. Augmenting Types
  37. Recursion
  38. Scope
  39. Closure
  40. Callbacks
  41. Module
  42. Cascade
  43. Curry
  44. Memoization
  45. 5. Inheritance
  46. Object Specifiers
  47. Prototypal
  48. Functional
  49. Parts
  50. 6. Arrays
  51. Length
  52. Delete
  53. Enumeration
  54. Confusion
  55. Methods
  56. Dimensions
  57. 7. Regular Expressions
  58. Construction
  59. Elements
  60. 8. Methods
  61. 9. Style
  62. 10. Beautiful Features
  63. A. Awful Parts
  64. Scope
  65. Semicolon Insertion
  66. Reserved Words
  67. Unicode
  68. typeof
  69. parseInt
  70. +
  71. Floating Point
  72. NaN
  73. Phony Arrays
  74. Falsy Values
  75. hasOwnProperty
  76. Object
  77. B. Bad Parts
  78. with Statement
  79. eval
  80. continue Statement
  81. switch Fall Through
  82. Block-less Statements
  83. ++ −−
  84. Bitwise Operators
  85. The function Statement Versus the function Expression
  86. Typed Wrappers
  87. new
  88. void
  89. C. JSLint
  90. Members
  91. Options
  92. Semicolon
  93. Line Breaking
  94. Comma
  95. Required Blocks
  96. Forbidden Blocks
  97. Expression Statements
  98. for in Statement
  99. switch Statement
  100. var Statement
  101. with Statement
  102. =
  103. == and !=
  104. Labels
  105. Unreachable Code
  106. Confusing Pluses and Minuses
  107. ++ and −−
  108. Bitwise Operators
  109. eval Is Evil
  110. void
  111. Regular Expressions
  112. Constructors and new
  113. Not Looked For
  114. HTML
  115. JSON
  116. Report
  117. D. Syntax Diagrams
  118. E. JSON
  119. Using JSON Securely
  120. A JSON Parser
  121. Index
  122. About the Author
  123. Colophon
  124. SPECIAL OFFER: Upgrade this ebook with O’Reilly

Elements

Let's look more closely at the elements that make up regular expressions.

Regexp Choice

image with no caption

A regexp choice contains one or more regexp sequences. The sequences are separated by the | (vertical bar) character. The choice matches if any of the sequences match. It attempts to match each of the sequences in order. So:

"into".match(/in|int/)

matches the in in into. It wouldn't match int because the match of in was successful.

Regexp Sequence

image with no caption

A regexp sequence contains one or more regexp factors. Each factor can optionally be followed by a quantifier that determines how many times the factor is allowed to appear. If there is no quantifier, then the factor will be matched one time.

Regexp Factor

image with no caption

A regexp factor can be a character, a parenthesized group, a character class, or an escape sequence. All characters are treated literally except for the control characters and the special characters:

\ / [ ] ( ) { } ? + * | . ^ $

which must be escaped with a \ prefix if they are to be matched literally. When in doubt, any special character can be given a \ prefix to make it literal. The \ prefix does not make letters or digits literal.

An unescaped . matches any character except a line-ending character.

An unescaped ^ matches the beginning of the text when the lastIndex property is zero. It can also match line-ending characters when the m flag is specified.

An unescaped $ matches the end of the text. It can also match line-ending characters when the m flag is specified.

Regexp Escape

image with no caption

The backslash character indicates escapement in regexp factors as well as in strings, but in regexp factors, it works a little differently.

As in strings, \f is the formfeed character, \n is the newline character, \r is the carriage return character, \t is the tab character, and \u allows for specifying a Unicode character as a 16-bit hex constant. In regexp factors, \b is not the backspace character.

\d is the same as [0-9]. It matches a digit. \D is the opposite: [^0-9].

\s is the same as [\f\n\r\t\u000B\u0020\u00A0\u2028\u2029]. This is a partial set of Unicode whitespace characters. \S is the opposite: [^\f\n\r\t\u000B\u0020\u00A0\u2028\u2029].

\w is the same as [0-9A-Z_a-z]. \W is the opposite: [^0-9A-Z_a-z]. This is supposed to represent the characters that appear in words. Unfortunately, the class it defines is useless for working with virtually any real language. If you need to match a class of letters, you must specify your own class.

A simple letter class is [A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]. It includes all of Unicode's letters, but it also includes thousands of characters that are not letters. Unicode is large and complex. An exact letter class of the Basic Multilingual Plane is possible, but would be huge and inefficient. JavaScript's regular expressions provide extremely poor support for internationalization.

\b was intended to be a word-boundary anchor that would make it easier to match text on word boundaries. Unfortunately, it uses \w to find word boundaries, so it is completely useless for multilingual applications. This is not a good part.

\1 is a reference to the text that was captured by group 1 so that it can be matched again. For example, you could search text for duplicated words with:

var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/gi;

doubled_words looks for occurrences of words (strings containing 1 or more letters) followed by whitespace followed by the same word.

\2 is a reference to group 2, \3 is a reference to group 3, and so on.

Regexp Group

image with no caption

There are four kinds of groups:

Capturing

A capturing group is a regexp choice wrapped in parentheses. The characters that match the group will be captured. Every capture group is given a number. The first capturing ( in the regular expression is group 1. The second capturing ( in the regular expression is group 2.

Noncapturing

A noncapturing group has a (?: prefix. A noncapturing group simply matches; it does not capture the matched text. This has the advantage of slight faster performance. Noncapturing groups do not interfere with the numbering of capturing groups.

Positive lookahead

A positive lookahead group has a (?= prefix. It is like a noncapturing group except that after the group matches, the text is rewound to where the group started, effectively matching nothing. This is not a good part.

Negative lookahead

A negative lookahead group has a (?! prefix. It is like a positive lookahead group, except that it matches only if it fails to match. This is not a good part.

Regexp Class

image with no caption

A regexp class is a convenient way of specifying one of a set of characters. For example, if we wanted to match a vowel, we could write (?:a|e|i|o|u), but it is more conveniently written as the class [aeiou].

Classes provide two other conveniences. The first is that ranges of characters can be specified. So, the set of 32 ASCII special characters:

! " # $ % & ' ( ) * +, - . / :
; < = > ? @ [ \ ] ^ _ ` { | } ˜

could be written as:

(?:!|"|#|\$|%|&|'|\(|\)|\*|\+|,|-|\.|\/|:|;|<|=|>|@|\[|\\|]|\^|_|` |\{|\||\}|˜)

but is slightly more nicely written as:

[!-\/:-@\[-`{-˜]

which includes the characters from ! through / and : through @ and [ through ` and { through ˜. It is still pretty nasty looking.

The other convenience is the complementing of a class. If the first character after the [ is ^, then the class excludes the specified characters.

So [^!-\/:-@\[-`{-˜] matches any character that is not one of the ASCII special characters.

Regexp Class Escape

image with no caption

The rules of escapement within a character class are slightly different than those for a regexp factor. [\b] is the backspace character. Here are the special characters that should be escaped in a character class:

- / [ \ ] ^

Regexp Quantifier

image with no caption

A regexp factor may have a regexp quantifier suffix that determines how many times the factor should match. A number wrapped in curly braces means that the factor should match that many times. So, /www/ matches the same as /w{3}/. {3,6} will match 3, 4, 5, or 6 times. {3,} will match 3 or more times.

? is the same as {0,1}. * is the same as {0,}. + is the same as {1,}.

Matching tends to be greedy, matching as many repetitions as possible up to the limit, if there is one. If the quantifier has an extra ? suffix, then matching tends to be lazy, attempting to match as few repetitions as possible. It is usually best to stick with the greedy matching.