Regular Expressions Cookbook, 2nd Edition
by Jan Goyvaerts and Steven Levithan
Published by O'Reilly Media, Inc., 2012
RegexMagic runs on Windows 98, ME, 2000, XP, Vista, 7, and 8. For Linux and Apple fans, RegexMagic also runs well on VMware, Parallels, CrossOver Office, and with a few issues on WINE. You can download a free evaluation copy of RegexMagic at http://www.regexmagic.com/RegexMagicCookbook.exe. Except for the user forum, the trial is fully functional for seven days of actual use.
Creating a simple online regular expression tester is easy. If you have some basic web development skills, the information in Chapter 3 is all you need to roll your own. Hundreds of people have already done this; a few have added some extra features that make them worth mentioning.
RegexPlanet is a website developed by Andrew Marcuse. Its claim to fame is that it allows you to test your regexes against a larger variety of regular expression libraries than any other regex tester we are aware of. On the home page you’ll find links to testers for Java, JavaScript, .NET, Perl, PHP, Python, and Ruby. They all use the same basic interface. Only the list of options is adapted to those of each programming language. Figure 1-4 shows the .NET version.
Type or paste your regular expression into the “regular expression” box. If you want to test a search-and-replace, paste the replacement text into the “replacement” box. You can test your regex against as many different subject strings as you like. Paste your subject strings into the “input” boxes. Click “more inputs” if you need more than five. The “regex” and “input” boxes allow you to type or paste in multiple lines of text, even though they only show one line at a time. The arrows at the right are the scrollbar.
When you’re done, click the “test” button to send all your strings to the regexplanet.com server. The resulting page, as shown in Figure 1-4, lists the test results at the top. The first two columns repeat your input. The remaining columns show the results of various function calls. These columns are different for the various programming languages that the site supports.
Lars Olav Torvik has put a great little regular expression tester online at http://regex.larsolavtorvik.com (see Figure 1-5).
To start, select the regular expression flavor you’re working with by clicking on the flavor’s name at the top of the page. Lars offers PHP PCRE, PHP POSIX, and JavaScript. PHP PCRE, the PCRE regex flavor discussed in this book, is used by PHP’s preg functions. POSIX is an old and limited regex flavor used by PHP’s ereg functions, which are not discussed in this book. If you select JavaScript, you’ll be working with your browser’s JavaScript implementation.
Type your regular expression into the Pattern field and your subject text into the Subject field. A moment later, the Matches field displays your subject text with highlighted regex matches. The Code field displays a single line of source code that applies your regex to your subject text. Copying and pasting this into your code editor saves you the tedious job of manually converting your regex into a string literal. Any string or array returned by the code is displayed in the Result field. Because Lars used Ajax technology to build his site, results are updated in just a few moments for all flavors. To use the tool, you have to be online, as PHP is processed on the server rather than in your browser.
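To show what "converting your regex into a string literal" involves, here is a minimal Python sketch. The regex and the subject text are made up for illustration and do not come from the tester itself. The same pattern is written once as a raw string and once as an ordinary string with every backslash doubled.

```python
import re

# The regex as you would type it into a tester's pattern field: \$\d+\.\d{2}
raw_literal = r"\$\d+\.\d{2}"          # raw string: backslashes stay as typed
escaped_literal = "\\$\\d+\\.\\d{2}"   # ordinary string: each backslash doubled

# Both literals hold exactly the same pattern text.
same_pattern = raw_literal == escaped_literal

# Hypothetical subject text, used only for this illustration.
subject = "The total is $5.00 today."
match = re.search(raw_literal, subject)
```

Languages without raw string literals leave you with only the second form, which is where a generated line of code saves the most typing.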
The second column displays a list of regex commands and regex options. These depend on the regex flavor. The regex commands typically include match, replace, and split operations. The regex options consist of common options such as case insensitivity, as well as implementation-specific options. These commands and options are described in Chapter 3.
http://www.nregex.com (Figure 1-6) is a straightforward online regex tester built on .NET technology by David Seruyange. It supports the .NET 2.0 regex flavor, which is also used by .NET 3.0, 3.5, and 4.0.
The layout of the page is somewhat confusing. Enter your regular expression into the field under the Regular Expression label, and set the regex options using the checkboxes below that. Enter your subject text in the large box at the bottom, replacing the default text “If I just had $5.00 then "she" wouldn't be so @#$! mad.” If your subject is a web page, type the URL in the Load Target From URL field, and click the Load button under that input field. If your subject is a file on your hard disk, click the Browse button, find the file you want, and then click the Load button under that input field.
Your subject text will appear duplicated in the “Matches & Replacements” field at the center of the web page, with the regex matches highlighted. If you type something into the Replacement String field, the result of the search-and-replace is shown instead. If your regular expression is invalid, ... appears.
The regex matching is done in .NET code running on the server, so you need to be online for the site to work. If the automatic updates are slow, perhaps because your subject text is very long, tick the Manually Evaluate Regex checkbox above the field for your regular expression to show the Evaluate button. Click that button to update the “Matches & Replacements” display.
Michael Lovitt put a minimalistic regex tester online at http://www.rubular.com (Figure 1-7). At the time of writing, it lets you choose between Ruby 1.8.7 and Ruby 1.9.2. This allows you to test both the Ruby 1.8 and Ruby 1.9 regex flavors used in this book.
Enter your regular expression in the box between the two forward slashes under “Your regular expression.” You can turn on case insensitivity by typing an i in the small box after the second slash. Similarly, if you like, turn on the option “the dot matches line breaks” by typing an m in the same box. Typing im turns on both options. Though these conventions may seem a bit user-unfriendly if you’re new to Ruby, they conform to the /regex/im syntax used to specify a regex in Ruby source code.
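For readers who know another language better than Ruby, the sketch below shows what the two options do, using Python's re module as a stand-in (an assumption made for illustration; Rubular itself runs Ruby). In Python, "the dot matches line breaks" is re.DOTALL, not re.MULTILINE, so Ruby's m option does not map onto Python's flag of the same letter.

```python
import re

# Made-up subject text with a line break in the middle.
subject = "First line\nSECOND line"
pattern = r"first.*second"

# No options: case-sensitive, and the dot stops at the line break.
no_options = re.search(pattern, subject)

# Counterpart of Ruby's "i" alone: case no longer matters,
# but the dot still cannot cross the line break.
only_i = re.search(pattern, subject, re.IGNORECASE)

# Counterpart of Ruby's "im": case-insensitive, and the dot matches line breaks.
both = re.search(pattern, subject, re.IGNORECASE | re.DOTALL)
```

Only the last search finds a match, because the regex needs both options to get from "First" on the first line to "SECOND" on the next.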
Type or paste your subject text into the “Your test string” box, and wait a moment. A new “Match result” box appears to the right, showing your subject text with all regex matches highlighted.
Sergey Evdokimov created several regular expression testers for Java developers. The home page at http://www.myregexp.com (Figure 1-8) offers an online regex tester. It’s a Java applet that runs in your browser. The Java 4 (or later) runtime needs to be installed on your computer. The applet uses the java.util.regex package to evaluate your regular expressions, which is new in Java 4. In this book, the “Java” regex flavor refers to this package.
Type your regular expression into the Regular Expression box. Use the Flags menu to set the regex options you want. Three of the options also have direct checkboxes.
If you want to test a regex that already exists as a string in Java code, copy the whole string to the clipboard. In the myregexp.com tester, click on the Edit menu, and then “Paste Regex from Java String.” In the same menu, pick “Copy Regex for Java Source” when you’re done editing the regular expression. The Edit menu has similar commands for JavaScript and XML as well.
Below the regular expression, there are four tabs that run four different tests:
Find: Highlights all regular expression matches in the sample text. These are the matches found by the Matcher.find() method in Java.
Match: Tests whether the regular expression matches the sample text entirely. If it does, the whole text is highlighted. This is what the String.matches() and Matcher.matches() methods do.
Split: The second box at the right shows the array of strings returned by String.split() or Pattern.split() when used with your regular expression and sample text.
Replace: Type in a replacement text, and the box at the right shows the text returned by String.replaceAll() or Matcher.replaceAll().
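The four tests correspond to four basic regex operations that most libraries offer. As a rough sketch, here they are in Python's re module rather than java.util.regex. The sample text and regex are invented for the example, and the Python functions are analogues of the Java methods named above, not the same API.

```python
import re

pattern = re.compile(r"\d+")
sample = "10 apples, 20 pears"

# Find all matches in the sample text (compare repeated Matcher.find() calls).
found = [m.group() for m in pattern.finditer(sample)]

# Test whether the regex matches the text entirely (compare Matcher.matches()).
entire_sample = pattern.fullmatch(sample) is not None
entire_digits = pattern.fullmatch("2012") is not None

# Split the text at each match (compare Pattern.split()).
pieces = pattern.split(sample)

# Replace every match (compare Matcher.replaceAll()).
replaced = pattern.sub("N", sample)
```

The difference between the first two operations is the one that most often surprises people: a regex can find matches inside a text without matching that text entirely.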
At the top of the page at http://www.myregexp.com, you can click the link to get Sergey’s regex tester as a plug-in for Eclipse.
Expresso (not to be confused with caffeine-laden espresso) is a .NET application for creating and testing regular expressions. You can download it at http://www.ultrapico.com/Expresso.htm. The .NET Framework 2.0 or later must be installed on your computer.
The download is a free 60-day trial. After the trial, you have to register or Expresso will (mostly) stop working. Registration is free, but requires you to give the Ultrapico folks your email address. The registration key is sent by email.
Expresso displays a screen like the one shown in Figure 1-9. The Regular Expression box where you type in your regular expression is permanently visible. No syntax highlighting is available. The Regex Analyzer box automatically builds a brief English-language analysis of your regular expression. It too is permanently visible.
In Design Mode, you can set matching options such as “Ignore Case” at the bottom of the screen. Most of the screen space is taken up by a row of tabs where you can select the regular expression token you want to insert. If you have two monitors or one large monitor, click the Undock button to float the row of tabs. Then you can build up your regular expression in the other mode (Test Mode) as well.
In Test Mode, type or paste your sample text in the lower-left corner. Then, click the Run Match button to get a list of all matches in the Search Results box. No highlighting is applied to the sample text. Click on a match in the results to select that match in the sample text.
The Expression Library shows a list of sample regular expressions and a list of recent regular expressions. Your regex is added to that list each time you press Run Match. You can edit the library through the Library menu in the main menu bar.
The Regulator, which you can download from http://sourceforge.net/projects/regulator/, is not safe for SCUBA diving or cooking-gas canisters; it is another .NET application for creating and testing regular expressions. The latest version requires .NET 2.0 or later. Older versions for .NET 1.x can still be downloaded. The Regulator is open source, and no payment or registration is required.
The Regulator does everything in one screen (Figure 1-10). The New Document tab is where you enter your regular expression. Syntax highlighting is automatically applied, but syntax errors in your regex are not made obvious. Right-click to select the regex token you want to insert from a menu. You can set regular expression options via the buttons on the main toolbar. The icons are a bit cryptic. Wait for the tool tip to see which option you’re setting with each button.
Below the area for your regex and to the right, click on the Input button to display the area for pasting in your sample text. Click the “Replace with” button to type in the replacement text, if you want to do a search-and-replace. Below the regex and to the left, you can see the results of your regex operation. Results are not updated automatically; you must click the Match, Replace, or Split button in the toolbar to update the results. No highlighting is applied to the input. Click on a match in the results to select it in the subject text.
The Regex Analyzer panel shows a simple English-language analysis of your regular expression, but it is not automatic or interactive. To update the analysis, select Regex Analyzer in the View menu, even if it is already visible. Clicking on the analysis only moves the text cursor.
SDL Regex Fuzzer’s fuzzy name does not make its purpose obvious. Microsoft bills it as “a tool to help test regular expressions for potential denial of service vulnerabilities.” You can download it for free at http://www.microsoft.com/en-us/download/details.aspx?id=20095. It requires .NET 3.5 to run.
What SDL Regex Fuzzer really does is to check whether there exists a subject string that causes your regular expression to execute in exponential time. In our book we call this “catastrophic backtracking.” We explain this in detail along with potential solutions in Recipe 2.15. Basically, a regex that exhibits catastrophic backtracking will cause your application to run forever or to crash. If your application is a server, that could be exploited in a denial-of-service attack.
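To make the idea concrete, here is a small Python sketch that times a failing match attempt for a regex with nested quantifiers. The pattern is a textbook illustration chosen for this example, not necessarily the one tested in the figure, and Python's re module stands in for .NET. The subject lengths are kept short on purpose so the code finishes quickly.

```python
import re
import time

# Nested quantifiers: the two x+ tokens can divide a run of x's in many ways,
# and the outer + multiplies those choices. Illustrative pattern only.
risky = re.compile(r"(x+x+)+y")

def time_failed_match(length):
    """Return the seconds spent failing to match `length` x's with no y."""
    subject = "x" * length
    start = time.perf_counter()
    result = risky.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # there is no "y", so the attempt must fail
    return elapsed

# Deliberately small lengths; much larger values can take a very long time.
timings = {n: time_failed_match(n) for n in (8, 12, 16)}
```

On a backtracking engine you would expect the time to grow steeply with each added character, because the engine tries every way of dividing the x's before giving up. The exact figures depend on the machine and the regex library.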
Figure 1-11 shows the results of a test in SDL Regex Fuzzer. In Step 1 we pasted in a regular expression from Recipe 2.15. Since this regex can never match non-ASCII characters, there’s no need to select that option in Step 2. Otherwise, we should have. We left Step 3 set to the default of 100 iterations. About five seconds after clicking the Start button in Step 4, SDL Regex Fuzzer showed a sample string that will cause our regex to fail in .NET 3.5.
Unfortunately, the usefulness of this tool is greatly limited because it only supports a small subset of the .NET regex syntax. When we tried to test the naïve solution from Recipe 2.15, which would definitely fail this test, we received the error message shown in Figure 1-12. Proper understanding of the concepts discussed in Recipe 2.15 is still the only way to make sure you don’t bring down your applications with overly complex regular expressions.
The name grep is derived from the g/re/p command that performed a regular expression search in the Unix text editor ed, one of the first applications to support regular expressions. This command was so popular that all Unix systems now have a dedicated grep utility for searching through files using a regular expression. If you’re using Unix, Linux, or OS X, type man grep into a terminal window to learn all about it.
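The core of what a grep tool does, printing the lines that contain a regex match, fits in a few lines of code. The following Python sketch is only an illustration of that idea. The function name grep_lines and its parameters are hypothetical, and real grep utilities add file handling and many options on top.

```python
import re

def grep_lines(pattern, lines, ignore_case=False):
    """Return (line_number, line) pairs for the lines containing a regex match."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]

# Made-up sample input; a real tool would read these lines from files.
sample = ["alpha 1", "beta", "Gamma 22", "delta"]
digit_hits = grep_lines(r"\d+", sample)
gamma_hits = grep_lines(r"gamma", sample, ignore_case=True)
```

To search a file, you could pass an open file object as lines, since iterating over a text file yields one line at a time.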
The following three tools are Windows applications that do what grep does, and more.
PowerGREP, developed by Jan Goyvaerts, one of this book’s authors, is probably the most feature-rich grep tool available for the Microsoft Windows platform (Figure 1-13). PowerGREP uses a custom regex flavor that combines the best of the flavors discussed in this book. This flavor is labeled “JGsoft” in RegexBuddy.
To run a quick regular expression search, simply select Clear in the Action menu and type your regular expression into the Search box on the Action panel. Click on a folder in the File Selector panel, and select “Include File or Folder” or “Include Folder and Subfolders” in the File Selector menu. Then, select Execute in the Action menu to run your search.
To run a search-and-replace, select “search-and-replace” in the “action type” drop-down list at the top-left corner of the Action panel after clearing the action. A Replace box will appear below the Search box. Enter your replacement text there. All the other steps are the same as for searching.
PowerGREP has the unique ability to use up to five lists of regular expressions at the same time, with any number of regular expressions in each list. While the previous two paragraphs provide all you need to run simple searches like you can in any grep tool, unleashing PowerGREP’s full potential will take a bit of reading through the tool’s comprehensive documentation.
PowerGREP runs on Windows 2000, XP, Vista, 7, and 8. You can download a free evaluation copy at http://www.powergrep.com/PowerGREPCookbook.exe. Except for saving results and libraries, the trial is fully functional for 15 days of actual use. Though the trial won’t save the results shown on the Results panel, it will modify all your files for search-and-replace actions, just like the full version does.
Windows Grep (http://www.wingrep.com) is one of the oldest grep tools for Windows. Its age shows a bit in its user interface (Figure 1-14), but it does what it says on the tin just fine. It supports a limited regular expression flavor called POSIX ERE. For the features that it supports, it uses the same syntax as the flavors in this book. Windows Grep is shareware, which means you can download it for free, but payment is expected if you want to keep it.
To prepare a search, select Search in the Search menu. The screen that appears differs depending on whether you’ve selected Beginner Mode or Expert Mode in the Options menu. Beginners get a step-by-step wizard, whereas experts get a tabbed dialog.
When you’ve set up the search, Windows Grep immediately executes it, presenting you with a list of files in which matches were found. Click once on a file to see its matches in the bottom panel, and double-click to open the file. Select “All Matches” in the View menu to make the bottom panel show everything.
To run a search-and-replace, select Replace in the Search menu.
RegexRenamer (Figure 1-15) is not really a grep tool. Instead of searching through the contents of files, it searches and replaces through the names of files. You can download it at http://regexrenamer.sourceforge.net. RegexRenamer requires version 2.0 or later of the Microsoft .NET Framework.
Type your regular expression into the Match box and the replacement text into the Replace box. Click /i to turn on case insensitivity, and /g to replace all matches in each filename rather than just the first. /x turns on free-spacing syntax, which isn’t very useful, since you have only one line to type in your regular expression.
Use the tree at the left to select the folder that holds the files you want to rename. You can set a file mask or a regex filter in the top-right corner. This restricts the list of files to which your search-and-replace regex will be applied. Using one regex to filter and another to replace is much handier than trying to do both tasks with just one regex.
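The approach of filtering with one regex and replacing with another can be sketched in a few lines of Python. This is not how RegexRenamer is implemented. The function plan_renames and its parameters are hypothetical, the filenames are invented, and the code only computes the new names without touching any files. Python replacement text also uses its own syntax for backreferences, which may differ from the tool's.

```python
import re

def plan_renames(names, match, replace, name_filter=None,
                 ignore_case=False, replace_all=False):
    """Map each old filename to its new name; unchanged names are left out."""
    flags = re.IGNORECASE if ignore_case else 0          # like /i
    regex = re.compile(match, flags)
    keep = re.compile(name_filter, flags) if name_filter else None
    plan = {}
    for name in names:
        if keep and not keep.search(name):
            continue                                     # excluded by the filter
        count = 0 if replace_all else 1                  # 0 means "all", like /g
        new_name = regex.sub(replace, name, count=count)
        if new_name != name:
            plan[name] = new_name
    return plan

files = ["IMG_001.JPG", "IMG_002.jpg", "notes.txt"]
plan = plan_renames(files, r"img_", "holiday-",
                    name_filter=r"\.jpg$", ignore_case=True)
```

Keeping the filter separate means the replacement regex does not also have to describe which files qualify, so each regex stays short and easy to check.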
Most modern text editors have at least basic support for regular expressions. In the search or search-and-replace panel, you’ll typically find a checkbox to turn on regular expression mode. Some editors, such as EditPad Pro, also use regular expressions for various features that process text, such as syntax highlighting or class and function lists. The documentation with each editor explains all these features. Some popular text editors with regular expression support include:
BBEdit (PCRE)
Boxer Text Editor (PCRE)
Dreamweaver (JavaScript)
EditPad Pro (custom flavor that combines the best of the flavors discussed in this book; labeled “JGsoft” in RegexBuddy)
Multi-Edit (PCRE, if you select the “Perl” option)
Nisus Writer Pro (Ruby 1.9 [Oniguruma])
Notepad++ (PCRE)
NoteTab (PCRE)
UltraEdit (PCRE)