Programming Languages and Regex Flavors

This chapter explains how to implement regular expressions with your programming language of choice. The recipes in this chapter assume you already have a working regular expression at your disposal; the previous chapters can help in that regard. Now you face the job of putting a regular expression into your source code and actually making it do something.

We’ve done our best in this chapter to explain exactly how and why each piece of code works the way it does. Because of the level of detail in this chapter, reading it from start to finish may get a bit tedious. If you’re reading Regular Expressions Cookbook for the first time, we recommend you skim this chapter to get an idea of what can or needs to be done. Later, when you want to implement one of the regular expressions from the following chapters, come back here to learn exactly how to integrate the regexes with your programming language of choice.

Chapters 4 through 9 use regular expressions to solve real-world problems. Those chapters focus on the regular expressions themselves, and many recipes in those chapters don’t show any source code at all. To make the regular expressions you find in those chapters work, simply plug them into one of the code snippets in this chapter.

Because the other chapters focus on regular expressions, they present their solutions for specific regular expression flavors, rather than for specific programming languages. Regex flavors do not correspond one-to-one with programming languages. Scripting languages tend to have their own regular expression flavor built in, and other programming languages rely on libraries for regex support. Some libraries are available for multiple languages, while certain languages have multiple libraries available for them.

“Many Flavors of Regular Expressions” describes all the regular expression flavors covered in this book. “Many Flavors of Replacement Text” lists the replacement text flavors, used for searching and replacing with a regular expression. All of the programming languages covered in this chapter use one of these flavors.

Languages Covered in This Chapter

This chapter covers eight programming languages. Each recipe has separate solutions for all eight programming languages, and many recipes also have separate discussions for all eight languages. If a technique applies to more than one language, we repeat it in the discussion for each of those languages. We’ve done this so you can safely skip the discussions of programming languages that you’re not interested in:

C#

C# uses the Microsoft .NET Framework. The System.Text.RegularExpressions classes use the “.NET” regular expression and replacement text flavor. This book covers C# 1.0 through 4.0, or Visual Studio 2002 until Visual Studio 2010.

VB.NET

This book uses VB.NET and Visual Basic.NET to refer to Visual Basic 2002 and later, to distinguish these versions from Visual Basic 6 and earlier. Visual Basic now uses the Microsoft .NET Framework. The System.Text.RegularExpressions classes use the “.NET” regular expression and replacement text flavor. This book covers Visual Basic 2002 until Visual Basic 2010.

Java

Java 4 is the first Java release to provide built-in regular expression support through the java.util.regex package. The java.util.regex package uses the “Java” regular expression and replacement text flavor. This book covers Java 4, 5, 6, and 7.

JavaScript

This is the regex flavor used in the programming language commonly known as JavaScript. All modern web browsers implement it: Internet Explorer (as of version 5.5), Firefox, Opera, Safari, and Chrome. Many other applications also use JavaScript as a scripting language.

Strictly speaking, in this book we use the term JavaScript to indicate the programming language defined in versions 3 and 5 of the ECMA-262 standard. This standard defines the ECMAScript programming language, which is better known through its implementations JavaScript and JScript in various web browsers.

ECMA-262v3 and ECMA-262v5 also define the regular expression and replacement text flavors used by JavaScript. Those flavors are labeled as “JavaScript” in this book.

XRegExp

XRegExp is an open source JavaScript library developed by Steven Levithan. You can download it at http://xregexp.com. XRegExp extends JavaScript’s regular expression syntax. XRegExp also provides replacement functions for JavaScript’s regex matching functions for better cross-browser consistency, as well as new higher-level functions that make tasks such as iterating over all matches easier.

Most recipes in this chapter do not have separate JavaScript and XRegExp solutions. You can use the standard JavaScript solutions with regular expressions created by XRegExp. In situations where XRegExp’s methods offer a significantly better solution, we show code for both standard JavaScript and JavaScript with XRegExp.

PHP

PHP has three sets of regular expression functions. We strongly recommend using the preg functions. Therefore, this book only covers the preg functions, which are built into PHP as of version 4.2.0. This book covers PHP 4 and 5. The preg functions are PHP wrappers around the PCRE library. The PCRE regex flavor is indicated as “PCRE” in this book. Since PCRE does not include search-and-replace functionality, the PHP developers devised their own replacement text syntax for preg_replace. This replacement text flavor is labeled “PHP” in this book.

The mb_ereg functions are part of PHP’s “multibyte” functions, which are designed to work well with languages that are traditionally encoded with multibyte character sets, such as Japanese and Chinese. In PHP 5, the mb_ereg functions use the Oniguruma regex library, which was originally developed for Ruby. The Oniguruma regex flavor is indicated as “Ruby 1.9” in this book. Using the mb_ereg functions is recommended only if you have a specific requirement to deal with multibyte code pages and you’re already familiar with the mb_ functions in PHP.

The ereg group of functions is the oldest set of PHP regex functions, and is officially deprecated as of PHP 5.3.0. These functions don’t depend on external libraries, and implement the POSIX ERE flavor. This flavor offers only a limited feature set, and is not discussed in this book. POSIX ERE is a strict subset of the Ruby 1.9 and PCRE flavors. You can take the regex from any ereg function call and use it with mb_ereg or preg. For preg, you have to add Perl-style delimiters (Recipe 3.1).

Perl

Perl’s built-in support for regular expressions is the main reason why regexes are popular today. The regular expression and replacement text flavors used by Perl’s m// and s/// operators are labeled as “Perl” in this book. This book covers Perl 5.6, 5.8, 5.10, 5.12, and 5.14.

Python

Python supports regular expressions through its re module. The regular expression and replacement text flavors used by this module are labeled “Python” in this book. This book covers Python 2.4 until 3.2.
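As a minimal sketch of what the re module looks like in use, the following compiles a regex, retrieves matches, and performs a search-and-replace. The sample text, the date pattern, and the variable names are hypothetical and chosen only for illustration; they are not taken from a recipe in this book.

```python
import re

# Hypothetical sample data and pattern, for illustration only.
text = "Order 66 shipped on 2011-08-15; order 67 ships on 2011-09-01."
date_regex = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Find the first match; search() returns None when nothing matches.
first = date_regex.search(text)
first_date = first.group(0) if first else None

# Iterate over all matches in the subject string.
all_dates = [m.group(0) for m in date_regex.finditer(text)]

# Search-and-replace. In the replacement text, \1, \2, and \3
# refer to the text matched by the three capturing groups.
reordered = date_regex.sub(r"\3/\2/\1", text)
```

Compiling the regex once with re.compile() and reusing the resulting object is a design choice, not a requirement; module-level functions such as re.search() and re.sub() accept the pattern as a string instead.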

Ruby

Ruby has built-in support for regular expressions. This book covers Ruby 1.8 and Ruby 1.9. These two versions of Ruby have different default regular expression engines. Ruby 1.9 uses the Oniguruma engine, which has more regex features than the classic engine in Ruby 1.8. “Regex Flavors Covered by This Book” has more details on this.

In this chapter, we don’t talk much about the differences between Ruby 1.8 and 1.9. The regular expressions in this chapter are very basic, and they don’t use the new features in Ruby 1.9. Because the regular expression support is compiled into the Ruby language itself, the Ruby code you use to implement your regular expressions is the same, regardless of whether you’ve compiled Ruby using the classic regex engine or the Oniguruma engine. You could recompile Ruby 1.8 to use the Oniguruma engine if you need its features.

More Programming Languages

The programming languages in the following list aren’t covered by this book, but they do use one of the regular expression flavors in this book. If you use one of these languages, you can skip this chapter, but all the other chapters are still useful:

ActionScript

ActionScript is Adobe’s implementation of the ECMA-262 standard. As of version 3.0, ActionScript has full support for ECMA-262v3 regular expressions. This regex flavor is labeled “JavaScript” in this book. The ActionScript language is also very close to JavaScript. You should be able to adapt the JavaScript examples in this chapter for ActionScript.

C

C can use a wide variety of regular expression libraries. The open source PCRE library is likely the best choice out of the flavors covered by this book. You can download the full C source code at http://www.pcre.org. The code is written to compile with a wide range of compilers on a wide range of platforms.

C++

C++ can use a wide variety of regular expression libraries. The open source PCRE library is likely the best choice out of the flavors covered by this book. You can either use the C API directly or use the C++ class wrappers included with the PCRE download itself (see http://www.pcre.org).

On Windows, you could import the VBScript 5.5 RegExp COM object, as explained later for Visual Basic 6. That could be useful for regex consistency between a C++ backend and a JavaScript frontend.

C++ TR1 specifies a <regex> header file that declares functions such as regex_search(), regex_match(), and regex_replace(). You can use these to search through strings, validate strings, and search-and-replace through strings with regular expressions. The regular expression support in C++ TR1 is based on the Boost.Regex library. You can use the Boost.Regex library if your C++ compiler does not support TR1. You can find full documentation at http://www.boost.org/libs/regex/.

Delphi

Delphi XE was the first version of Delphi to have built-in support for regular expressions. The regex features are unchanged in Delphi XE2. The RegularExpressionsAPI unit is a thin wrapper around the PCRE library. You won’t use this unit directly.

The RegularExpressionsCore unit implements the TPerlRegEx class. It provides a full set of methods to search, replace, and split strings using regular expressions. It uses the UTF8String type for all strings, as PCRE is based on UTF-8. You can use the TPerlRegEx class in situations where you want full control over when strings are converted to and from UTF-8, or if your data is in UTF-8 already. You can also use this unit if you’re porting code from an older version of Delphi that used Jan Goyvaerts’s TPerlRegEx class. The RegularExpressionsCore unit is based on code that Jan Goyvaerts donated to Embarcadero.

The RegularExpressions unit is the one you’ll use most for new code. It implements records such as TRegex and TMatch that have names and methods that closely mimic the regular expression classes in the .NET Framework. Because they’re records, you don’t have to worry about explicitly creating and destroying them. They provide many static methods that allow you to use a regular expression with just a single line of code.

If you are using an older version of Delphi, your best choice is Jan Goyvaerts’s TPerlRegEx class. You can download the full source code at http://www.regexp.info/delphi.html. It is open source under the Mozilla Public License. The latest release of TPerlRegEx is fully compatible with the RegularExpressionsCore unit in Delphi XE. For new code written in Delphi 2010 or earlier, using the latest release of TPerlRegEx is strongly recommended. If you later migrate your code to Delphi XE, all you have to do is replace PerlRegEx with RegularExpressionsCore in the uses clause of your units. When compiled with Delphi 2009 or Delphi 2010, the PerlRegEx unit uses UTF8String and fully supports Unicode. When compiled with Delphi 2007 or earlier, the unit uses AnsiString and does not support Unicode.

Another popular PCRE wrapper for Delphi is the TJclRegEx class, part of the JCL library at http://www.delphi-jedi.org. It is also open source under the Mozilla Public License.

Delphi Prism

In Delphi Prism, you can use the regular expression support provided by the .NET Framework. Simply add System.Text.RegularExpressions to the uses clause of any Delphi Prism unit in which you want to use regular expressions.

Once you’ve done that, you can use the same techniques shown in the C# and VB.NET code snippets in this chapter.

Groovy

You can use regular expressions in Groovy with the java.util.regex package, just as you can in Java. In fact, all of the Java solutions in this chapter should work with Groovy as well. Groovy’s own regular expression syntax merely provides notational shortcuts. A literal regex delimited with forward slashes is an instance of java.lang.String and the =~ operator instantiates java.util.regex.Matcher. You can freely mix the Groovy syntax with the standard Java syntax—the classes and objects are all the same.

PowerShell

PowerShell is Microsoft’s shell-scripting language, based on the .NET Framework. PowerShell’s built-in -match and -replace operators use the .NET regex flavor and replacement text as described in this book.

R

The R Project supports regular expressions via the grep, sub, and regexpr functions in the base package. All these functions take an argument labeled perl, which is FALSE if you omit it. Set it to TRUE to use the PCRE regex flavor as described in this book. The regular expressions shown for PCRE 7 work with R 2.5.0 and later. For earlier versions of R, use the regular expressions marked as “PCRE 4 and later” in this book. The “basic” and “extended” flavors supported by R are older and limited regex flavors not discussed in this book.

REALbasic

REALbasic has a built-in RegEx class. Internally, this class uses the UTF-8 version of the PCRE library. This means that you can use PCRE’s Unicode support, but you have to use REALbasic’s TextConverter class to convert non-ASCII text into UTF-8 before passing it to the RegEx class.

All regular expressions shown in this book for PCRE 7 will work with REALbasic 2011. One caveat is that in REALbasic, the “case insensitive” (Regex.Options.CaseSensitive) and “^ and $ match at line breaks” (Regex.Options.TreatTargetAsOneLine) options are on by default. If you want to use a regular expression from this book that does not tell you to turn on these matching modes, you have to turn them off explicitly in REALbasic.

Scala

Scala provides built-in regex support through the scala.util.matching package. This support is built on the regular expression engine in Java’s java.util.regex package. The regular expression and replacement text flavors used by Java and Scala are labeled “Java” in this book.

Visual Basic 6

Visual Basic 6 is the last version of Visual Basic that does not require the .NET Framework. That also means Visual Basic 6 cannot use the excellent regular expression support of the .NET Framework. The VB.NET code samples in this chapter won’t work with VB 6 at all.

Visual Basic 6 does make it very easy to use the functionality provided by ActiveX and COM libraries. One such library is Microsoft’s VBScript scripting library, which has decent regular expression capabilities starting with version 5.5. The scripting library implements the same regular expression flavor used in JavaScript, as standardized in ECMA-262v3. This library is part of Internet Explorer 5.5 and later. It is available on all computers running Windows XP or Vista, and previous versions of Windows if the user has upgraded to IE 5.5 or later. That includes almost every Windows PC that is used to connect to the Internet.

To use this library in your Visual Basic application, select Project|References in the VB IDE’s menu. Scroll down the list to find the item “Microsoft VBScript Regular Expressions 5.5”, which is immediately below the “Microsoft VBScript Regular Expressions 1.0” item. Make sure to tick the 5.5 version. The 1.0 version is only provided for backward compatibility, and its capabilities are less than satisfactory.

After adding the reference, you can see which classes and class members the library provides. Select View|Object Browser in the menu. In the Object Browser, select the “VBScript_RegExp_55” library in the drop-down list in the upper-left corner.
