Regular Expressions Cookbook, 2nd Edition
by Steven Levithan
Published by
O'Reilly Media, Inc., 2012
| Regex options: Free-spacing, case insensitive |
| Regex flavors: .NET, Java 7, PCRE 7, Perl 5.10, Ruby 1.9 |
\A
[a-z][a-z0-9+\-.]*://                # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+@)?       # User
([a-z0-9\-._~%]+                     # Named or IPv4 host
|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])    # IPv6+ host
:(?P<port>[0-9]+)                    # Port number
| Regex options: Free-spacing, case insensitive |
| Regex flavors: PCRE, Perl 5.10, Python |
^[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&'()*+,;=:]+\]):([0-9]+)
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
\A
[a-z][a-z0-9+\-.]*://                        # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+@)?               # User
([a-z0-9\-._~%]+                             # Named host
|\[[a-f0-9:.]+\]                             # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])   # IPvFuture host
:([0-9]+)                                    # Port
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?           # Path
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?          # Query
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?          # Fragment
\Z
| Regex options: Free-spacing, case insensitive |
| Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
^[a-z][a-z0-9+\-.]*:\/\/([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|\[[a-f0-9:.]+\]|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\]):([0-9]+)(\/[a-z0-9\-._~%!$&'()*+,;=:@]+)*\/?(\?[a-z0-9\-._~%!$&'()*+,;=:@\/?]*)?(#[a-z0-9\-._~%!$&'()*+,;=:@\/?]*)?$
| Regex options: Case insensitive |
| Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python |
Extracting the port number from a URL is easy if you already know
that your subject text is a valid URL. We use ‹\A› or ‹^› to anchor the match to the start of the string.
‹[a-z][a-z0-9+\-.]*://›
skips over the scheme, and ‹([a-z0-9\-._~%!$&'()*+,;=]+@)?› skips over the
optional user. ‹([a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])›
skips over the hostname.
The port number is separated from the hostname with a colon, which
we add as a literal character to the regular expression. The port number
itself is simply a string of digits, easily matched with ‹[0-9]+›.
This regex will find a match only if the URL actually specifies a port number. When it does, the regex will match the scheme, user, host, and port number parts of the URL. When the regex finds a match, you can retrieve the text matched by the third capturing group to get the port number without any delimiters or other URL parts.
The other two groups are used to make the username optional, and to keep the two alternatives for the hostname together. Recipe 2.9 tells you all about capturing groups. See Recipe 3.9 to learn how to retrieve text matched by capturing groups in your favorite programming language.
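As a rough illustration of retrieving the third capturing group, here is a minimal Python sketch. It assumes the single-line, case-insensitive form of the regex; the helper name extract_port and the sample URLs are hypothetical and not part of the recipe.

```python
import re

# Single-line form of the recipe's regex, split across string literals
# only for readability. Group 1 is the optional user, group 2 is the
# host, and group 3 holds the port digits.
PORT_RE = re.compile(
    r"^[a-z][a-z0-9+\-.]*://"
    r"([a-z0-9\-._~%!$&'()*+,;=]+@)?"
    r"([a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])"
    r":([0-9]+)",
    re.IGNORECASE,
)


def extract_port(url):
    """Return the port as a string, or None if the URL specifies no port.

    Hypothetical helper: it assumes the caller already knows that
    `url` is a valid URL, as the discussion above requires.
    """
    match = PORT_RE.match(url)
    return match.group(3) if match else None


print(extract_port("http://www.example.com:8080/index.html"))
print(extract_port("ftp://user@[2001:db8::1]:21/"))
print(extract_port("http://www.example.com/"))
```

The port comes back as text, so a caller that needs a number would still have to convert it with int().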
If you don’t already know that your subject text is a valid URL, you can use a simplified version of the regex from Recipe 8.7. Since we want to extract the port number, we can exclude URLs that don’t specify a port number. This makes the regular expression quite a bit simpler. It’s very similar to the one we used in Recipe 8.10.
The only difference is that this time the port number isn’t optional, and we moved the port number’s capturing group to exclude the colon that separates the port number from the hostname. The capturing group’s number is 3.
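The validating variant could be used along the following lines. This is a sketch in Python, assuming the free-spacing form of the regex compiled with re.VERBOSE and re.IGNORECASE; the function name extract_port_strict and the sample inputs are hypothetical.

```python
import re

# Free-spacing form of the validating regex. The \A and \Z anchors make
# the pattern match only when the whole subject string is a URL that
# includes a port. Group 3 holds the port digits.
STRICT_PORT_RE = re.compile(
    r"""
    \A
    [a-z][a-z0-9+\-.]*://                        # Scheme
    ([a-z0-9\-._~%!$&'()*+,;=]+@)?               # User
    ([a-z0-9\-._~%]+                             # Named host
    |\[[a-f0-9:.]+\]                             # IPv6 host
    |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])   # IPvFuture host
    :([0-9]+)                                    # Port
    (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?           # Path
    (\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?          # Query
    (\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?          # Fragment
    \Z
    """,
    re.VERBOSE | re.IGNORECASE,
)


def extract_port_strict(text):
    """Return the port only if `text` as a whole matches the URL pattern."""
    match = STRICT_PORT_RE.match(text)
    return match.group(3) if match else None


print(extract_port_strict("https://user@example.com:8443/path/to/page?x=1#top"))
print(extract_port_strict("http://example.com:80 and more"))
print(extract_port_strict("http://example.com/"))
```

Because of the closing anchor, trailing text after the URL makes the match fail, whereas the shorter regex would still report the port in that case.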
If you want a regex that matches any valid URL, including those that don’t specify the port, you can use one of the regexes from Recipe 8.7. The first regex in that recipe captures the port, if present, in the fifth capturing group.
Recipe 3.9 shows code to get the text matched by a particular part (capturing group) of a regex. Use this to get the port number.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.11 explains named capturing groups. Recipe 2.12 explains repetition. Recipe 2.18 explains how to add comments.