Searching for Strings
Searching for text in strings is a common activity, and the built-in string method find() is all you need for simple searches. It returns the position (offset) of the first match, or -1 if the text is not found.
>>> txt="The quick brown fox jumps over the lazy dog"
>>> txt.find('jump')
20
>>> txt.find('z')
37
>>> txt.find('green')
-1
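find() also accepts an optional start offset, which makes it possible to step through every occurrence of a substring rather than only the first. The following is a minimal sketch; the helper name find_all is our own and not a built-in.

```python
def find_all(text, sub):
    """Return the offsets of every occurrence of sub in text."""
    offsets = []
    pos = text.find(sub)
    while pos != -1:
        offsets.append(pos)
        # resume the search one character past the previous hit
        pos = text.find(sub, pos + 1)
    return offsets

txt = "The quick brown fox jumps over the lazy dog"
positions = find_all(txt, 'o')
print(positions)
```

An empty list comes back when the substring does not occur at all, which mirrors the -1 returned by find().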
More Complex Searches
There are often circumstances when the search is not so simple. Rather than a simple string, we need to look for a pattern and extract the information we really want from the matched text. Suppose, for example, that we wanted to extract the URLs of all the links on a web page. Here are some example lines of HTML text from a real web page.
1 <link rel="alternate" type="application/rss+xml" title="RSS: 40 newest packages" href="https://pypi.python.org/pypi?:action=packages_rss"/>
2 <link rel="stylesheet" media="screen" href="/static/styles/screen-switcher-default.css" type="text/css"/>
3 <li><a class="" href="/pypi?%3Aaction=browse">Browse packages</a></li>
4 <li><a href="http://wiki.python.org/moin/CheeseShopTutorial">PyPI Tutorial</a></li>
There is quite a lot going on in the text here.
- Line 1 refers to an RSS feed.
- Line 2 has an href attribute, but it refers to a Cascading Style Sheets (CSS) file, not a link.
- Line 3 is a true link but the URL is relative; it doesn’t contain the web site part of the URL.
- Line 4 is a link to an external site.
How can we hope to use some software to find the links that we care about? Well, this is where regular expressions come in.
Introducing Regular Expressions¹
Regular expressions² are a way of using pattern matching to find the text we are interested in. Not only are patterns matched, but the re module can extract the data we really want out of the matched text.
Many more examples could be written, and in fact there are whole books written about regular expressions (e.g., [16], [17]). There are many web sites, but the most useful is probably http://www.regular-expressions.info.
Note
A regex is a string containing both text and special characters that define a pattern that the re functions can use for matching.
Simple Searches
The simplest regex is a text string that you want to find in another string, as shown in Table 10-1.
Table 10-1. Finding a Simple String
| Regex | String Matched |
|---|---|
| jumps | jumps |
| The Queen | The Queen |
| Pqr123 | Pqr123 |
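In Python, a literal regex like those in Table 10-1 can be tried out with re.search(). This is a minimal sketch that reuses the sentence from the find() example; the variable names are ours.

```python
import re

txt = "The quick brown fox jumps over the lazy dog"

# re.search returns a match object on success, or None when nothing matches
m = re.search('jumps', txt)
if m:
    print(m.group(), m.start())   # the matched text and its offset

# a literal that does not occur in the text
missing = re.search('The Queen', txt)
print(missing)
```

For a plain literal like this, the offset reported by m.start() is the same value that txt.find('jumps') would return.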
Using Special Characters
There are special characters, listed in Table 10-2, that influence how the match is to be performed.
Table 10-2. Using Special Characters
| Symbols | Description | Example |
|---|---|---|
| literal | Match a literal string | Jumps |
| re1\|re2 | Match string re1 OR re2 | Yes\|No |
| . | Match any single character (except \n) | J.mps |
| ^ | Match start of string | ^The |
| $ | Match end of string | well$ |
| * | Match 0 or more occurrences of preceding regex | [A-Z]* |
| + | Match 1 or more occurrences of preceding regex | [A-Z]+ |
| ? | Match 0 or 1 occurrences of preceding regex | [a-z0-9]? |
| {m,n} | Match between m and n occurrences of the preceding regex (n optional) | [0-9]{2,4} |
| [...] | Match any character from character class | [aeiou] |
| [x-y] | Match any character from range | [0-9], [A-Za-z] |
| [^...] | Match any character not in character class | [^aeiou] |
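A few of the symbols in Table 10-2 can be tried out directly with the re module. The sketch below uses sample strings of our own choosing; only the symbols themselves come from the table.

```python
import re

txt = "The quick brown fox jumps over the lazy dog"

starts_with_the = bool(re.search(r'^The', txt))      # ^ anchors at the start
ends_with_dog = bool(re.search(r'dog$', txt))        # $ anchors at the end
any_char = re.findall(r'j.mps', txt)                 # . matches any one character
either = re.findall(r'Yes|No', 'Yes, No, Maybe')     # | means OR
capitals = re.findall(r'[A-Z]+', 'The BBC and ITV')  # + means one or more
digits = re.findall(r'[0-9]{2,4}', 'call 7, 42 or 123456')  # 2 to 4 digits
non_vowels = re.findall(r'[^aeiou]', 'queue')        # ^ inside [] negates the class

print(starts_with_the, ends_with_dog)
print(any_char, either, capitals)
print(digits, non_vowels)
```

Note how {2,4} is greedy: the six-digit run is consumed as a four-digit match followed by a two-digit match, while the single digit 7 is skipped.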
There are also special sequences, listed in Table 10-3, that each match a whole class of characters.
Table 10-3. Searching with Special Characters
| Special Character | Description | Example |
|---|---|---|
| \d | Match any decimal digit | BBC\d |
| \w | Match any alphanumeric character (or underscore) | Radio\w+ |
| \s | Match any whitespace character | The\sBBC |
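The three examples in Table 10-3 can be checked against a short test sentence. This is a sketch only; the sentence is invented for illustration.

```python
import re

text = "Listen to BBC4 and Radio4Extra on The BBC"

bbc_digit = re.findall(r'BBC\d', text)      # BBC followed by one digit
radio_word = re.findall(r'Radio\w+', text)  # Radio followed by word characters
the_bbc = re.findall(r'The\sBBC', text)     # The, one whitespace character, BBC

print(bbc_digit)
print(radio_word)
print(the_bbc)
```

The final "BBC" in the sentence is not reported by the first regex because no digit follows it.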
Table 10-4 gives some examples of regular expressions and the strings that they would match.
Table 10-4. Regular Expressions and Matching Strings
| Regex | String(s) Matched |
|---|---|
| smith\|jones | smith, jones |
| UNE..O | Any two characters between UNE and O; e.g., UNESCO, UNEzyO, UNE99O |
| ^The | Any string that starts with The |
| end$ | Any string that ends with end |
| c[aiou]t | cat, cit, cot, cut |
| [dg][io][gp] | dig, dip, dog, dop, gig, gip, gog, gop |
| [a-d][e-i] | Two characters: a, b, c, or d followed by e, f, g, h, or i |
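One way to convince yourself that the entries in Table 10-4 behave as described is to test each regex against its listed strings with re.fullmatch(), which succeeds only if the whole string matches. In this sketch the strings for the last regex (ae, bf, di) are our own examples.

```python
import re

# Each regex from Table 10-4 paired with strings it should match in full
examples = {
    r'smith|jones': ['smith', 'jones'],
    r'UNE..O': ['UNESCO', 'UNEzyO', 'UNE99O'],
    r'c[aiou]t': ['cat', 'cit', 'cot', 'cut'],
    r'[dg][io][gp]': ['dig', 'dip', 'dog', 'dop', 'gig', 'gip', 'gog', 'gop'],
    r'[a-d][e-i]': ['ae', 'bf', 'di'],   # illustrative strings, not from the table
}

# True for a regex when every one of its strings matches in full
results = {
    regex: all(re.fullmatch(regex, s) for s in strings)
    for regex, strings in examples.items()
}

for regex, ok in results.items():
    print(regex, 'OK' if ok else 'FAILED')

# a string outside the character class is rejected
print(re.fullmatch(r'c[aiou]t', 'cet'))
```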
Note
Regexes can use any combination of text and special characters, so they can look extremely complicated sometimes. Start simple.
Finding Patterns in Text
Finding substrings in text is fine, but often we want to find patterns in text, rather than literal strings. Suppose we wanted to extract numeric values, phone numbers, or web site URLs from text. How do we do that? This is where the real power of regular expressions lies.
Here is an example regex:
\s[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}[\s]
Can you guess what it might find? It is a regex for finding e-mail addresses in text. At first glance, this looks pretty daunting, so let’s break it down into its constituent parts.³ Note that the regex refers only to uppercase letters (to keep the regex short), so it assumes that the string to be searched has already been uppercased.
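There are seven elements to this regex:
| Element | Regex | Description |
|---|---|---|
| 1 | \s | Leading whitespace |
| 2 | [A-Z0-9._%+-]+ | One or more characters |
| 3 | @ | @ character |
| 4 | [A-Z0-9.-]+ | One or more characters from A-Z, 0-9, dot, or hyphen |
| 5 | \. | Dot character |
| 6 | [A-Z]{2,4} | 2 to 4 text characters |
| 7 | [\s] | Trailing whitespace |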
Obviously, you need to know the rules for the pattern you search for and there are specific rules for the construction of e-mail addresses.
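To see how the elements fit together, the example regex can be assembled from its parts and tried on a short uppercased string. This is a sketch only; the sample string and variable names are ours.

```python
import re

# The elements of the e-mail regex, in order
parts = [
    r'\s',               # leading whitespace
    r'[A-Z0-9._%+-]+',   # one or more characters before the @
    r'@',                # the @ character
    r'[A-Z0-9.-]+',      # one or more domain characters
    r'\.',               # a literal dot
    r'[A-Z]{2,4}',       # 2 to 4 letters
    r'[\s]',             # trailing whitespace
]
regex = ''.join(parts)

sample = ' WRITE TO AS@EXAMPLE.COM TODAY '
match = re.search(regex, sample)

print(regex)
print(repr(match.group()))   # repr() makes the surrounding whitespace visible
```

Because the first and last elements consume whitespace, the matched text includes the space on either side of the address.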
Here is the file remail.py.
1 import re # The RegEx library
2 #
3 # our regular expression (to find e-mails)
4 # and text to search
5 #
6 regex = r'\s[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}[\s]'
7 text="""This is some text with x@y.z embedded emails
8 that we'll use as@example.com
9 some lines have no email addresses
10 others@have.two valid email@addresses.com
11 The re module is awonderful@thing."""
12 print('** Search text ***\n'+text)
13 print('** Regex ***\n'+regex+'\n***')
14 #
15 # uppercase our text
16 utext=text.upper()
17 #
18 # perform a search (any emails found?)
19 s = re.search(regex,utext)
20 if s:
21     print('*** At least one email found "'+s.group()+'"')
22 #
23 # now, find all matches
24 #
25 m = re.findall(regex,utext)
26 if m:
27     for match in m:
28         print('Match found',match.strip())
- Line 1 imports the re module.
- Lines 6 through 13 define the text string to search and the regex we will use, then print them both.
- Line 16 uppercases the text.
- Lines 19 through 21 perform the simple search for the first (any) e-mail and print the result. Note that a match contains leading and trailing whitespace.
- Lines 25 through 28 find all matches in the text and print the results.
Note that the regex matches the e-mail address and the whitespace on either side of it. In line 21 we print the match including its leading space and trailing newline, but in line 28 we strip off the spare characters.
What do we get when we run this code? Here is the result.
D:\LeanPython\programs\Python3>python remail.py
** Search text ***
This is some text with x@y.z embedded emails
that we'll use as@example.com
some lines have no email addresses
others@have.two valid email@addresses.com
The re module is awonderful@thing.
** Regex ***
\s[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}[\s]
***
*** At least one email found " AS@EXAMPLE.COM
"
Match found AS@EXAMPLE.COM
Match found OTHERS@HAVE.TWO
Match found EMAIL@ADDRESSES.COM
Capturing Parentheses
One more aspect we should mention is the use of parentheses. They can be searched for, like any other character, but they can also be used to delineate substrings that are matched, and the re module can capture these substrings and place them in a list returned by the search process. These so-called capturing parentheses feature in the following example and provide the URLs we want to extract from a page of HTML.
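As a small illustration of capturing parentheses before the larger HTML example, the sketch below wraps the two halves of an uppercased e-mail address in parentheses. The regex is a cut-down variant of the e-mail pattern shown earlier, without the whitespace boundaries, and the sample text is ours.

```python
import re

# Two pairs of capturing parentheses: one around the name, one around the domain
regex = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+\.[A-Z]{2,4})'
utext = 'CONTACT AS@EXAMPLE.COM OR EMAIL@ADDRESSES.COM'

# search(): group(0) is the whole match, group(1) and group(2) the captured parts
s = re.search(regex, utext)
print(s.group(0), s.group(1), s.group(2))

# findall(): with more than one group, each match is a tuple of the captured parts
pairs = re.findall(regex, utext)
print(pairs)
```

When a regex contains capturing parentheses, findall() returns only the captured substrings rather than the whole match, which is what makes it convenient for pulling out just the data of interest.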
Finding Links in HTML
The following program downloads a single web page using the urllib library. The text of the downloaded HTML content is then searched using a complicated regular expression that extracts text links and provides the URL and the text of the link as seen by the user.
This program is called regex.py.
import urllib.request
import re # The RegEx library
#
# this code opens a connection to the leanpy.com website
#
response = urllib.request.urlopen('http://leanpy.com')
# decode the response bytes to text (assumes the page is UTF-8 encoded)
data1 = response.read().decode('utf-8')
#
# our regular expression (to find links)
#
regex = r'<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>'
#
# compile the regex and perform the match (find all);
# re.DOTALL lets the link text span more than one line
#
pm = re.compile(regex, re.DOTALL)
matches = pm.findall(data1)
#
# matches is a list of tuples, one per link
# m[0] - the url of the link
# m[1] - text associated with the link
#
for m in matches:
    ms=''.join(('Link: "',m[0],'" Text: "',m[1],'"'))
    print(ms)
The output of this program is shown here.
1 D:\LeanPython\programs>python regex.py
2 200 OK
3 Link: "http://leanpy.com/" Text: "Lean Python
"
4 Link: "#content" Text: "Skip to content"
5 Link: "http://leanpy.com/" Text: "Home"
6 Link: "http://leanpy.com/?page_id=33" Text: "About Lean Python
"
7 Link: "http://leanpy.com/" Text: "<img src="http://leanpy.com/wp-content/uploads/2014/04/cropped-LeanPythonHeader.jpg" class="header-image" width="950" height="247" alt="" />"
8 Link: "http://leanpy.com/?p=1" Text: "The Lean Python
Pocketbook"
9 Link: "http://leanpy.com/?p=1#respond" Text: "<span class="leave-reply">Leave a reply</span>"
10 Link: "http://leanpy.com/wp-content/uploads/2014/04/OnePieceCover1-e1396444631642.jpg" Text: "<img class="wp-image-17 alignleft" alt="OnePieceCover" src="http://leanpy.com/wp-content/uploads/2014/04/OnePieceCover1-e1396444631642-633x1024.jpg" width="305" height="491" />"
11 Link: "http://leanpy.com/?cat=3" Text: "Lean Python
Book"
12 Link: "http://leanpy.com/?tag=book" Text: "Book"
13 Link: "http://leanpy.com/?p=1" Text: "<time class="entry-date" datetime="2014-04-02T12:06:06+00:00">April 2, 2014</time>"
14 Link: "http://leanpy.com/?author=1" Text: "paulg"
15 Link: "http://leanpy.com/?p=1" Text: "The Lean Python
Pocketbook"
16 Link: "http://leanpy.com/?cat=3" Text: "Lean Python
Book"
17 Link: "http://leanpy.com/wp-login.php?action=register" Text: "Register"
18 Link: "http://leanpy.com/wp-login.php" Text: "Log in"
19 Link: "http://leanpy.com/?feed=rss2" Text: "Entries <abbr title="Really Simple Syndication">RSS</abbr>"
20 Link: "http://leanpy.com/?feed=comments-rss2" Text: "Comments <abbr title="Really Simple Syndication">RSS</abbr>"
21 Link: "http://wordpress.org/" Text: "WordPress.org"
22 Link: "http://wordpress.org/" Text: "Proudly powered by WordPress"
You can see that the program identifies all the links, but isn’t yet as smart as we might like.
- Line 4: This link uses a bookmark to the same page.
- Line 7: The link text is actually an image (do we need to worry about that?).
Perhaps you could improve on the regex used, as an exercise.
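One possible direction for that exercise is sketched below. It keeps the same link regex but post-processes the matches: same-page bookmarks are skipped, nested tags are removed from the link text, and relative URLs are resolved with urllib.parse.urljoin(). To stay runnable offline it works on a short HTML fragment modeled on the lines shown in this chapter. The function name extract_links, the fragment itself, and the base URL are assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

LINK_RE = re.compile(r'<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>', re.DOTALL)
TAG_RE = re.compile(r'<[^>]+>')   # any HTML tag

def extract_links(html, base_url):
    """Return (absolute_url, visible_text) pairs for the <a> links in html."""
    links = []
    for url, text in LINK_RE.findall(html):
        if url.startswith('#'):
            continue  # skip bookmarks within the same page
        text = TAG_RE.sub('', text).strip()  # drop nested tags such as <img>
        links.append((urljoin(base_url, url), text))
    return links

# Sample fragment (hypothetical, modeled on the examples in this chapter)
html = '''<link rel="stylesheet" media="screen" href="/static/styles/screen-switcher-default.css" type="text/css"/>
<li><a class="" href="/pypi?%3Aaction=browse">Browse packages</a></li>
<li><a href="http://wiki.python.org/moin/CheeseShopTutorial">PyPI Tutorial</a></li>
<a href="#content">Skip to content</a>
<a href="http://leanpy.com/"><img src="header.jpg" alt="" /></a>'''

# The base URL is an assumption: the site the fragment was taken from
links = extract_links(html, 'https://pypi.python.org/')
for url, text in links:
    print('Link: "' + url + '" Text: "' + text + '"')
```

The image-only link is kept but ends up with empty text; whether to drop such links or substitute the image's alt text is a design decision left open here. For anything beyond simple pages, a real HTML parser is likely to be more robust than a regex.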
Footnotes
1 The full documentation of the Python re module can be found at https://docs.python.org/3/library/re.html. Regular expressions are an advanced topic in any programming language.
3 Note that this e-mail finder regex is not perfect. It would not find an address at the start of a string and it would ignore e-mail addresses with more than four characters in the trailing element (e.g., '.mobile').