Paul Gerrard, Lean Python, 10.1007/978-1-4842-2385-7_10

10. Searching

Paul Gerrard (Maidenhead, Berkshire, UK)

Searching for Strings

Searching for text in strings is a common activity, and the built-in string method find() is all you need for simple searches. It returns the position (offset) of the first occurrence of the search text, or -1 if the text is not found.

>>> txt = "The quick brown fox jumps over the lazy dog"
>>> txt.find('jump')
20
>>> txt.find('z')
37
>>> txt.find('green')
-1
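Because find() signals failure with -1 rather than raising an exception, the result should be compared with -1 before it is used as an index. The following is a minimal sketch of that check, and of the optional start argument of find(), which can be used to locate every occurrence of a substring. The helper name find_all is made up for this illustration; it is not part of the standard library.

```python
txt = "The quick brown fox jumps over the lazy dog"

# find() returns -1 when the substring is absent, so test for that explicitly
pos = txt.find('fox')
if pos != -1:
    print('Found "fox" at offset', pos)


def find_all(text, sub):
    """Return the offsets of every non-overlapping occurrence of sub.

    Hypothetical helper written for this example.
    """
    if not sub:
        return []  # an empty search string would never advance
    offsets = []
    start = text.find(sub)
    while start != -1:
        offsets.append(start)
        # resume the search just past the previous hit
        start = text.find(sub, start + len(sub))
    return offsets


# lowercase the text first so that "The" and "the" are both found
the_offsets = find_all(txt.lower(), 'the')
print(the_offsets)
```

The second argument to find() is the offset at which the search begins, which is what lets the loop step through the string one hit at a time.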

Introducing Regular Expressions¹

Regular expressions² are a way of using pattern matching to find the text we are interested in. Patterns can be matched, and the re module can also extract the data we really want from the matched text.

Many more examples could be written than appear here; whole books have been written about regular expressions (e.g., [16], [17]). There are many web sites on the subject, but the most useful is probably http://www.regular-expressions.info.

Note

A regex is a string containing both text and special characters that define a pattern that the re functions can use for matching.

Simple Searches

The simplest regex is a text string that you want to find in another string, as shown in Table 10-1.

Table 10-1. Finding a Simple String

Regex	String Matched
jumps	jumps
The Queen	The Queen
Pqr123	Pqr123
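A literal regex like those in Table 10-1 can be tried out with re.search(), which returns a match object on success and None on failure. This is a small sketch using the sample sentence from earlier in the chapter; the variable names are illustrative only.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# re.search() scans the whole string for the first place the regex matches
m = re.search('jumps', text)
if m:
    print(m.group(), 'found at offset', m.start())

# a literal search is case-sensitive unless told otherwise
missing = re.search('Jumps', text)
print(missing)
```

For a plain literal like this, the offset reported by m.start() is the same value that txt.find() would return.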

Using Special Characters

There are special characters, listed in Table 10-2, that influence how the match is to be performed.

Table 10-2. Using Special Characters

Symbols	Description	Example
literal	Match a literal string	Jumps
re1|re2	Match string re1 OR re2	Yes|No
.	Match any single character (except \n)	J.mps
^	Match start of string	^The
$	Match end of string	well$
*	Match 0 or more occurrences of preceding regex	[A-Z]*
+	Match 1 or more occurrences of preceding regex	[A-Z]+
?	Match 0 or 1 occurrences of preceding regex	[a-z0-9]?
{m,n}	Match between m and n occurrences of the preceding regex (n optional)	[0-9]{2,4}
[...]	Match any character from character class	[aeiou]
[x-y]	Match any character from range	[0-9],[A-Za-z]
[^...]	Match any character not in the character class	[^aeiou]
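The symbols in Table 10-2 can be explored interactively with re.search() and re.findall(). The sketch below tries several of them against made-up sample strings. The patterns are written as raw strings (r'...') so that Python passes any backslashes through to the re module unchanged.

```python
import re

text = "The quick brown fox jumps over the lazy dog. All is well"

starts_with_the = re.search(r'^The', text) is not None   # ^ anchors at the start
ends_with_well = re.search(r'well$', text) is not None   # $ anchors at the end
j_mps = re.findall(r'j.mps', text)                       # . is any one character
capitals = re.findall(r'[A-Z]+', text)                   # one or more capitals
colours = re.findall(r'colou?r', "color or colour")      # ? makes the u optional
numbers = re.findall(r'[0-9]{2,4}', "7 42 2016 123456")  # runs of 2 to 4 digits
answers = re.findall(r'Yes|No', "Yes, No, Maybe")        # | is alternation
not_vowels = re.findall(r'[^aeiou]', "audio")            # ^ negates the class

print(starts_with_the, ends_with_well)
print(j_mps, capitals, colours)
print(numbers, answers, not_vowels)
```

Note how the six-digit number is split in two: {2,4} takes as many digits as it is allowed (four) and the search then resumes with what is left.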

There are a number of special characters, listed in Table 10-3, that can be matched, too.

Table 10-3. Searching with Special Characters

Special Character	Description	Example
\d	Match any decimal digit	BBC\d
\w	Match any alphanumeric character or underscore	Radio\w+
\s	Match any whitespace character	The\sBBC
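The three examples in Table 10-3 can be run against a sample sentence as follows. The sentence is invented for this sketch, and the patterns are raw strings so that \d, \w, and \s reach the re module intact.

```python
import re

text = "The BBC runs BBC1, BBC2 and Radio4 in the UK"

channels = re.findall(r'BBC\d', text)    # BBC followed by exactly one digit
radio = re.findall(r'Radio\w+', text)    # Radio followed by word characters
the_bbc = re.search(r'The\sBBC', text)   # The, one whitespace character, BBC

print(channels)
print(radio)
print(the_bbc.group())
```

The first BBC in the sentence is followed by a space rather than a digit, so it does not appear in the list of channels.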

Table 10-4 gives some examples of regular expressions and the strings that they would match.

Table 10-4. Regular Expressions and Matching Strings

Regex	String(s) Matched
smith|jones	smith, jones
UNE..O	Any two characters between UNE and O; e.g., UNESCO, UNEzyO, UNE99O
^The	Any string that starts with The
end$	Any string that ends with end
c[aiou]t	cat, cit, cot, cut
[dg][io][gp]	dig, dip, dog, dop, gig, gip, gog, gop
[a-d][e-i]	Two characters: one of a, b, c, d followed by one of e, f, g, h, i
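One way to check entries like those in Table 10-4 is to test candidate strings with re.fullmatch(), which succeeds only when the whole string fits the pattern. In this sketch the last candidate in each list is a deliberate non-match added for illustration; it does not come from the table.

```python
import re

# (regex, candidate strings); the final candidate in each list should fail
samples = [
    (r'smith|jones', ['smith', 'jones', 'brown']),
    (r'UNE..O', ['UNESCO', 'UNE99O', 'UNO']),
    (r'c[aiou]t', ['cat', 'cut', 'cet']),
    (r'[dg][io][gp]', ['dig', 'gop', 'dug']),
    (r'[a-d][e-i]', ['be', 'di', 'ea']),
]

results = {}
for regex, candidates in samples:
    # keep only the candidates that the regex matches in full
    results[regex] = [c for c in candidates if re.fullmatch(regex, c)]
    print(regex, '->', results[regex])
```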

Note Regexes can use any combination of text and special characters, so they can look extremely complicated sometimes. Start simple.

Finding Patterns in Text

Finding substrings in text is fine, but often we want to find patterns in text, rather than literal strings. Suppose we wanted to extract numeric values, phone numbers, or web site URLs from text. How do we do that? This is where the real power of regular expressions lies.

Here is an example regex:

\s[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}[\s]

Can you guess what it might find? It is a regex for finding e-mail addresses in text. At first glance, this looks pretty daunting, so let’s break it down into its constituent parts.³ First, the regex refers only to uppercase letters (to reduce the length of the regex), so this assumes that the string to be searched has already been uppercased.

There are seven elements to this regex:

Element	Regex	Description
1	\s	Leading whitespace
2	[A-Z0-9._%+-]+	One or more characters from A-Z, 0-9, and . _ % + -
3	@	The @ character
4	[A-Z0-9.-]+	One or more characters from A-Z, 0-9, dot, and hyphen
5	\.	A literal dot character
6	[A-Z]{2,4}	Two to four letters
7	[\s]	Trailing whitespace
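A long regex can be easier to read when it is written with the re.VERBOSE flag, which ignores whitespace in the pattern (outside character classes) and treats # as the start of a comment. The sketch below lays out the e-mail regex shown above one element per line. The sample string is invented for the illustration.

```python
import re

# The e-mail regex, one element per line, each with its own comment
email_regex = re.compile(r"""
    \s                  # leading whitespace
    [A-Z0-9._%+-]+      # one or more name characters
    @                   # the @ character
    [A-Z0-9.-]+         # one or more domain characters
    \.                  # a literal dot
    [A-Z]{2,4}          # two to four letters
    [\s]                # trailing whitespace
""", re.VERBOSE)

sample = "WRITE TO INFO@EXAMPLE.COM TODAY"
match = email_regex.search(sample)
if match:
    # strip() removes the whitespace matched at either end
    print(match.group().strip())
```

The compiled pattern behaves the same as the one-line version; only the layout differs.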

Obviously, you need to know the rules for the pattern you search for and there are specific rules for the construction of e-mail addresses.

Here is the file remail.py.

1 import re # The RegEx library
2 #
3 # our regular expression (to find e-mails)
4 # and text to search
5 #
6 regex = r'\s[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}[\s]'
7 text="""This is some text with x@y.z embedded e-mails
8 that we'll use as@example.com
9 some lines have no email addresses
10 others@have.two valid email@addresses.com
11 The re module is awonderful@thing."""
12 print('** Search text ***\n'+text)
13 print('** Regex ***\n'+regex+'\n***')
14 #
15 # uppercase our text
16 utext=text.upper()
17 #
18 # perform a search (any emails found?)
19 s = re.search(regex,utext)
20 if s:
21     print('*** At least one email found "'+s.group()+'"')
22 #
23 # now, find all matches
24 #
25 m = re.findall(regex,utext)
26 if m:
27     for match in m:
28         print('Match found',match.strip())

Line 1 imports the re module.
Lines 6 through 13 define the regex and the text string to search, then print them both.
Line 16 uppercases the text.
Lines 19 through 21 search for the first e-mail address and print the result. Note that the match includes the leading and trailing whitespace.
Lines 25 through 28 find all matches in the text and print the results.

Note that the regex matches the e-mail address together with its whitespace boundaries. In line 21 we print the match including the trailing newline, but in line 28 we strip off the surplus characters.

What do we get when we run this code? Here is the result.

D:\LeanPython\programs\Python3>python remail.py
** Search text ***
This is some text with x@y.z embedded e-mails
that we'll use as@example.com
some lines have no email addresses
others@have.two valid email@addresses.com
The re module is awonderful@thing.
** Regex ***
\s[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}[\s]
***
*** At least one email found " AS@EXAMPLE.COM
"
Match found AS@EXAMPLE.COM
Match found OTHERS@HAVE.TWO
Match found EMAIL@ADDRESSES.COM
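As a variation, the uppercasing step can be avoided by passing the re.IGNORECASE flag, so that the uppercase-only character classes also match lowercase letters and the matches keep their original case. This is a sketch of that alternative, not a replacement for remail.py. The second search, on a made-up string, illustrates a side effect of matching the whitespace boundaries.

```python
import re

regex = r'\s[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}[\s]'
text = """This is some text with x@y.z embedded e-mails
that we'll use as@example.com
some lines have no email addresses
others@have.two valid email@addresses.com
The re module is awonderful@thing."""

# re.IGNORECASE makes [A-Z] match a-z as well, so no upper() call is needed
found = [m.group().strip() for m in re.finditer(regex, text, re.IGNORECASE)]
print(found)

# each match consumes its trailing whitespace, so when a single space separates
# two addresses the second has no leading whitespace left to match
adjacent = re.findall(regex, " one@example.com two@example.com ", re.IGNORECASE)
print(adjacent)
```

The second result contains only the first address, which is one more reason to treat this regex as a starting point rather than a finished e-mail finder.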

Capturing Parentheses

One more aspect we should mention is the use of parentheses. They can be searched for, like any other character, but they can also be used to delineate substrings that are matched, and the re module can capture these substrings and place them in a list returned by the search process. These so-called capturing parentheses feature in the following example and provide the URLs we want to extract from a page of HTML.
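As a small illustration of capturing parentheses, the sketch below wraps the two halves of an e-mail pattern in parentheses so that the name and the domain can be retrieved separately. The pattern is a simplified variant of the earlier e-mail regex (without the whitespace boundaries), chosen only for this example.

```python
import re

# two pairs of capturing parentheses: the name part and the domain part
regex = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+\.[A-Z]{2,4})'

m = re.search(regex, "CONTACT AS@EXAMPLE.COM TODAY")
if m:
    print(m.group(0))   # the whole match
    print(m.group(1))   # text captured by the first pair of parentheses
    print(m.group(2))   # text captured by the second pair

# when the regex contains groups, findall() returns a list of tuples,
# one tuple per match and one item per group
pairs = re.findall(regex, "OTHERS@HAVE.TWO VALID EMAIL@ADDRESSES.COM")
print(pairs)
```

Group 0 always refers to the entire match; the numbered groups follow the order of the opening parentheses in the regex.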

Finding Links in HTML

The following program downloads a single web page using the urllib library. The text of the downloaded HTML content is then searched using a complicated regular expression that extracts text links and provides the URL and the text of the link as seen by the user.

This program is called regex.py.

import urllib.request
import re  # The RegEx library
#
# this code opens a connection to the leanpy.com website
#
response = urllib.request.urlopen('http://leanpy.com')
# decode the response bytes into text (assumes the page is UTF-8 encoded)
data1 = response.read().decode('utf-8', errors='replace')
#
# our regular expression (to find links)
# group 1 captures the href value, group 2 the text between <a ...> and </a>
#
regex = r'<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>'
#
# compile the regex and perform the match (find all)
#
pm = re.compile(regex)
matches = pm.findall(data1)
#
# matches is a list of tuples, one per link
# m[0] - the url of the link
# m[1] - text associated with the link
#
for m in matches:
    ms = ''.join(('Link: "', m[0], '" Text: "', m[1], '"'))
    print(ms)

The output of this program is shown here.

1 D:\LeanPython\programs>python regex.py
2 200 OK
3 Link: "http://leanpy.com/" Text: "Lean Python "
4 Link: "#content" Text: "Skip to content"
5 Link: "http://leanpy.com/" Text: "Home"
6 Link: "http://leanpy.com/?page_id=33" Text: "About Lean Python "
7 Link: "http://leanpy.com/" Text: "<img src="http://leanpy.com/wp-content/uploads/2014/04/cropped-LeanPythonHeader.jpg" class="header-image" width="950" height="247" alt="" />"
8 Link: "http://leanpy.com/?p=1" Text: "The Lean Python Pocketbook"
9 Link: "http://leanpy.com/?p=1#respond" Text: "<span class="leave-reply">Leave a reply</span>"
10 Link: "http://leanpy.com/wp-content/uploads/2014/04/OnePieceCover1-e1396444631642.jpg" Text: "<img class="wp-image-17 alignleft" alt="OnePieceCover" src="http://leanpy.com/wp-content/uploads/2014/04/OnePieceCover1-e1396444631642-633x1024.jpg" width="305" height="491" />"
11 Link: "http://leanpy.com/?cat=3" Text: "Lean Python Book"
12 Link: "http://leanpy.com/?tag=book" Text: "Book"
13 Link: "http://leanpy.com/?p=1" Text: "<time class="entry-date" datetime="2014-04-02T12:06:06+00:00">April 2, 2014</time>"
14 Link: "http://leanpy.com/?author=1" Text: "paulg"
15 Link: "http://leanpy.com/?p=1" Text: "The Lean Python Pocketbook"
16 Link: "http://leanpy.com/?cat=3" Text: "Lean Python Book"
17 Link: "http://leanpy.com/wp-login.php?action=register" Text: "Register"
18 Link: "http://leanpy.com/wp-login.php" Text: "Log in"
19 Link: "http://leanpy.com/?feed=rss2" Text: "Entries <abbr title="Really Simple Syndication">RSS</abbr>"
20 Link: "http://leanpy.com/?feed=comments-rss2" Text: "Comments <abbr title="Really Simple Syndication">RSS</abbr>"
21 Link: "http://wordpress.org/" Text: "WordPress.org"
22 Link: "http://wordpress.org/" Text: "Proudly powered by WordPress"

You can see that the program identifies all the links, but isn’t yet as smart as we might like.

Line 4: This link uses a bookmark to the same page.
Line 7: The link text is actually an image (do we need to worry about that?).

Perhaps you could improve on the regex used, as an exercise.
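One possible starting point for the exercise, offered as a sketch rather than a definitive answer, is to keep the same link regex and post-process its results: skip links whose URL is only a bookmark, and use re.sub() to remove any tags nested inside the link text. The HTML string below is a made-up stand-in for a downloaded page, so the example runs without a network connection.

```python
import re

# sample HTML standing in for a downloaded page (invented for this example)
html = ('<a href="#content">Skip to content</a>'
        '<a class="x" href="http://leanpy.com/">Home</a>'
        '<a href="http://leanpy.com/?p=1#respond">'
        '<span class="leave-reply">Leave a reply</span></a>'
        '<a href="http://leanpy.com/"><img src="header.jpg" alt="" /></a>')

link_regex = re.compile(r'<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>')
tag_regex = re.compile(r'<[^>]+>')   # any single HTML tag

links = []
for url, text in link_regex.findall(html):
    if url.startswith('#'):
        continue                              # skip bookmarks within the page
    text = tag_regex.sub('', text).strip()    # drop tags inside the link text
    links.append((url, text or '[no text]'))  # image-only links have no text

for url, text in links:
    print('Link: "%s" Text: "%s"' % (url, text))
```

Regular expressions cope with tidy markup like this, but real-world HTML is often irregular, so results on live pages may vary.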

Footnotes

1 The full documentation of the Python re module can be found at https://docs.python.org/3/library/re.html. Regular expressions are an advanced topic in any programming language.

2 Often, regular expression is shortened to regex.

3 Note that this e-mail finder regex is not perfect. It would not find an address at the start of a string and it would ignore e-mail addresses with more than four characters in the trailing element (e.g., '.mobile').
