Chapter 15. Testing Your Website with Scrapers

When working with web projects that have a large development stack, it’s often only the “back” of the stack that ever gets tested regularly. Most programming languages today (including Python) have some type of test framework, but website frontends are often left out of these automated tests, although they might be the only customer-facing part of the project.

Part of the problem is that websites are often a mishmash of many markup languages and programming languages. You can write unit tests for sections of your JavaScript, but it’s useless if the HTML it’s interacting with has changed in such a way that the JavaScript doesn’t have the intended action on the page, even if it’s working correctly.

The problem of frontend website testing has often been left as an afterthought, or delegated to lower-level programmers armed with, at most, a checklist and a bug tracker. However, with just a little more upfront effort, you can replace this checklist with a series of unit tests, and replace human eyes with a web scraper.

Imagine: test-driven development for web development. Daily tests to make sure all parts of the web interface are functioning as expected. A suite of tests run every time someone adds a new website feature, or changes the position of an element. This chapter covers the basics of testing and how to test all sorts of websites, from simple to complicated, with Python-based web scrapers.

An Introduction to Testing

If you’ve never written tests for your code before, there’s no better time to start than now. Having a suite of tests that can be run to ensure that your code performs as expected (at least, as far as you’ve written tests for) saves you time and worry and makes releasing new updates easy.

What Are Unit Tests?

The words test and unit test are often used interchangeably. Often, when programmers refer to “writing tests,” what they really mean is “writing unit tests.” On the other hand, when some programmers refer to writing unit tests, they’re really writing some other kind of test.

Although definitions and practices tend to vary from company to company, a unit test generally has the following characteristics:

Each unit test tests one aspect of the functionality of a component. For example, it might ensure that the appropriate error message is thrown if a negative number of dollars is withdrawn from a bank account.

Often, unit tests are grouped together in the same class, based on the component they are testing. You might have the test for a negative dollar value being withdrawn from a bank account, followed by a unit test for the behavior of an overdrawn bank account.
Each unit test can be run completely independently, and any setup or teardown required for the unit test must be handled by the unit test itself. Similarly, unit tests must not interfere with the success or failure of other tests, and they must be able to run successfully in any order.
Each unit test usually contains at least one assertion. For example, a unit test might assert that the answer to 2 + 2 is 4. Occasionally, a unit test might contain only a failure state. For example, it might fail if an exception is thrown, but pass by default if everything goes smoothly.
Unit tests are separated from the bulk of the code. Although they necessarily need to import and use the code they are testing, they are generally kept in separate classes and directories.
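
The characteristics above can be made concrete with a minimal sketch. It uses Python’s built-in unittest module (covered in the next section) and a hypothetical BankAccount class invented purely for illustration; the class name, its withdraw method, and its error messages are assumptions, not part of any real library.

```python
import unittest


class BankAccount:
    """Hypothetical component under test (an assumption for illustration)."""

    def __init__(self, balance=0):
        self.balance = balance

    def withdraw(self, dollars):
        if dollars < 0:
            raise ValueError('Cannot withdraw a negative number of dollars')
        if dollars > self.balance:
            raise ValueError('Insufficient funds')
        self.balance -= dollars


class TestBankAccount(unittest.TestCase):
    # Tests for one component are grouped together in one class.

    def setUp(self):
        # Every test builds its own fresh account, so no test depends on
        # another and the tests can run in any order.
        self.account = BankAccount(balance=100)

    def test_negative_withdrawal_raises(self):
        # Tests one aspect: a negative withdrawal is rejected.
        with self.assertRaises(ValueError):
            self.account.withdraw(-5)

    def test_overdraw_raises(self):
        # Tests a different aspect: an overdrawn account is rejected.
        with self.assertRaises(ValueError):
            self.account.withdraw(500)

    def test_valid_withdrawal_reduces_balance(self):
        self.account.withdraw(30)
        self.assertEqual(70, self.account.balance)
```

Saved to a file, these tests can be run with python -m unittest followed by the filename. In a real project, BankAccount would live in its own module and the test class would import it, keeping the tests separate from the code they exercise.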

Although many other types of tests can be written—integration tests and validation tests, for example—this chapter primarily focuses on unit testing. Not only have unit tests become extremely popular, with recent pushes toward test-driven development, but their length and flexibility make them easy to work with as examples, and Python has some built-in unit testing capabilities, as you’ll see in the next section.

Python unittest

Python’s unit-testing module, unittest, comes packaged with all standard Python installations. Just import and extend unittest.TestCase, and it will do the following:

Provide setUp and tearDown functions that run before and after each unit test
Provide several types of “assert” statements to allow tests to pass or fail
Run all methods whose names begin with test (by convention, test_) as unit tests, and ignore methods that lack this prefix

The following provides a simple unit test for ensuring that 2 + 2 = 4, according to Python:

import unittest

class TestAddition(unittest.TestCase):
    def setUp(self):
        print('Setting up the test')

    def tearDown(self):
        print('Tearing down the test')

    def test_twoPlusTwo(self):
        total = 2+2
        self.assertEqual(4, total)

if __name__ == '__main__':
    unittest.main()

Although setUp and tearDown don’t provide any useful functionality here, they are included for the purposes of illustration. Note that these functions are run before and after each individual test, not before and after all the tests in the class.

The output of the test function, when run from the command line, should look like this:

Setting up the test
Tearing down the test
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

This indicates that the test ran successfully, and 2 + 2 does indeed equal 4.

Running unittest in Jupyter Notebooks

The unit test scripts in this chapter are all kicked off with:

if __name__ == '__main__':
    unittest.main()

The condition __name__ == '__main__' is true only if the file is executed directly by Python, and not loaded via an import statement. This allows you to run your unit test, using the unittest.TestCase class that it extends, directly from the command line.

In a Jupyter notebook, things are a little bit different. The argv parameters created by Jupyter can cause errors in the unit test, and, because the unittest framework exits Python by default after the test is run (which causes problems in the notebook kernel), we also need to prevent that from happening.

In the Jupyter notebooks, you will use the following to launch unit tests:

if __name__ == '__main__':
    unittest.main(argv=[''], exit=False)
    %reset

The second line passes a single empty string as the argv list (the command-line arguments), which is ignored by unittest.main. It also sets exit=False, which prevents unittest from exiting after the test is run.

The %reset line is useful because it resets the memory and destroys all user-created variables in the Jupyter notebook. Without it, each unit test you write in the notebook will contain all of the methods from all other previously run tests that also inherited unittest.TestCase, including setUp and tearDown methods. This also means that each unit test would run all of the methods from the unit tests before it!

Using %reset does create one extra manual step for the user when running the tests. When running the test, the notebook will prompt the user and ask if they’re sure they want to reset the memory. Simply type y and hit Enter to do this.
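
One possible alternative to %reset, offered as a sketch rather than the approach used in this chapter’s notebooks, is to load the tests from one specific class and hand them to a runner directly. Only the named class is loaded, so TestCase subclasses left over from earlier cells are not picked up, and nothing tries to exit the interpreter. The run_test_class helper and the TestLeftover class are hypothetical names used for illustration.

```python
import unittest


class TestAddition(unittest.TestCase):
    def test_twoPlusTwo(self):
        self.assertEqual(4, 2 + 2)


class TestLeftover(unittest.TestCase):
    # Stands in for a test class defined in an earlier notebook cell.
    def test_somethingElse(self):
        self.assertTrue(True)


def run_test_class(test_class):
    """Run only the tests defined on test_class and return the result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_class)
    return unittest.TextTestRunner(verbosity=2).run(suite)


result = run_test_class(TestAddition)
print(result.testsRun)  # 1: TestLeftover's test was not loaded
```

The trade-off is that each class must be named explicitly, whereas unittest.main discovers every test class in the module on its own.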

Testing Wikipedia

Testing the frontend of your website (excluding JavaScript, which we’ll cover next) is as simple as combining the Python unittest library with a web scraper:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import unittest

class TestWikipedia(unittest.TestCase):
    bs = None

    @classmethod
    def setUpClass(cls):
        # Load and parse the page once for all tests in this class
        url = 'http://en.wikipedia.org/wiki/Monty_Python'
        cls.bs = BeautifulSoup(urlopen(url), 'html.parser')

    def test_titleText(self):
        pageTitle = TestWikipedia.bs.find('h1').get_text()
        self.assertEqual('Monty Python', pageTitle)

    def test_contentExists(self):
        content = TestWikipedia.bs.find('div',{'id':'mw-content-text'})
        self.assertIsNotNone(content)


if __name__ == '__main__':
    unittest.main()

There are two tests this time: the first tests whether the title of the page is the expected “Monty Python,” and the second makes sure that the page has a content div.

Note that the content of the page is loaded only once, and that the class attribute bs is shared between tests. This is accomplished by using the unittest-specified function setUpClass, which is run only once at the start of the class (unlike setUp, which is run before every individual test). Using setUpClass instead of setUp saves unnecessary page loads; you can grab the content once and run multiple tests on it.

One major architectural difference between setUpClass and setUp, besides just when and how often they’re run, is that setUpClass is a class-level method (unittest documents it as a class method, declared with @classmethod) that “belongs” to the class itself and works with attributes shared by the whole class, whereas setUp is an instance method that belongs to a particular instance of the class. This is why setUp can set attributes on self (the particular instance of that class), while setUpClass can set attributes only on the class TestWikipedia itself.

Although testing a single page at a time might not seem all that powerful or interesting, as you may recall from Chapter 3, it is relatively easy to build web crawlers that can iteratively move through all pages of a website. What happens when you combine a web crawler with a unit test that makes an assertion about each page?

There are many ways to run a test repeatedly, but you must be careful to load each page only once for each set of tests you want to run on the page, and you must also avoid holding large amounts of information in memory at once. The following setup does just that:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import unittest
import re
import random
from urllib.parse import unquote

class TestWikipedia(unittest.TestCase):

    def test_PageProperties(self):
        self.url = 'http://en.wikipedia.org/wiki/Monty_Python'
        # Test the first 10 pages we encounter
        for i in range(10):
            self.bs = BeautifulSoup(urlopen(self.url), 'html.parser')
            titles = self.titleMatchesURL()
            self.assertEqual(titles[0], titles[1])
            self.assertTrue(self.contentExists())
            self.url = self.getNextLink()
        print('Done!')

    def titleMatchesURL(self):
        pageTitle = self.bs.find('h1').get_text()
        urlTitle = self.url[(self.url.index('/wiki/')+6):]
        urlTitle = urlTitle.replace('_', ' ')
        urlTitle = unquote(urlTitle)
        return [pageTitle.lower(), urlTitle.lower()]

    def contentExists(self):
        content = self.bs.find('div', {'id': 'mw-content-text'})
        return content is not None

    def getNextLink(self):
        #Returns random link on page, using technique from Chapter 3
        links = self.bs.find('div', {'id':'bodyContent'}).find_all(
            'a', href=re.compile('^(/wiki/)((?!:).)*$'))
        randomLink = random.SystemRandom().choice(links)
        return 'https://wikipedia.org{}'.format(randomLink.attrs['href'])

if __name__ == '__main__':
    unittest.main()

There are a few things to notice. First, there is only one actual test in this class. The other functions are technically only helper functions, even though they’re doing the bulk of the computational work to determine whether a test passes. Because only the test function contains assertion statements, the helpers pass their results back to it, and the assertions happen there.

Also, while contentExists returns a boolean, titleMatchesURL returns the values themselves back for evaluation. To see why you would want to pass values back rather than just a boolean, compare the results of a boolean assertion:

======================================================================
FAIL: test_PageProperties (__main__.TestWikipedia)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "15-3.py", line 22, in test_PageProperties
    self.assertTrue(self.titleMatchesURL())
AssertionError: False is not true

with the results of an equality assertion (assertEqual, for which assertEquals is a deprecated alias):

======================================================================
FAIL: test_PageProperties (__main__.TestWikipedia)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "15-3.py", line 23, in test_PageProperties
    self.assertEquals(titles[0], titles[1])
AssertionError: 'lockheed u-2' != 'u-2 spy plane'

Which one is easier to debug? (In this case, the error is occurring because of a redirect, when the article http://wikipedia.org/wiki/u-2%20spy%20plane redirects to an article titled “Lockheed U-2.”)

Testing with Selenium

As with Ajax scraping in Chapter 11, JavaScript presents particular challenges when doing website testing. Fortunately, Selenium has an excellent framework in place for handling particularly complicated websites; in fact, the library was originally designed for website testing!

Although obviously written in the same language, the syntax of Python unit tests and that of Selenium tests have surprisingly little in common. Selenium tests do not need to be contained as functions within classes; they can use Python’s built-in assert statement, which does not require parentheses; and they pass silently, producing some kind of message only on a failure:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://en.wikipedia.org/wiki/Monty_Python')
assert 'Monty Python' in driver.title
driver.close()

When run, this test should produce no output.

In this way, Selenium tests can be written more casually than Python unit tests, and assert statements can even be integrated into regular code, where it is desirable for code execution to terminate if some condition is not met.
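
As a sketch of how an assert statement can sit inside regular code, the following function stops execution with an informative message when a page title is not what was expected. In a Selenium script the title would come from driver.title; here it is passed in as a plain string so that the example runs without a browser. The check_title function is a hypothetical helper, not part of Selenium.

```python
def check_title(title, expected='Monty Python'):
    """Stop execution with an informative message if the title is wrong."""
    # The text after the comma becomes the AssertionError message.
    assert expected in title, f'Expected {expected!r} in title, got {title!r}'
    return title


# Passes silently, exactly like the Selenium test above.
check_title('Monty Python - Wikipedia')

# Fails, raising an AssertionError that carries the message.
try:
    check_title('Page not found')
except AssertionError as error:
    print('Assertion failed:', error)
```

One caveat with this style is that Python skips assert statements entirely when it is run with the -O flag, so they are better suited to tests than to checks that production code relies on.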

Interacting with the Site

Recently, I wanted to contact a local small business through its website’s contact form but found that the HTML form was broken; nothing happened when I clicked the submit button. After a little investigation, I saw they were using a simple mailto form that was designed to send them an email with the form’s contents. Fortunately, I was able to use this information to send them an email, explain the problem with their form, and hire them, despite the technical issue.

If I were to write a traditional scraper that used or tested this form, my scraper would likely just copy the layout of the form and send an email directly—bypassing the form altogether. How could I test the functionality of the form and ensure that it was working perfectly through a browser?

Although previous chapters have discussed navigating links, submitting forms, and other types of interaction-like activity, at its core everything we’ve done is designed to bypass the browser interface, not use it. Selenium, on the other hand, can literally enter text, click buttons, and do everything through the browser (in this case, the headless Chrome browser), and detect things like broken forms, badly coded JavaScript, HTML typos, and other issues that might stymie actual customers.

Key to this sort of testing is the concept of Selenium elements. This object was briefly encountered in Chapter 11, and is returned by calls like this:

from selenium.webdriver.common.by import By

usernameField = driver.find_element(By.NAME, 'username')

Just as there are numerous actions you can take on various elements of a website in your browser, there are many actions Selenium can perform on any given element. Among these are the following:

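myElement.click()
myElement.clear()
myElement.send_keys('content to enter')
myElement.submit()
# Other actions are performed on an element through an ActionChains object:
ActionChains(driver).click_and_hold(myElement).perform()
ActionChains(driver).release(myElement).perform()
ActionChains(driver).double_click(myElement).perform()
ActionChains(driver).send_keys_to_element(myElement, 'content to enter').perform()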

In addition to performing a one-time action on an element, strings of actions can be combined into action chains, which can be stored and executed once or multiple times in a program. Action chains are useful in that they can be a convenient way to string long sets of multiple actions, but they are functionally identical to calling the action explicitly on the element, as in the preceding examples.

To see this difference, take a look at the form page at http://pythonscraping.com/pages/files/form.html (which was previously used as an example in Chapter 10). We can fill out the form and submit it in the following way:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(
    service=Service(executable_path='drivers/chromedriver'),
    options=chrome_options)
driver.get('http://pythonscraping.com/pages/files/form.html')

firstnameField = driver.find_element(By.NAME, 'firstname')
lastnameField = driver.find_element(By.NAME, 'lastname')
submitButton = driver.find_element(By.ID, 'submit')

# Use one method or the other: once the form has been submitted,
# the fields located above are no longer on the page.

### METHOD 1 ###
firstnameField.send_keys('Ryan')
lastnameField.send_keys('Mitchell')
submitButton.click()
################

### METHOD 2 ###
# actions = (ActionChains(driver).click(firstnameField)
#     .send_keys('Ryan')
#     .click(lastnameField)
#     .send_keys('Mitchell')
#     .send_keys(Keys.RETURN))
# actions.perform()
################

print(driver.find_element(By.TAG_NAME, 'body').text)

driver.close()

Method 1 calls send_keys on the two fields and then clicks the submit button. Method 2 uses a single action chain to click and enter text in each field, which happens in a sequence after the perform method is called. This script operates in the same way, whether the first method or the second method is used, and prints this line:

Hello there, Ryan Mitchell!

There is another variation in the two methods, in addition to the objects they use to handle the commands: notice that the first method clicks the Submit button, while the second uses the Return keystroke to submit the form while the text box is still selected. Because there are many ways to think about the sequence of events that complete the same action, there are many ways to complete the same action using Selenium.

Drag and drop

Clicking buttons and entering text is one thing, but where Selenium really shines is in its ability to deal with relatively novel forms of web interaction. Selenium allows for the manipulation of drag-and-drop interfaces with ease. Using its drag-and-drop function requires you to specify a source element (the element to be dragged) and either an offset to drag it across, or a target element to drag it to.

The demo page located at http://pythonscraping.com/pages/javascript/draggableDemo.html presents an example of this type of interface:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import unittest

class TestDragAndDrop(unittest.TestCase):
    driver = None

    def setUp(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        self.driver = webdriver.Chrome(
            service=Service(executable_path='drivers/chromedriver'),
            options=chrome_options)
        url = 'http://pythonscraping.com/pages/javascript/draggableDemo.html'
        self.driver.get(url)

    def tearDown(self):
        # Close the browser that was opened in setUp
        self.driver.close()

    def test_drag(self):
        element = self.driver.find_element(By.ID, 'draggable')
        target = self.driver.find_element(By.ID, 'div2')
        actions = ActionChains(self.driver)
        actions.drag_and_drop(element, target).perform()
        self.assertEqual('You are definitely not a bot!',
                         self.driver.find_element(By.ID, 'message').text)

if __name__ == '__main__':
    unittest.main()

The message div on the demo page displays two messages. Before the drag, it says

Prove you are not a bot, by dragging the square from the blue area to the red 
area!

Then, after the task is completed, the content of the div changes to the message that the test asserts:

You are definitely not a bot!

As the demo page suggests, dragging elements to prove you’re not a bot is a common theme in many CAPTCHAs. Although bots have been able to drag objects around for a long time (it’s just a matter of clicking, holding, and moving), somehow the idea of using “drag this” as a verification of humanity just won’t die.

In addition, these draggable CAPTCHA libraries rarely use any difficult-for-bots tasks, like “drag the picture of the kitten onto the picture of the cow” (which requires you to identify the pictures as “a kitten” and “a cow,” while parsing instructions); instead, they often involve number ordering or some other fairly trivial task like the one in the preceding example.

Of course, their strength lies in the fact that there are so many variations, and they are so infrequently used; no one will likely bother making a bot that can defeat all of them. At any rate, this example should be enough to illustrate why you should never use this technique for large-scale websites.

Taking screenshots

In addition to the usual testing capabilities, Selenium has an interesting trick up its sleeve that might make your testing (or impressing your boss) a little easier: screenshots. Yes, photographic evidence can be created from unit tests run without the need for actually pressing the PrtScn key:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.pythonscraping.com/')
driver.get_screenshot_as_file('tmp/pythonscraping.png')
driver.close()

This script navigates to http://pythonscraping.com and then stores a screenshot of the home page in the local tmp folder (the folder must already exist for this to store correctly). Selenium writes the screenshot as a PNG image, so the filename should end in .png.

unittest or Selenium?

The syntactical rigor and verboseness of Python unittest might be desirable for most large test suites, while the flexibility and power of a Selenium test might be your only option for testing some website features. So which to use?

Here’s the secret: you don’t have to choose. Selenium can easily be used to obtain information about a website, and unittest can evaluate whether that information meets the criteria for passing the test. There is no reason you can’t import Selenium tools into Python unittest, combining the best of both worlds.

For example, the following script creates a unit test for a website’s draggable interface, asserting that it correctly says, “You are definitely not a bot!” after one element has been dragged to another:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import unittest

class TestDragAndDrop(unittest.TestCase):
    driver = None

    def setUp(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        self.driver = webdriver.Chrome(
            service=Service(executable_path='drivers/chromedriver'),
            options=chrome_options)
        url = 'http://pythonscraping.com/pages/javascript/draggableDemo.html'
        self.driver.get(url)

    def tearDown(self):
        self.driver.close()

    def test_drag(self):
        element = self.driver.find_element(By.ID, 'draggable')
        target = self.driver.find_element(By.ID, 'div2')
        actions = ActionChains(self.driver)
        actions.drag_and_drop(element, target).perform()
        self.assertEqual('You are definitely not a bot!',
            self.driver.find_element(By.ID, 'message').text)

if __name__ == '__main__':
    unittest.main()

Virtually anything on a website can be tested with the combination of Python’s unittest and Selenium. In fact, combined with some of the image-processing libraries from Chapter 13, you can even take a screenshot of the website and test on a pixel-by-pixel basis what it should contain!
