Chapter 5. Scrapy

The previous chapter presented some techniques and patterns for building large, scalable, and (most important!) maintainable web crawlers. Although this is easy enough to do by hand, many libraries, frameworks, and even GUI-based tools will do this for you, or at least try to make your life a little easier.

This chapter introduces one of the best frameworks for developing crawlers: Scrapy. During the writing of the first edition of Web Scraping with Python, Scrapy had not yet been released for Python 3.x, and its inclusion in the text was limited to a single section. Since then, the library has been updated to support Python 3.3+, additional features have been added, and I’m excited to expand this section into its own chapter.

One of the challenges of writing web crawlers is that you’re often performing the same tasks again and again: find all links on a page, evaluate the difference between internal and external links, and go to new pages. These basic patterns are useful to know and to be able to write from scratch, but the Scrapy library handles many of these details for you.

Of course, Scrapy isn’t a mind reader. You still need to define page templates, give it locations to start scraping from, and define URL patterns for the pages that you’re looking for. But in these cases, it provides a clean framework to keep your code organized.

Installing Scrapy

Scrapy is available for download from its website, which also provides instructions for installing Scrapy with third-party installation managers such as pip.

Because of its relatively large size and complexity, Scrapy is not usually a framework that can be installed in the traditional way with

$ pip install Scrapy

Note that I say “usually” because, though it is theoretically possible, I usually run into one or more tricky dependency issues, version mismatches, and unsolvable bugs.

If you’re determined to install Scrapy with pip, using a virtual environment is highly recommended (see “Keeping Libraries Straight with Virtual Environments” for more on virtual environments).

The installation method that I prefer is through the Anaconda package manager. Anaconda is a product from the company Continuum, designed to reduce the friction of finding and installing popular Python data science packages. Many of the packages it manages, such as NumPy and NLTK, will be used in later chapters as well.

After Anaconda is installed, you can install Scrapy by using this command:

conda install -c conda-forge scrapy

If you run into issues, or need up-to-date information, check out the Scrapy Installation guide for more information.

Initializing a New Spider

Once you’ve installed the Scrapy framework, a small amount of setup needs to be done for each spider. A spider is a Scrapy project that, like its arachnid namesake, is designed to crawl webs. Throughout this chapter, I use “spider” to describe a Scrapy project in particular, and “crawler” to mean “any generic program that crawls the web, using Scrapy or not.”

To create a new spider in the current directory, run the following from the command line:

$ scrapy startproject wikiSpider

This creates a new subdirectory named wikiSpider in the directory where the command was run. Inside this directory is the following file structure:

  • scrapy.cfg

  • wikiSpider

    • spiders

      • __init__.py

    • items.py

    • middlewares.py

    • pipelines.py

    • settings.py

    • __init__.py

These Python files are initialized with stub code to provide a fast means of creating a new spider project. Each section in this chapter works with this wikiSpider project.

Writing a Simple Scraper

To create a crawler, you will add a new file inside the child wikiSpider directory at wikiSpider/wikiSpider/article.py. In your newly created article.py file, write the following:

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article'

    def start_requests(self):
        urls = [
            'http://en.wikipedia.org/wiki/Python_'
            '%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python']
        return [scrapy.Request(url=url, callback=self.parse)
            for url in urls]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))

The name of this class (ArticleSpider) is different from the name of the directory (wikiSpider), indicating that this class in particular is responsible for spidering through only article pages, under the broader category of wikiSpider, which you may later want to use to search for other page types.

For large sites with many types of content, you might have separate Scrapy items for each type (blog posts, press releases, articles, etc.), each with different fields, but all running under the same Scrapy project. The name of each spider must be unique within the project.

The other key things to notice about this spider are the two functions start_requests and parse:

start_requests is a Scrapy-defined entry point to the program used to generate Request objects that Scrapy uses to crawl the website.

parse is a callback function defined by the user, and is passed to the Request object with callback=self.parse. Later, you’ll look at more-powerful things that can be done with the parse function, but for now it prints the title of the page.

You can run this article spider by navigating to the wikiSpider/wikiSpider directory and running:

$ scrapy runspider article.py

The default Scrapy output is fairly verbose. Along with debugging information, this should print out lines like the following:

2018-01-21 23:28:57 [scrapy.core.engine] DEBUG: Crawled (200)
<GET https://en.wikipedia.org/robots.txt> (referer: None)
2018-01-21 23:28:57 [scrapy.downloadermiddlewares.redirect]
DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/
Python_%28programming_language%29> from <GET http://en.wikipedia.org/
wiki/Python_%28programming_language%29>
2018-01-21 23:28:57 [scrapy.core.engine] DEBUG: Crawled (200)
<GET https://en.wikipedia.org/wiki/Functional_programming>
(referer: None)
URL is: https://en.wikipedia.org/wiki/Functional_programming
Title is: Functional programming
2018-01-21 23:28:57 [scrapy.core.engine] DEBUG: Crawled (200)
<GET https://en.wikipedia.org/wiki/Monty_Python> (referer: None)
URL is: https://en.wikipedia.org/wiki/Monty_Python
Title is: Monty Python

The scraper goes to the three pages listed as the urls, gathers information, and then terminates.

Spidering with Rules

The spider in the previous section isn’t much of a crawler, confined to scraping only the list of URLs it’s provided. It has no ability to seek new pages on its own. To turn it into a fully fledged crawler, you need to use the CrawlSpider class provided by Scrapy.

Code Organization Within the GitHub Repository

Unfortunately, the Scrapy framework cannot be easily run from within a Jupyter notebook, making a linear progression of code difficult to capture. For the purpose of presenting all code samples in the text, the scraper from the previous section is stored in the article.py file, while the following example, creating a Scrapy spider that traverses many pages, is stored in articles.py (note the use of the plural).

Later examples will also be stored in separate files, with new filenames given in each section. Make sure you are using the correct filename when running these examples.

This class can be found in articles.py in the GitHub repository:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ArticleSpider(CrawlSpider):
    name = 'articles'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/'
        'Benevolent_dictator_for_life']
    rules = [Rule(LinkExtractor(allow=r'.*'), callback='parse_items',
        follow=True)]

    def parse_items(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        text = response.xpath('//div[@id="mw-content-text"]//text()'
            ).extract()
        lastUpdated = response.css('li#footer-info-lastmod::text'
            ).extract_first()
        lastUpdated = lastUpdated.replace(
            'This page was last edited on ', '')
        print('URL is: {}'.format(url))
        print('title is: {} '.format(title))
        print('text is: {}'.format(text))
        print('Last updated: {}'.format(lastUpdated))

This new ArticleSpider extends the CrawlSpider class. Rather than providing a start_requests function, it provides a list of start_urls and allowed_domains. This tells the spider where to start crawling from and whether it should follow or ignore a link based on the domain.

A list of rules is also provided. This provides further instructions on which links to follow or ignore (in this case, you are allowing all URLs with the regular expression .*).

In addition to extracting the title and URL on each page, a couple of new items have been added. The text content of each page is extracted using an XPath selector. XPath is often used when retrieving text content including text in child tags (for example, an <a> tag inside a block of text). If you use the CSS selector to do this, all text within child tags will be ignored.
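
To see the difference in isolation, you can compare the two approaches in the Scrapy shell (scrapy shell 'https://en.wikipedia.org/wiki/Monty_Python'). This is only a sketch; the variable names are arbitrary:

# CSS '::text' applied directly to the div returns only the div's own
# text nodes; text nested inside child tags is dropped
css_text = response.css('div#mw-content-text::text').extract()

# XPath '//text()' descends into every child element, so nested text
# (such as the contents of <a> tags) is included as well
xpath_text = response.xpath('//div[@id="mw-content-text"]//text()').extract()

# The XPath list should be far longer than the CSS list
print(len(css_text), len(xpath_text))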

The last updated date string is also parsed from the page footer and stored in the lastUpdated variable.

You can run this example by navigating to the wikiSpider/wikiSpider directory and running this:

$ scrapy runspider articles.py

Warning: This Will Run Forever

This spider will run from the command line in the same way as the previous one, but it will not terminate (at least not for a very, very long time) until you halt execution by using Ctrl-C or by closing the terminal. Please be kind to Wikipedia’s server load and do not run it for long.

When run, this spider traverses wikipedia.org, following all links under the domain wikipedia.org, printing titles of pages, and ignoring all external (offsite) links:

2018-01-21 01:30:36 [scrapy.spidermiddlewares.offsite]
DEBUG: Filtered offsite request to 'www.chicagomag.com':
<GET http://www.chicagomag.com/Chicago-Magazine/June-2009/
Street-Wise/>
2018-01-21 01:30:36 [scrapy.downloadermiddlewares.robotstxt]
DEBUG: Forbidden by robots.txt: <GET https://en.wikipedia.org/w/
index.php?title=Adrian_Holovaty&action=edit&section=3>
title is: Ruby on Rails
URL is: https://en.wikipedia.org/wiki/Ruby_on_Rails
text is: ['Not to be confused with ', 'Ruby (programming language)',
 '.', '\n', '\n', 'Ruby on Rails', ... ]
Last updated:  9 January 2018, at 10:32.

This is a pretty good crawler so far, but it could use a few limits. Instead of just visiting article pages on Wikipedia, it’s free to roam to nonarticle pages as well, such as:

title is: Wikipedia:General disclaimer

Let’s take a closer look at the line that uses Scrapy’s Rule and LinkExtractor:

rules = [Rule(LinkExtractor(allow=r'.*'), callback='parse_items',
    follow=True)]

This line provides a list of Scrapy Rule objects that define the rules that all links found are filtered through. When multiple rules are in place, each link is checked against the rules in order. The first rule that matches is the one that is used to determine how the link is handled. If the link doesn’t match any rules, it is ignored.

A Rule can be provided with six arguments; the four you’ll use most often are the following:

link_extractor
The only mandatory argument, a LinkExtractor object.
callback
The function that should be used to parse the content on the page.
cb_kwargs
A dictionary of arguments to be passed to the callback function. This dictionary is formatted as {arg_name1: arg_value1, arg_name2: arg_value2} and can be a handy tool for reusing the same parsing functions for slightly different tasks.
follow
Indicates whether you want links found at that page to be included in a future crawl. If no callback function is provided, this defaults to True (after all, if you’re not doing anything with the page, it makes sense that you’d at least want to use it to continue crawling through the site). If a callback function is provided, this defaults to False.

LinkExtractor is a simple class designed solely to recognize and return links in a page of HTML content based on the rules provided to it. It has a number of arguments that can be used to accept or deny a link based on CSS and XPath selectors, tags (you can look for links in more than just anchor tags!), domains, and more.

The LinkExtractor class can even be extended, and custom arguments can be created. See Scrapy’s documentation on link extractors for more information.

Despite all the flexible features of the LinkExtractor class, the most common arguments you’ll probably use are these two (a short sketch combining them follows the list):

allow
Allow all links that match the provided regular expression.
deny
Deny all links that match the provided regular expression.
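
For example, allow and deny can be combined in a single LinkExtractor; deny takes precedence, so a link must match the allow pattern and not match the deny pattern in order to be extracted. The following is only a sketch (the deny pattern is hypothetical and not part of the wikiSpider project):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Hypothetical rule: follow article-style links, but never extract
# links to the Main Page, because deny takes precedence over allow
rules = [
    Rule(LinkExtractor(allow=r'^(/wiki/)((?!:).)*$',
                       deny=r'Main_Page'),
         callback='parse_items', follow=True),
]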

Using two separate Rule and LinkExtractor objects with a single parsing function, you can create a spider that crawls Wikipedia, identifying all article pages and flagging nonarticle pages (articlesMoreRules.py):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ArticleSpider(CrawlSpider):
    name = 'articles'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/'
        'Benevolent_dictator_for_life']
    rules = [
        Rule(LinkExtractor(allow='^(/wiki/)((?!:).)*$'),
            callback='parse_items', follow=True,
            cb_kwargs={'is_article': True}),
        Rule(LinkExtractor(allow='.*'), callback='parse_items',
            cb_kwargs={'is_article': False})
    ]

    def parse_items(self, response, is_article):
        print(response.url)
        title = response.css('h1::text').extract_first()
        if is_article:
            text = response.xpath('//div[@id="mw-content-text"]'
                '//text()').extract()
            lastUpdated = response.css('li#footer-info-lastmod'
                '::text').extract_first()
            lastUpdated = lastUpdated.replace('This page was '
                'last edited on ', '')
            print('Title is: {}'.format(title))
            print('Text is: {}'.format(text))
            print('Last updated: {}'.format(lastUpdated))
        else:
            print('This is not an article: {}'.format(title))

Recall that the rules are applied to each link in the order in which they are presented in the list. All article pages (pages whose paths start with /wiki/ and do not contain a colon) match the first rule and are passed to the parse_items function with the keyword argument is_article=True, supplied through cb_kwargs. All other, nonarticle links match the second rule and are passed to parse_items with is_article=False.

Of course, if you’re looking to collect only article-type pages and ignore all others, this approach would be impractical. It would be much easier to ignore pages that don’t match the article URL pattern and leave out the second rule (and the is_article variable) altogether. However, this type of approach may be useful in odd cases where information from the URL, or information collected during crawling, impacts the way the page should be parsed.

Creating Items

So far, you’ve looked at many ways of finding, parsing, and crawling websites with Scrapy, but Scrapy also provides useful tools to keep your collected items organized and stored in custom objects with well-defined fields.

To help organize all the information you’re collecting, you need to create an Article object. Define a new item called Article inside the items.py file.

When you open the items.py file, it should look like this:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class WikispiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

Replace this default Item stub with a new Article class extending scrapy.Item:

import scrapy

class Article(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()
    lastUpdated = scrapy.Field()

You are defining four fields that will be collected from each page: the URL, title, text content, and date the page was last edited.

If you are collecting data for multiple page types, you should define each separate type as its own class in items.py. If your items are large, or you start to move more parsing functionality into your item objects, you may also wish to extract each item into its own file. While the items are small, however, I like to keep them in a single file.
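
For instance, a second, hypothetical item type (not part of the wikiSpider project as written) could sit alongside Article in the same items.py file:

import scrapy

class Article(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()
    lastUpdated = scrapy.Field()

# Hypothetical second item type with its own set of fields
class PressRelease(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    releaseDate = scrapy.Field()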

In the file articleItems.py, note the changes that were made to the ArticleSpider class in order to create the new Article item:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wikiSpider.items import Article

class ArticleSpider(CrawlSpider):
    name = 'articleItems'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Benevolent'
        '_dictator_for_life']
    rules = [
        Rule(LinkExtractor(allow='(/wiki/)((?!:).)*$'),
            callback='parse_items', follow=True),
    ]

    def parse_items(self, response):
        article = Article()
        article['url'] = response.url
        article['title'] = response.css('h1::text').extract_first()
        article['text'] = response.xpath('//div[@id='
            '"mw-content-text"]//text()').extract()
        lastUpdated = response.css('li#footer-info-lastmod::text'
            ).extract_first()
        article['lastUpdated'] = lastUpdated.replace('This page was '
            'last edited on ', '')
        return article

When this file is run with

$ scrapy runspider articleItems.py

it will output the usual Scrapy debugging data along with each article item as a Python dictionary:

2018-01-21 22:52:38 [scrapy.spidermiddlewares.offsite] DEBUG:
Filtered offsite request to 'wikimediafoundation.org':
<GET https://wikimediafoundation.org/wiki/Terms_of_Use>
2018-01-21 22:52:38 [scrapy.core.engine] DEBUG: Crawled (200)
<GET https://en.wikipedia.org/wiki/Benevolent_dictator_for_life
#mw-head> (referer: https://en.wikipedia.org/wiki/Benevolent_
dictator_for_life)
2018-01-21 22:52:38 [scrapy.core.scraper] DEBUG: Scraped from
<200 https://en.wikipedia.org/wiki/Benevolent_dictator_for_life>
{'lastUpdated': ' 13 December 2017, at 09:26.',
'text': ['For the political term, see ',
          'Benevolent dictatorship',
          '.',
          ...

Using Scrapy Items isn’t just for promoting good code organization or laying things out in a readable way. Items provide many tools for outputting and processing data, covered in the next sections.

Outputting Items

Scrapy uses the Item objects to determine which pieces of information it should save from the pages it visits. This information can be saved by Scrapy in a variety of ways, such as CSV, JSON, or XML files, using the following commands:

$ scrapy runspider articleItems.py -o articles.csv -t csv
$ scrapy runspider articleItems.py -o articles.json -t json
$ scrapy runspider articleItems.py -o articles.xml -t xml

Each of these runs the scraper articleItems and writes the output in the specified format to the provided file. This file will be created if it does not exist already.

You may have noticed that in the articles spider created in previous examples, the text variable is a list of strings rather than a single string. Each string in this list represents text inside a single HTML element, whereas the content inside <div id="mw-content-text">, from which you are collecting the text data, is composed of many child elements.

Scrapy manages these more complex values well. In the CSV format, for example, it converts lists to strings and escapes all commas so that a list of text displays in a single CSV cell.

In XML, each element of this list is preserved inside child value tags:

<items>
<item>
    <url>https://en.wikipedia.org/wiki/Benevolent_dictator_for_life</url>
    <title>Benevolent dictator for life</title>
    <text>
        <value>For the political term, see </value>
        <value>Benevolent dictatorship</value>
        ...
    </text>
    <lastUpdated> 13 December 2017, at 09:26.</lastUpdated>
</item>
....

In the JSON format, lists are preserved as lists.

Of course, you can use the Item objects yourself and write them to a file or a database in whatever way you want, simply by adding the appropriate code to the parsing function in the crawler.
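
For example, here is a rough sketch of a parsing function that appends each item to a JSON Lines file itself instead of relying on Scrapy’s exporters. It assumes the Article item defined earlier, and the articles.jl filename is arbitrary; the method belongs inside the spider class:

import json

def parse_items(self, response):
    article = Article()
    article['url'] = response.url
    article['title'] = response.css('h1::text').extract_first()
    # Convert the item to a plain dict and append it to a JSON Lines file
    with open('articles.jl', 'a') as f:
        f.write(json.dumps(dict(article)) + '\n')
    return article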

The Item Pipeline

Although Scrapy is single threaded, it is capable of making and handling many requests asynchronously. This makes it faster than the scrapers written so far in this book, although I have always been a firm believer that faster is not always better when it comes to web scraping.

The web server for the site you are trying to scrape must handle each of these requests, and it’s important to be a good citizen and evaluate whether this sort of server hammering is appropriate (or even wise for your own self-interests, as many websites have the ability and the will to block what they might see as malicious scraping activity). For more information about the ethics of web scraping, as well as the importance of appropriately throttling scrapers, see Chapter 18.

With that said, using Scrapy’s item pipeline can improve the speed of your web scraper even further by performing all data processing while waiting for requests to be returned, rather than waiting for data to be processed before making another request. This type of optimization can sometimes even be necessary when data processing requires a great deal of time or processor-heavy calculations must be performed.

To create an item pipeline, revisit the settings.py file that was created at the beginning of the chapter. You should see the following commented lines:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'wikiSpider.pipelines.WikispiderPipeline': 300,
#}

Uncomment the last three lines and replace with the following:

ITEM_PIPELINES = {
    'wikiSpider.pipelines.WikispiderPipeline': 300,
}

This provides a Python class, wikiSpider.pipelines.WikispiderPipeline, that will be used to process the data, as well as an integer that represents the order in which to run the pipeline if there are multiple processing classes. Although any integer can be used here, the numbers 0–1000 are typically used, and will be run in ascending order.
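
For example, if you later wrote a second pipeline class (the DatabasePipeline name here is purely hypothetical), the integers would control the order in which items flow through the pipelines:

ITEM_PIPELINES = {
    'wikiSpider.pipelines.WikispiderPipeline': 300,
    # Hypothetical second stage; 800 > 300, so it runs after
    # WikispiderPipeline
    'wikiSpider.pipelines.DatabasePipeline': 800,
}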

Now you need to add the pipeline class and rewrite your original spider so that the spider collects data and the pipeline does the heavy lifting of the data processing. It might be tempting to write the parse_items method in your original spider to return the response and let the pipeline create the Article object:

    def parse_items(self, response):
        return response

However, the Scrapy framework does not allow this, and an Item object (such as an Article, which extends Item) must be returned. So the goal of parse_items is now to extract the raw data, doing as little processing as possible, so that it can be passed to the pipeline:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wikiSpider.items import Article

class ArticleSpider(CrawlSpider):
    name = 'articlePipelines'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Benevolent_dictator_for_life']
    rules = [
        Rule(LinkExtractor(allow='(/wiki/)((?!:).)*$'),
            callback='parse_items', follow=True),
    ]

    def parse_items(self, response):
        article = Article()
        article['url'] = response.url
        article['title'] = response.css('h1::text').extract_first()
        article['text'] = response.xpath('//div[@id='
            '"mw-content-text"]//text()').extract()
        article['lastUpdated'] = response.css('li#'
            'footer-info-lastmod::text').extract_first()
        return article

This file is saved as articlePipelines.py in the GitHub repository.

Of course, now you need to tie the pipelines.py file and the updated spider together by adding the pipeline. When the Scrapy project was first initialized, a file was created at wikiSpider/wikiSpider/pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class WikispiderPipeline(object):
    def process_item(self, item, spider):
        return item

This stub class should be replaced with your new pipeline code. In previous sections, you’ve been collecting two fields in a raw format, and these could use additional processing: lastUpdated (which is a badly formatted string object representing a date) and text (a messy array of string fragments).

The following should be used to replace the stub code in wikiSpider/wikiSpider/pipelines.py:

from datetime import datetime
from wikiSpider.items import Article
from string import whitespace

class WikispiderPipeline(object):
    def process_item(self, article, spider):
        dateStr = article['lastUpdated']
        dateStr = dateStr.replace('This page was last edited on', '')
        dateStr = dateStr.strip()
        article['lastUpdated'] = datetime.strptime(
            dateStr, '%d %B %Y, at %H:%M.')
        article['text'] = [line for line in article['text']
            if line not in whitespace]
        article['text'] = ''.join(article['text'])
        return article

The class WikispiderPipeline has a process_item method that takes in an Article object, parses the lastUpdated string into a Python datetime object, and cleans and joins the list of text fragments into a single string.

process_item is a mandatory method for every pipeline class. Scrapy uses this method to asynchronously pass Items that are collected by the spider. The parsed Article object that is returned here will be logged or printed by Scrapy if, for example, you are outputting items to JSON or CSV as was done in the previous section.

You now have two choices when it comes to deciding where to do your data processing: the parse_items method in the spider, or the process_item method in the pipeline.

Multiple pipelines with different tasks can be declared in the settings.py file. However, Scrapy passes all items, regardless of item type, to each pipeline in order. Item-specific parsing may be better handled in the spider, before the data hits the pipeline. However, if this parsing takes a long time, you may want to consider moving it to the pipeline (where it can be processed asynchronously) and adding a check on the item type:

def process_item(self, item, spider):
    if isinstance(item, Article):
        # Article-specific processing here
        pass
    return item

Which processing to do and where to do it is an important consideration when it comes to writing Scrapy projects, especially large ones.

Logging with Scrapy

The debug information generated by Scrapy can be useful, but, as you’ve likely noticed, it is often too verbose. You can easily adjust the level of logging by adding a line to the settings.py file in your Scrapy project:

LOG_LEVEL = 'ERROR'

Scrapy uses a standard hierarchy of logging levels, as follows:

  • CRITICAL

  • ERROR

  • WARNING

  • INFO

  • DEBUG

If logging is set to ERROR, only CRITICAL and ERROR logs will be displayed. If logging is set to DEBUG, all logs will be displayed, and so on.

In addition to controlling logging through the settings.py file, you can control where the logs go from the command line. To output logs to a separate logfile instead of the terminal, define a logfile when running from the command line:

$ scrapy crawl articles -s LOG_FILE=wiki.log

This creates a new logfile, if one does not exist, in your current directory and outputs all logs to it, leaving your terminal clear to display only the Python print statements you manually add.
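
If you would rather route your own messages through Scrapy’s logging system instead of using print, every spider exposes a standard Python logger as self.logger. A minimal sketch, reusing the parse_items method name from the earlier examples:

def parse_items(self, response):
    # Messages sent through self.logger respect LOG_LEVEL and LOG_FILE,
    # unlike bare print statements
    self.logger.info('Scraped article at %s', response.url)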

More Resources

Scrapy is a powerful tool that handles many problems associated with crawling the web. It automatically gathers all URLs and compares them against predefined rules, makes sure all URLs are unique, normalizes relative URLs where needed, and recurses to go more deeply into pages.

Although this chapter hardly scratches the surface of Scrapy’s capabilities, I encourage you to check out the Scrapy documentation as well as Learning Scrapy, by Dimitrios Kouzis-Loukas (O’Reilly), which provides a comprehensive discourse on the framework.

Scrapy is an extremely large and sprawling library with many features. Its features work together seamlessly, but have many areas of overlap that allow users to easily develop their own particular style within it. If there’s something you’d like to do with Scrapy that has not been mentioned here, there is likely a way (or several) to do it!