Table of Contents for
Web Scraping with Python, 2nd Edition

Web Scraping with Python, 2nd Edition, by Ryan Mitchell. Published by O'Reilly Media, Inc., 2018.
  1. Cover
  2. Web Scraping with Python
  3. Preface
  4. I. Building Scrapers
  5. 1. Your First Web Scraper
  6. 2. Advanced HTML Parsing
  7. 3. Writing Web Crawlers
  8. 4. Web Crawling Models
  9. 5. Scrapy
  10. 6. Storing Data
  11. II. Advanced Scraping
  12. 7. Reading Documents
  13. 8. Cleaning Your Dirty Data
  14. 9. Reading and Writing Natural Languages
  15. 10. Crawling Through Forms and Logins
  16. 11. Scraping JavaScript
  17. 12. Crawling Through APIs
  18. 13. Image Processing and Text Recognition
  19. 14. Avoiding Scraping Traps
  20. 15. Testing Your Website with Scrapers
  21. 16. Web Crawling in Parallel
  22. 17. Scraping Remotely
  23. 18. The Legalities and Ethics of Web Scraping
  24. Index
  25. About the Author
  26. Colophon
  1. Preface
    1. What Is Web Scraping?
    2. Why Web Scraping?
    3. About This Book
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Safari
    7. How to Contact Us
    8. Acknowledgments
  2. I. Building Scrapers
  3. 1. Your First Web Scraper
    1. Connecting
    2. An Introduction to BeautifulSoup
      1. Installing BeautifulSoup
      2. Running BeautifulSoup
      3. Connecting Reliably and Handling Exceptions
  4. 2. Advanced HTML Parsing
    1. You Don’t Always Need a Hammer
    2. Another Serving of BeautifulSoup
      1. find() and find_all() with BeautifulSoup
      2. Other BeautifulSoup Objects
      3. Navigating Trees
    3. Regular Expressions
    4. Regular Expressions and BeautifulSoup
    5. Accessing Attributes
    6. Lambda Expressions
  5. 3. Writing Web Crawlers
    1. Traversing a Single Domain
    2. Crawling an Entire Site
      1. Collecting Data Across an Entire Site
    3. Crawling Across the Internet
  6. 4. Web Crawling Models
    1. Planning and Defining Objects
    2. Dealing with Different Website Layouts
    3. Structuring Crawlers
      1. Crawling Sites Through Search
      2. Crawling Sites Through Links
      3. Crawling Multiple Page Types
    4. Thinking About Web Crawler Models
  7. 5. Scrapy
    1. Installing Scrapy
      1. Initializing a New Spider
    2. Writing a Simple Scraper
    3. Spidering with Rules
    4. Creating Items
    5. Outputting Items
    6. The Item Pipeline
    7. Logging with Scrapy
    8. More Resources
  8. 6. Storing Data
    1. Media Files
    2. Storing Data to CSV
    3. MySQL
      1. Installing MySQL
      2. Some Basic Commands
      3. Integrating with Python
      4. Database Techniques and Good Practice
      5. “Six Degrees” in MySQL
    4. Email
  9. II. Advanced Scraping
  10. 7. Reading Documents
    1. Document Encoding
    2. Text
      1. Text Encoding and the Global Internet
    3. CSV
      1. Reading CSV Files
    4. PDF
    5. Microsoft Word and .docx
  11. 8. Cleaning Your Dirty Data
    1. Cleaning in Code
      1. Data Normalization
    2. Cleaning After the Fact
      1. OpenRefine
  12. 9. Reading and Writing Natural Languages
    1. Summarizing Data
    2. Markov Models
      1. Six Degrees of Wikipedia: Conclusion
    3. Natural Language Toolkit
      1. Installation and Setup
      2. Statistical Analysis with NLTK
      3. Lexicographical Analysis with NLTK
    4. Additional Resources
  13. 10. Crawling Through Forms and Logins
    1. Python Requests Library
    2. Submitting a Basic Form
    3. Radio Buttons, Checkboxes, and Other Inputs
    4. Submitting Files and Images
    5. Handling Logins and Cookies
      1. HTTP Basic Access Authentication
    6. Other Form Problems
  14. 11. Scraping JavaScript
    1. A Brief Introduction to JavaScript
      1. Common JavaScript Libraries
    2. Ajax and Dynamic HTML
      1. Executing JavaScript in Python with Selenium
      2. Additional Selenium Webdrivers
    3. Handling Redirects
    4. A Final Note on JavaScript
  15. 12. Crawling Through APIs
    1. A Brief Introduction to APIs
      1. HTTP Methods and APIs
      2. More About API Responses
    2. Parsing JSON
    3. Undocumented APIs
      1. Finding Undocumented APIs
      2. Documenting Undocumented APIs
      3. Finding and Documenting APIs Automatically
    4. Combining APIs with Other Data Sources
    5. More About APIs
  16. 13. Image Processing and Text Recognition
    1. Overview of Libraries
      1. Pillow
      2. Tesseract
      3. NumPy
    2. Processing Well-Formatted Text
      1. Adjusting Images Automatically
      2. Scraping Text from Images on Websites
    3. Reading CAPTCHAs and Training Tesseract
      1. Training Tesseract
    4. Retrieving CAPTCHAs and Submitting Solutions
  17. 14. Avoiding Scraping Traps
    1. A Note on Ethics
    2. Looking Like a Human
      1. Adjust Your Headers
      2. Handling Cookies with JavaScript
      3. Timing Is Everything
    3. Common Form Security Features
      1. Hidden Input Field Values
      2. Avoiding Honeypots
    4. The Human Checklist
  18. 15. Testing Your Website with Scrapers
    1. An Introduction to Testing
      1. What Are Unit Tests?
    2. Python unittest
      1. Testing Wikipedia
    3. Testing with Selenium
      1. Interacting with the Site
    4. unittest or Selenium?
  19. 16. Web Crawling in Parallel
    1. Processes versus Threads
    2. Multithreaded Crawling
      1. Race Conditions and Queues
      2. The threading Module
    3. Multiprocess Crawling
      1. Multiprocess Crawling
      2. Communicating Between Processes
    4. Multiprocess Crawling—Another Approach
  20. 17. Scraping Remotely
    1. Why Use Remote Servers?
      1. Avoiding IP Address Blocking
      2. Portability and Extensibility
    2. Tor
      1. PySocks
    3. Remote Hosting
      1. Running from a Website-Hosting Account
      2. Running from the Cloud
    4. Additional Resources
  21. 18. The Legalities and Ethics of Web Scraping
    1. Trademarks, Copyrights, Patents, Oh My!
      1. Copyright Law
    2. Trespass to Chattels
    3. The Computer Fraud and Abuse Act
    4. robots.txt and Terms of Service
    5. Three Web Scrapers
      1. eBay versus Bidder’s Edge and Trespass to Chattels
      2. United States v. Auernheimer and The Computer Fraud and Abuse Act
      3. Field v. Google: Copyright and robots.txt
    6. Moving Forward
  22. Index