Table of Contents for
Data Wrangling with Python
Data Wrangling with Python
by Katharine Jarmul
Published by O'Reilly Media, Inc., 2016
Cover
Praise for Data Wrangling with Python
Data Wrangling with Python
Preface
1. Introduction to Python
2. Python Basics
3. Data Meant to Be Read by Machines
4. Working with Excel Files
5. PDFs and Problem Solving in Python
6. Acquiring and Storing Data
7. Data Cleanup: Investigation, Matching, and Formatting
8. Data Cleanup: Standardizing and Scripting
9. Data Exploration and Analysis
10. Presenting Your Data
11. Web Scraping: Acquiring and Storing Data from the Web
12. Advanced Web Scraping: Screen Scrapers and Spiders
13. APIs
14. Automation and Scaling
15. Conclusion
A. Comparison of Languages Mentioned
B. Python Resources for Beginners
C. Learning the Command Line
D. Advanced Python Setup
E. Python Gotchas
F. IPython Hints
G. Using Amazon Web Services
Index
About the Authors
Colophon
Preface
Who Should Read This Book
Who Should Not Read This Book
How This Book Is Organized
What Is Data Wrangling?
What to Do If You Get Stuck
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
1. Introduction to Python
Why Python
Getting Started with Python
Which Python Version
Setting Up Python on Your Machine
Test Driving Python
Install pip
Install a Code Editor
Optional: Install IPython
Summary
2. Python Basics
Basic Data Types
Strings
Integers and Floats
Data Containers
Variables
Lists
Dictionaries
What Can the Various Data Types Do?
String Methods: Things Strings Can Do
Numerical Methods: Things Numbers Can Do
List Methods: Things Lists Can Do
Dictionary Methods: Things Dictionaries Can Do
Helpful Tools: type, dir, and help
type
dir
help
Putting It All Together
What Does It All Mean?
Summary
3. Data Meant to Be Read by Machines
CSV Data
How to Import CSV Data
Saving the Code to a File; Running from Command Line
JSON Data
How to Import JSON Data
XML Data
How to Import XML Data
Summary
4. Working with Excel Files
Installing Python Packages
Parsing Excel Files
Getting Started with Parsing
Summary
5. PDFs and Problem Solving in Python
Avoid Using PDFs!
Programmatic Approaches to PDF Parsing
Opening and Reading Using slate
Converting PDF to Text
Parsing PDFs Using pdfminer
Learning How to Solve Problems
Exercise: Use Table Extraction, Try a Different Library
Exercise: Clean the Data Manually
Exercise: Try Another Tool
Uncommon File Types
Summary
6. Acquiring and Storing Data
Not All Data Is Created Equal
Fact Checking
Readability, Cleanliness, and Longevity
Where to Find Data
Using a Telephone
US Government Data
Government and Civic Open Data Worldwide
Organization and Non-Government Organization (NGO) Data
Education and University Data
Medical and Scientific Data
Crowdsourced Data and APIs
Case Studies: Example Data Investigation
Ebola Crisis
Train Safety
Football Salaries
Child Labor
Storing Your Data: When, Why, and How?
Databases: A Brief Introduction
Relational Databases: MySQL and PostgreSQL
Non-Relational Databases: NoSQL
Setting Up Your Local Database with Python
When to Use a Simple File
Cloud-Storage and Python
Local Storage and Python
Alternative Data Storage
Summary
7. Data Cleanup: Investigation, Matching, and Formatting
Why Clean Data?
Data Cleanup Basics
Identifying Values for Data Cleanup
Formatting Data
Finding Outliers and Bad Data
Finding Duplicates
Fuzzy Matching
RegEx Matching
What to Do with Duplicate Records
Summary
8. Data Cleanup: Standardizing and Scripting
Normalizing and Standardizing Your Data
Saving Your Data
Determining What Data Cleanup Is Right for Your Project
Scripting Your Cleanup
Testing with New Data
Summary
9. Data Exploration and Analysis
Exploring Your Data
Importing Data
Exploring Table Functions
Joining Numerous Datasets
Identifying Correlations
Identifying Outliers
Creating Groupings
Further Exploration
Analyzing Your Data
Separating and Focusing Your Data
What Is Your Data Saying?
Drawing Conclusions
Documenting Your Conclusions
Summary
10. Presenting Your Data
Avoiding Storytelling Pitfalls
How Will You Tell the Story?
Know Your Audience
Visualizing Your Data
Charts
Time-Related Data
Maps
Interactives
Words
Images, Video, and Illustrations
Presentation Tools
Publishing Your Data
Using Available Sites
Open Source Platforms: Starting a New Site
Jupyter (Formerly Known as IPython Notebooks)
Summary
11. Web Scraping: Acquiring and Storing Data from the Web
What to Scrape and How
Analyzing a Web Page
Inspection: Markup Structure
Network/Timeline: How the Page Loads
Console: Interacting with JavaScript
In-Depth Analysis of a Page
Getting Pages: How to Request on the Internet
Reading a Web Page with Beautiful Soup
Reading a Web Page with LXML
A Case for XPath
Summary
12. Advanced Web Scraping: Screen Scrapers and Spiders
Browser-Based Parsing
Screen Reading with Selenium
Screen Reading with Ghost.Py
Spidering the Web
Building a Spider with Scrapy
Crawling Whole Websites with Scrapy
Networks: How the Internet Works and Why It’s Breaking Your Script
The Changing Web (or Why Your Script Broke)
A (Few) Word(s) of Caution
Summary
13. APIs
API Features
REST Versus Streaming APIs
Rate Limits
Tiered Data Volumes
API Keys and Tokens
A Simple Data Pull from Twitter’s REST API
Advanced Data Collection from Twitter’s REST API
Advanced Data Collection from Twitter’s Streaming API
Summary
14. Automation and Scaling
Why Automate?
Steps to Automate
What Could Go Wrong?
Where to Automate
Special Tools for Automation
Using Local Files, argv, and Config Files
Using the Cloud for Data Processing
Using Parallel Processing
Using Distributed Processing
Simple Automation
CronJobs
Web Interfaces
Jupyter Notebooks
Large-Scale Automation
Celery: Queue-Based Automation
Ansible: Operations Automation
Monitoring Your Automation
Python Logging
Adding Automated Messaging
Uploading and Other Reporting
Logging and Monitoring as a Service
No System Is Foolproof
Summary
15. Conclusion
Duties of a Data Wrangler
Beyond Data Wrangling
Become a Better Data Analyst
Become a Better Developer
Become a Better Visual Storyteller
Become a Better Systems Architect
Where Do You Go from Here?
A. Comparison of Languages Mentioned
C, C++, and Java Versus Python
R or MATLAB Versus Python
HTML Versus Python
JavaScript Versus Python
Node.js Versus Python
Ruby and Ruby on Rails Versus Python
B. Python Resources for Beginners
Online Resources
In-Person Groups
C. Learning the Command Line
Bash
Navigation
Modifying Files
Executing Files
Searching with the Command Line
More Resources
Windows CMD/PowerShell
Navigation
Modifying Files
Executing Files
Searching with the Command Line
More Resources
D. Advanced Python Setup
Step 1: Install GCC
Step 2: (Mac Only) Install Homebrew
Step 3: (Mac Only) Tell Your System Where to Find Homebrew
Step 4: Install Python 2.7
Step 5: Install virtualenv (Windows, Mac, Linux)
Step 6: Set Up a New Directory
Step 7: Install virtualenvwrapper
Installing virtualenvwrapper (Mac and Linux)
Installing virtualenvwrapper-win (Windows)
Testing Your Virtual Environment (Windows, Mac, Linux)
Learning About Our New Environment (Windows, Mac, Linux)
Advanced Setup Review
E. Python Gotchas
Hail the Whitespace
The Dreaded GIL
= Versus == Versus is, and When to Just Copy
Default Function Arguments
Python Scope and Built-Ins: The Importance of Variable Names
Defining Objects Versus Modifying Objects
Changing Immutable Objects
Type Checking
Catching Multiple Exceptions
The Power of Debugging
F. IPython Hints
Why Use IPython?
Getting Started with IPython
Magic Functions
Final Thoughts: A Simpler Terminal
G. Using Amazon Web Services
Spinning Up an AWS Server
AWS Step 1: Choose an Amazon Machine Image (AMI)
AWS Step 2: Choose an Instance Type
AWS Step 7: Review Instance Launch
AWS Extra Question: Select an Existing Key Pair or Create a New One
Logging into an AWS Server
Get the Public DNS Name of the Instance
Prepare Your Private Key
Log into Your Server
Summary
Index