Table of Contents for
Data Analysis with Open Source Tools
Data Analysis with Open Source Tools
by Philipp K. Janert
Published by O'Reilly Media, Inc., 2010
Data Analysis with Open Source Tools
Table of Contents
Dedication
A Note Regarding Supplemental Files
Preface
Before We Begin
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
1. Introduction
Data Analysis
What’s in This Book
What’s with the Workshops?
What’s with the Math?
What You’ll Need
What’s Missing
I. Graphics: Looking at Data
2. A Single Variable: Shape and Distribution
Dot and Jitter Plots
Histograms and Kernel Density Estimates
Histograms
Kernel Density Estimates
Optional: Optimal Bandwidth Selection
The Cumulative Distribution Function
Optional: Comparing Distributions with Probability Plots and QQ Plots
Rank-Order Plots and Lift Charts
Only When Appropriate: Summary Statistics and Box Plots
Summary Statistics
Box-and-Whisker Plots
Workshop: NumPy
NumPy in Action
NumPy in Detail
Further Reading
3. Two Variables: Establishing Relationships
Scatter Plots
Conquering Noise: Smoothing
Splines
LOESS
Examples
Residuals
Additional Ideas and Warnings
Logarithmic Plots
Banking
Linear Regression and All That
Showing What’s Important
Graphical Analysis and Presentation Graphics
Workshop: matplotlib
Using matplotlib Interactively
Case Study: LOESS with matplotlib
Managing Properties
The matplotlib Object Model and Architecture
Odds and Ends
Further Reading
4. Time As a Variable: Time-Series Analysis
Examples
The Task
Requirements and the Real World
Smoothing
Running Averages
Exponential Smoothing
Don’t Overlook the Obvious!
The Correlation Function
Examples
Implementation Issues
Optional: Filters and Convolutions
Workshop: scipy.signal
Further Reading
5. More Than Two Variables: Graphical Multivariate Analysis
False-Color Plots
A Lot at a Glance: Multiplots
The Scatter-Plot Matrix
The Co-Plot
Variations
Composition Problems
Changes in Composition
Multidimensional Composition: Tree and Mosaic Plots
Novel Plot Types
Glyphs
Parallel Coordinate Plots
Interactive Explorations
Querying and Zooming
Linking and Brushing
Grand Tours and Projection Pursuits
Tools
Workshop: Tools for Multivariate Graphics
R
Experimental Tools
Python Chaco Library
Further Reading
6. Intermezzo: A Data Analysis Session
A Data Analysis Session
Workshop: gnuplot
Further Reading
II. Analytics: Modeling Data
7. Guesstimation and the Back of the Envelope
Principles of Guesstimation
Estimating Sizes
Establishing Relationships
Working with Numbers
Powers of ten
Small perturbations
Logarithms
More Examples
Things I Know
How Good Are Those Numbers?
Before You Get Started: Feasibility and Cost
After You Finish: Quoting and Displaying Numbers
Optional: A Closer Look at Perturbation Theory and Error Propagation
Error Propagation
Workshop: The Gnu Scientific Library (GSL)
Further Reading
8. Models from Scaling Arguments
Models
Modeling
Using and Misusing Models
Arguments from Scale
Scaling Arguments
Example: A Dimensional Argument
Example: An Optimization Problem
Example: A Cost Model
Optional: Scaling Arguments Versus Dimensional Analysis
Other Arguments
Mean-Field Approximations
Background and Further Examples
Common Time-Evolution Scenarios
Unconstrained Growth and Decay Phenomena
Constrained Growth: The Logistic Equation
Oscillations
Case Study: How Many Servers Are Best?
Why Modeling?
Workshop: Sage
Further Reading
9. Arguments from Probability Models
The Binomial Distribution and Bernoulli Trials
Exact Results
Using Bernoulli Trials to Develop Mean-Field Models
The Gaussian Distribution and the Central Limit Theorem
The Central Limit Theorem
The Central Term and the Tails
Why Is the Gaussian so Useful?
Optional: Gaussian Integrals
Beware: The World Is Not Normal!
Power-Law Distributions and Non-Normal Statistics
Working with Power-Law Distributions
Optional: Distributions with Infinite Expectation Values
Where to Go from Here
Other Distributions
Geometric Distribution
Poisson Distribution
Log-Normal Distribution
Special-Purpose Distributions
Optional: Case Study—Unique Visitors over Time
Workshop: Power-Law Distributions
Further Reading
10. What You Really Need to Know About Classical Statistics
Genesis
Statistics Defined
Statistics Explained
Example: Formal Tests Versus Graphical Methods
Controlled Experiments Versus Observational Studies
Design of Experiments
Perspective
Optional: Bayesian Statistics—The Other Point of View
The Frequentist Interpretation of Probability
The Bayesian Interpretation of Probability
Bayesian Data Analysis: A Worked Example
Bayesian Inference: Summary and Discussion
Workshop: R
Further Reading
11. Intermezzo: Mythbusting—Bigfoot, Least Squares, and All That
How to Average Averages
Simpson’s Paradox
The Standard Deviation
How to Calculate
Optional: One over What?
Optional: The Standard Error
Least Squares
Statistical Parameter Estimation
Function Approximation
Further Reading
III. Computation: Mining Data
12. Simulations
A Warm-Up Question
Monte Carlo Simulations
Combinatorial Problems
Obtaining Outcome Distributions
Pro and Con
Resampling Methods
The Bootstrap
When Does Bootstrapping Work?
Bootstrap Variants
Workshop: Discrete Event Simulations with SimPy
Introducing SimPy
The Simplest Queueing Process
Optional: Queueing Theory
Running SimPy Simulations
Summary
Further Reading
13. Finding Clusters
What Constitutes a Cluster?
A Different Point of View
Distance and Similarity Measures
Common Distance and Similarity Measures
Numerical data
Categorical data
String data
Special-purpose metrics
Clustering Methods
Center Seekers
Tree Builders
Neighborhood Growers
Pre- and Postprocessing
Scale Normalization
Cluster Properties and Evaluation
Other Thoughts
A Special Case: Market Basket Analysis
A Word of Warning
Workshop: Pycluster and the C Clustering Library
Further Reading
14. Seeing the Forest for the Trees: Finding Important Attributes
Principal Component Analysis
Motivation
Optional: Theory
Interpretation
Computation
Practical Points
Biplots
Visual Techniques
Multidimensional Scaling
Network Graphs
Kohonen Maps
Workshop: PCA with R
Further Reading
Linear Algebra
15. Intermezzo: When More Is Different
A Horror Story
Some Suggestions
What About Map/Reduce?
Workshop: Generating Permutations
Further Reading
IV. Applications: Using Data
16. Reporting, Business Intelligence, and Dashboards
Business Intelligence
Reporting
Corporate Metrics and Dashboards
Recommendations for a Metrics Program
Data Quality Issues
Data Availability
Data Consistency
Workshop: Berkeley DB and SQLite
Berkeley DB
SQLite
Further Reading
17. Financial Calculations and Modeling
The Time Value of Money
A Single Payment: Future and Present Value
Multiple Payments: Compounding
Calculational Tricks with Compounding
The Whole Picture: Cash-Flow Analysis and Net Present Value
Uncertainty in Planning and Opportunity Costs
Using Expectation Values to Account for Uncertainty
Opportunity Costs
Cost Concepts and Depreciation
Direct and Indirect Costs
Fixed and Variable Costs
Capital Expenditure and Operating Cost
Should You Care?
Is This All That Matters?
Workshop: The Newsvendor Problem
Optional: Exact Solution
Further Reading
The Newsvendor Problem
18. Predictive Analytics
Topics in Predictive Analytics
Some Classification Terminology
Algorithms for Classification
Instance-Based Classifiers and Nearest-Neighbor Methods
Bayesian Classifiers
Regression
Support Vector Machines
Decision Trees and Rule-Based Classifiers
Other Classifiers
The Process
Ensemble Methods: Bagging and Boosting
Estimating Prediction Error
Class Imbalance Problems
The Secret Sauce
The Nature of Statistical Learning
Workshop: Two Do-It-Yourself Classifiers
Further Reading
19. Epilogue: Facts Are Not Reality
A. Programming Environments for Scientific Computation and Data Analysis
Software Tools
Scientific Software Is Different
A Catalog of Scientific Software
Matlab
R
Python
NumPy/SciPy
What About Java?
Other Players
Recommendations
Writing Your Own
Further Reading
Matlab
R
NumPy/SciPy
B. Results from Calculus
Common Functions
Powers
Polynomials and Rational Functions
Exponential Function and Logarithm
Trigonometric Functions
Gaussian Function and the Normal Distribution
Other Functions
The Inverse of a Function
Calculus
Derivatives
Finding Minima and Maxima
Integrals
Limits, Sequences, and Series
Power Series and Taylor Expansion
Useful Tricks
The Binomial Theorem
The Linear Transformation
Dividing by Zero
Notation and Basic Math
On Reading Formulas
Elementary Algebra
Working with Fractions
Sets, Sequences, and Series
Special Symbols
Binary relationships
Parentheses and other delimiters
Miscellaneous symbols
The Greek Alphabet
Where to Go from Here
On Math
Further Reading
Calculus
Linear Algebra
Complex Analysis
Mindbenders
C. Working with Data
Sources for Data
Cleaning and Conditioning
Sampling
Data File Formats
The Care and Feeding of Your Data Zoo
Skills
Terminology
Types of Data
The Data Type Depends on the Semantics
Types of Data Sets
Further Reading
Data Set Repositories
D. About the Author
Index
About the Author
Colophon
Copyright