Table of Contents for Applied Text Analysis with Python
Applied Text Analysis with Python
by Tony Ojeda
Published by O'Reilly Media, Inc., 2018
Cover
Preface
1. Language and Computation
2. Building a Custom Corpus
3. Corpus Preprocessing and Wrangling
4. Text Vectorization and Transformation Pipelines
5. Classification for Text Analysis
6. Clustering for Text Similarity
7. Context-Aware Text Analysis
8. Text Visualization
9. Graph Analysis of Text
10. Chatbots
11. Scaling Text Analytics with Multiprocessing and Spark
12. Deep Learning and Beyond
Glossary
Index
About the Authors
Colophon
Preface
Computational Challenges of Natural Language
Linguistic Data: Tokens and Words
Enter Machine Learning
Tools for Text Analysis
What to Expect from This Book
Who This Book Is For
Code Examples and GitHub Repository
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
1. Language and Computation
The Data Science Paradigm
Language-Aware Data Products
The Data Product Pipeline
Language as Data
A Computational Model of Language
Language Features
Contextual Features
Structural Features
Conclusion
2. Building a Custom Corpus
What Is a Corpus?
Domain-Specific Corpora
The Baleen Ingestion Engine
Corpus Data Management
Corpus Disk Structure
Corpus Readers
Streaming Data Access with NLTK
Reading an HTML Corpus
Reading a Corpus from a Database
Conclusion
3. Corpus Preprocessing and Wrangling
Breaking Down Documents
Identifying and Extracting Core Content
Deconstructing Documents into Paragraphs
Segmentation: Breaking Out Sentences
Tokenization: Identifying Individual Tokens
Part-of-Speech Tagging
Intermediate Corpus Analytics
Corpus Transformation
Intermediate Preprocessing and Storage
Reading the Processed Corpus
Conclusion
4. Text Vectorization and Transformation Pipelines
Words in Space
Frequency Vectors
One-Hot Encoding
Term Frequency–Inverse Document Frequency
Distributed Representation
The Scikit-Learn API
The BaseEstimator Interface
Extending TransformerMixin
Pipelines
Pipeline Basics
Grid Search for Hyperparameter Optimization
Enriching Feature Extraction with Feature Unions
Conclusion
5. Classification for Text Analysis
Text Classification
Identifying Classification Problems
Classifier Models
Building a Text Classification Application
Cross-Validation
Model Construction
Model Evaluation
Model Operationalization
Conclusion
6. Clustering for Text Similarity
Unsupervised Learning on Text
Clustering by Document Similarity
Distance Metrics
Partitive Clustering
Hierarchical Clustering
Modeling Document Topics
Latent Dirichlet Allocation
Latent Semantic Analysis
Non-Negative Matrix Factorization
Conclusion
7. Context-Aware Text Analysis
Grammar-Based Feature Extraction
Context-Free Grammars
Syntactic Parsers
Extracting Keyphrases
Extracting Entities
n-Gram Feature Extraction
An n-Gram-Aware CorpusReader
Choosing the Right n-Gram Window
Significant Collocations
n-Gram Language Models
Frequency and Conditional Frequency
Estimating Maximum Likelihood
Unknown Words: Back-off and Smoothing
Language Generation
Conclusion
8. Text Visualization
Visualizing Feature Space
Visual Feature Analysis
Guided Feature Engineering
Model Diagnostics
Visualizing Clusters
Visualizing Classes
Diagnosing Classification Error
Visual Steering
Silhouette Scores and Elbow Curves
Conclusion
9. Graph Analysis of Text
Graph Computation and Analysis
Creating a Graph-Based Thesaurus
Analyzing Graph Structure
Visual Analysis of Graphs
Extracting Graphs from Text
Creating a Social Graph
Insights from the Social Graph
Entity Resolution
Entity Resolution on a Graph
Blocking with Structure
Fuzzy Blocking
Conclusion
10. Chatbots
Fundamentals of Conversation
Dialog: A Brief Exchange
Maintaining a Conversation
Rules for Polite Conversation
Greetings and Salutations
Handling Miscommunication
Entertaining Questions
Dependency Parsing
Constituency Parsing
Question Detection
From Tablespoons to Grams
Learning to Help
Being Neighborly
Offering Recommendations
Conclusion
11. Scaling Text Analytics with Multiprocessing and Spark
Python Multiprocessing
Running Tasks in Parallel
Process Pools and Queues
Parallel Corpus Preprocessing
Cluster Computing with Spark
Anatomy of a Spark Job
Distributing the Corpus
RDD Operations
NLP with Spark
Conclusion
12. Deep Learning and Beyond
Applied Neural Networks
Neural Language Models
Artificial Neural Networks
Deep Learning Architectures
Sentiment Analysis
Deep Structure Analysis
The Future Is (Almost) Here
Glossary
Index