Index
B
- backoff, Unknown Words: Back-off and Smoothing-Unknown Words: Back-off and Smoothing
- backpropagation, Artificial Neural Networks
- bag-of-keyphrases, Predicting sentiment with a bag-of-keyphrases-Predicting sentiment with a bag-of-keyphrases
- bag-of-words (BOW), Contextual Features
- Baleen ingestion engine, The Baleen Ingestion Engine
- ball tree algorithm, Being Neighborly
- BaseEstimator interface (Scikit-Learn API), The BaseEstimator Interface
- betweenness centrality, Centrality-Centrality, Centrality, Glossary
- bias, defined, Glossary
- bias–variance trade-off, Cross-Validation
- bisecting k-means clustering, Text clustering with MLLib
- blocking
C
- canonicalization, Entity Resolution, Glossary
- centrality, Centrality-Centrality, Glossary
- chatbots, Language-Aware Data Products, Chatbots-Conclusion
- classification
- classification error, diagnosing, Diagnosing Classification Error-Confusion matrices
- classification heatmap, Classification report heatmaps-Classification report heatmaps, Glossary
- classification report, Model Evaluation, Classification report heatmaps-Classification report heatmaps, Glossary
- classifier models, Classifier Models
- closeness centrality, Centrality, Centrality, Glossary
- closure, Local fit, global evaluation
- cluster computing, with Spark, Cluster Computing with Spark-Local fit, global evaluation
- clustering
  - agglomerative, Agglomerative clustering-Agglomerative clustering
  - and model selection, Visualizing Clusters-Visualizing Clusters
  - by document similarity, Clustering by Document Similarity-Agglomerative clustering
  - defined, Glossary
  - distance metrics, Distance Metrics-Distance Metrics
  - efficient storage with JSON, Distributing the Corpus
  - for text similarity, Clustering for Text Similarity-Agglomerative clustering
  - hierarchical, Hierarchical Clustering-Agglomerative clustering
  - partitive, Partitive Clustering-Handling uneven geometries
  - text clustering with MLLib, Text clustering with MLLib-Text clustering with MLLib
  - unsupervised learning on text, Unsupervised Learning on Text-Unsupervised Learning on Text
  - visualizing, Visualizing Clusters-Visualizing Clusters
- clustering coefficient, Structural analysis
- co-occurrence plots, Co-occurrence plots-Co-occurrence plots
- collocation, Significant Collocations
- concurrency, parallelism vs., Python Multiprocessing
- conditional frequencies, Frequency and Conditional Frequency-Frequency and Conditional Frequency
- confidence score, Dialog: A Brief Exchange
- confusion matrix
- connectionist language model, Neural Language Models, Glossary
- constituency parsing, Constituency Parsing
- context-aware text analysis, Context-Aware Text Analysis-Conclusion
- context-free grammars, Context-Free Grammars
- contextual features of language, Contextual Features-Contextual Features
- continuous bag-of-words (CBOW), Distributed Representation, Glossary
- conversation
- convolutional neural networks (CNNs), Deep Learning Architectures
- corpus (corpora)
- corpus monitoring, Corpus monitoring
- corpus preprocessing and wrangling, Corpus Preprocessing and Wrangling-Conclusion
  - breaking down documents, Breaking Down Documents
  - deconstructing documents into paragraphs, What Is a Corpus?, Deconstructing Documents into Paragraphs-Deconstructing Documents into Paragraphs
  - intermediate corpus analytics, Intermediate Corpus Analytics-Intermediate Corpus Analytics
  - intermediate preprocessing and storage, Intermediate Preprocessing and Storage-Writing to pickle
  - parallel preprocessing, Parallel Corpus Preprocessing-Parallel Corpus Preprocessing
  - part-of-speech tagging, Part-of-Speech Tagging
  - pickle method, Writing to pickle
  - reading the processed corpus, Reading the Processed Corpus
  - segmentation, Segmentation: Breaking Out Sentences
  - tokenization, Tokenization: Identifying Individual Tokens
  - transformation, Corpus Transformation-Reading the Processed Corpus
- corpus readers, Corpus Readers-Reading a Corpus from a Database
- corpus transformation, Corpus Transformation-Reading the Processed Corpus
- cosine distance, Distance Metrics
- cross-validation
- custom corpora, building, Building a Custom Corpus-Conclusion
D
- data management
- data parallelism, Scaling Text Analytics with Multiprocessing and Spark, Process Pools and Queues
- data products
- data science, The Data Science Paradigm-The Data Science Paradigm
- data, language as, Language as Data-Structural Features
- database, reading a corpus from, Reading a Corpus from a Database
- deduplication, Glossary
- deep learning
- deep structure analysis, Deep Structure Analysis-Predicting sentiment with a bag-of-keyphrases
- degree, Analyzing Graph Structure, Glossary
- degree centrality, Centrality-Centrality, Glossary
- dendrogram plot, Hierarchical Clustering
- dependency parsers, Dependency Parsing-Dependency Parsing
- dialog, Dialog: A Brief Exchange-Dialog: A Brief Exchange
- dialog system, Fundamentals of Conversation, Glossary
- diameter (graph), Analyzing Graph Structure, Glossary
- directed acyclic graphs (DAGs), Process Pools and Queues
- discourse, defined, Glossary
- disk structure
- dispersion plots, Text x-rays and dispersion plots-Text x-rays and dispersion plots
- distance metrics, Distance Metrics-Distance Metrics
- distributed computation, Cluster Computing with Spark
- distributed data storage, Cluster Computing with Spark
- distributed representation
- divisive clustering, Hierarchical Clustering, Glossary
- doc2vec algorithm, Distributed Representation, Glossary
- documents
- domain-specific corpora, Domain-Specific Corpora
- dropout layer, Predicting sentiment with a bag-of-keyphrases, Glossary
E
- edge, defined, Graph Computation and Analysis, Glossary
- edit distance, Distance Metrics
- eigenvector centrality, Centrality, Glossary
- elbow curves, Elbow curves, Glossary
- entities
- entity pairs, finding, Finding entity pairs
- entity resolution (ER), Entity Resolution-Fuzzy Blocking
- entropy, A Computational Model of Language, Glossary
- estimator, The BaseEstimator Interface, Glossary
- Euclidean distance, Distance Metrics
F
- F1 score, Model Evaluation-Model Evaluation, Glossary
- feature analysis
- feature extraction, Glossary
- feature space visualization, Visualizing Feature Space-Most informative features
- feature unions, Enriching Feature Extraction with Feature Unions-Enriching Feature Extraction with Feature Unions, Glossary
- features
- feedforward network, Artificial Neural Networks
- forking, Python Multiprocessing
- frequency distribution, Glossary
- frequency vectors, Frequency Vectors-The Gensim way
- frequency, in n-gram modeling, Frequency and Conditional Frequency-Frequency and Conditional Frequency
- fuzzy blocking, Fuzzy Blocking-Fuzzy Blocking
G
- generalizable model, Cross-Validation, Glossary
- Gensim
- GensimVectorizer transformer, Creating a custom Gensim vectorization transformer
- grammar, defined, Glossary
- grammar-based feature extraction, Grammar-Based Feature Extraction-Extracting Entities
- graph analysis of text, Graph Analysis of Text-Conclusion
  - analyzing graph structure, Analyzing Graph Structure
  - creating a graph-based thesaurus, Creating a Graph-Based Thesaurus
  - creating a social graph, Creating a Social Graph-Implementing the graph extraction
  - defined, Glossary
  - entity resolution, Entity Resolution-Fuzzy Blocking
  - extracting graphs from text, Extracting Graphs from Text-Structural analysis
  - graph computation/analysis, Graph Computation and Analysis
  - insights from social graph, Insights from the Social Graph-Structural analysis
  - visual analysis of graphs, Visual Analysis of Graphs
  - workflow, Extracting Graphs from Text
- graph, defined, Glossary
- Graph-tool, Graph Analysis of Text
- GraphExtractor class, Implementing the graph extraction
- GridSearch, Grid Search for Hyperparameter Optimization
- guided feature engineering, Guided Feature Engineering-Most informative features
K
- k splits, streaming access to, Streaming access to k splits
- k-fold cross-validation, Cross-Validation, Glossary
- k-means clustering, k-means clustering-Handling uneven geometries
- Keras API, Keras: An API for deep learning-Keras: An API for deep learning
- keyphrases, extracting, Extracting Keyphrases
- kitchen measurement conversion system, From Tablespoons to Grams-From Tablespoons to Grams
- Kneser–Ney smoothing, Unknown Words: Back-off and Smoothing-Unknown Words: Back-off and Smoothing
L
- language
- language model, defined, Glossary
- language-aware data products, Language-Aware Data Products-The model selection triple
- latent Dirichlet allocation (LDA), Latent Dirichlet Allocation-Visualizing topics
- latent semantic analysis (LSA), Latent Semantic Analysis-The Gensim way
- lemmatization, Creating a custom text normalization transformer
- lexical units, What Is a Corpus?
- lexicon, Glossary
- linguistic features, Language Features-Language Features
- link, Glossary
- logging, Running Tasks in Parallel
- long short-term memory (LSTM) networks, Deep Learning Architectures, Predicting sentiment with a bag-of-keyphrases
- long tail distribution
M
- machine learning
- Mahalanobis distance, Distance Metrics
- Manhattan distance, Distance Metrics
- MapReduce, Process Pools and Queues
- Minkowski distance, Distance Metrics
- MLLib
- model diagnostics
- model operationalization, Model Operationalization
- model selection triple workflow, The model selection triple, Glossary
- morphology, Structural Features, Glossary
- multilayer perceptron, Training a multilayer perceptron-Training a multilayer perceptron
- multiprocessing
N
- n-gram, defined, Glossary
- n-gram analysis, Contextual Features
- n-gram feature extraction, n-Gram Feature Extraction-Significant Collocations
- n-gram language models, n-Gram Language Models-Language Generation
- n-gram viewer, n-gram viewer
- Naive Bayes, Classifier Models
- natural language
- natural language processing (NLP)
  - defined, Glossary
  - feature extraction for, Feature extraction-Feature extraction
  - Spark MLLib and, From Scikit-Learn to MLLib-From Scikit-Learn to MLLib
  - Spark operations, NLP with Spark-Local fit, global evaluation
  - speeding up, Local fit, global evaluation-Local fit, global evaluation
  - text classification with MLLib, Text classification with MLLib-Text classification with MLLib
  - text clustering with MLLib, Text clustering with MLLib-Text clustering with MLLib
- natural language tool kit (NLTK)
- natural language understanding, Glossary
- neighborhood (graphs), Analyzing Graph Structure, Glossary
- network visualization, Network visualization-Network visualization
- network, defined, Glossary
- NetworkX, Tools for Text Analysis, Graph Analysis of Text
- neural language models, Neural Language Models-Keras: An API for deep learning
- neural networks, Deep Learning and Beyond-The Future Is (Almost) Here
- nodes, Graph Computation and Analysis, Glossary
- non-negative matrix factorization (NNMF), Non-Negative Matrix Factorization
P
- paragraph vector
- paragraphs, deconstructing documents into, What Is a Corpus?, Deconstructing Documents into Paragraphs-Deconstructing Documents into Paragraphs
- parallelism, Glossary
- parameters, defined, From Scikit-Learn to MLLib
- parsing, defined, Glossary
- part-of-speech tagging, Part-of-Speech Tagging, Part-of-speech tagging-Part-of-speech tagging, Glossary
- partitive clustering, Partitive Clustering-Handling uneven geometries
- perceptron, multilayer, Training a multilayer perceptron-Training a multilayer perceptron
- perplexity, A Computational Model of Language, Estimating Maximum Likelihood, Glossary
- pickle
- pipelines, Pipelines-Enriching Feature Extraction with Feature Unions
- precision, defined, Model Evaluation, Glossary
- principal component analysis (PCA), Creating a custom text normalization transformer, Glossary
- process pools, Process Pools and Queues-Process Pools and Queues
- property graph model, Property graphs, Glossary
S
- scale-free networks, Structural analysis
- scaling text analytics, Scaling Text Analytics with Multiprocessing and Spark-Conclusion
- Scikit-Learn
- Scikit-Learn API, The Scikit-Learn API-Creating a custom text normalization transformer, Extending TransformerMixin-Creating a custom text normalization transformer
- scraping, defined, Glossary
- segmentation, Segmentation: Breaking Out Sentences, Glossary
- semantic analysis, Structural Features
- semantics, Structural Features, Glossary
- semi-structured data, Language as Data
- sentence boundaries, defined, Glossary
- sentences, What Is a Corpus?, Segmentation: Breaking Out Sentences
- sentiment analysis, Contextual Features
- separability, Cross-Validation
- Shannon–Weaver model, Fundamentals of Conversation
- shortest path, defined, Glossary
- significant collocations, Significant Collocations
- silhouette coefficient, Silhouette scores, Glossary
- silhouette score, Glossary
- singular value decomposition (SVD)
- size (graphs), Analyzing Graph Structure, Glossary
- small world phenomenon, Structural analysis
- smoothing, Unknown Words: Back-off and Smoothing-Unknown Words: Back-off and Smoothing
- social graphs
- spaCy, Tools for Text Analysis
- Spark
  - about, Anatomy of a Spark Job
  - client mode vs. cluster mode, Anatomy of a Spark Job
  - cluster computing with, Cluster Computing with Spark-Local fit, global evaluation
  - distributing corpus, Distributing the Corpus-RDD Operations
  - feature extraction for NLP, Feature extraction-Feature extraction
  - MLLib, From Scikit-Learn to MLLib-From Scikit-Learn to MLLib
  - NLP with, NLP with Spark-Local fit, global evaluation
  - RDD operations, RDD Operations-RDD Operations
  - speeding up NLP with, Local fit, global evaluation-Local fit, global evaluation
  - text classification with MLLib, Text classification with MLLib-Text classification with MLLib
  - text clustering with MLLib, Text clustering with MLLib-Text clustering with MLLib
- spawning, Python Multiprocessing
- speech data, Language-Aware Data Products
- SQLite database, reading a corpus from, Reading a Corpus from a Database
- steering, Visual Steering-Elbow curves, Glossary
  - (see also visual steering)
- stemming, Creating a custom text normalization transformer
- stopwords
- structural analysis, Structural analysis-Structural analysis
- structured data, Language as Data
- supervised learning, classification as, Text Classification
- support, in classification model evaluation, Model Evaluation
- symbolic language model, Glossary
- synsets, Creating a Graph-Based Thesaurus, Glossary
- syntactic analysis, Structural Features
- syntactic parsers, Syntactic Parsers
- syntax, Structural Features, Glossary
T
- t-distributed stochastic neighbor embedding (t-SNE)
- tagging, part-of-speech, Part-of-Speech Tagging, Part-of-speech tagging-Part-of-speech tagging
- task parallelism, Scaling Text Analytics with Multiprocessing and Spark
- TensorFlow, TensorFlow: A framework for deep learning
- term frequency-inverse document frequency (TF–IDF)
- text analysis
- text classification, Classification for Text Analysis-Conclusion
  - about, Text Classification-Classifier Models
  - building a text classification application, Building a Text Classification Application-Model Operationalization
  - building an application for, Building a Text Classification Application-Model Operationalization
  - classifier models, Classifier Models
  - cross-validation, Cross-Validation-Streaming access to k splits
  - identifying classification problems, Identifying Classification Problems-Identifying Classification Problems
  - model construction, Model Construction-Model Construction
  - model evaluation, Model Evaluation-Model Evaluation
  - model operationalization, Model Operationalization
  - visualizing classes, Visualizing Classes
  - with MLLib, Text classification with MLLib-Text classification with MLLib
- text meaning representations (TMRs), Graph Analysis of Text
- text normalization transformer, Creating a custom text normalization transformer-Creating a custom text normalization transformer
- text vectorization, Text Vectorization and Transformation Pipelines-The Gensim way
- text visualization, Text Visualization-Conclusion
- TF–IDF distance, Distance Metrics
- thematic meaning representations (TMRs), Structural Features
- thesaurus, graph-based, Creating a Graph-Based Thesaurus
- thread, Python Multiprocessing
- tokenization, Tokenization: Identifying Individual Tokens, Glossary
- tokens
- topic modeling, Modeling Document Topics-In Scikit-Learn
- training and test splits, Glossary
- transformations
- transformer, defined, Extending TransformerMixin, Glossary
- transitivity, Structural analysis, Glossary
- traversal, defined, Glossary
- tweets, Corpus Disk Structure
Z
- Zipfian (long tail) distribution