Agglomerative clustering is a type of hierarchical clustering that produces clusters starting with single instances that are iteratively aggregated by similarity until all belong to a single group.
An application programming interface formally defines how software components communicate. A data API might provide users with a systematic way to read or fetch information from the internet. The Scikit-Learn API exposes generalized access to machine learning algorithms implemented via class inheritance.
Bag-of-words is a method of encoding text, such that every document from the corpus is transformed into a vector whose length is equal to the size of the corpus vocabulary. The primary insight of a bag-of-words representation is that meaning and similarity are encoded in vocabulary.
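For example, a minimal bag-of-words sketch using Scikit-Learn's CountVectorizer on a hypothetical two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sneeze!",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Each row is a document vector; its length equals the size of the corpus vocabulary.
print(X.toarray())
```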
Baleen is an open source automated ingestion service for blogs to construct a corpus for natural language processing research.
Given a node N in a graph G, the betweenness centrality indicates how connected G is as a result of N. Betweenness centrality is computed as the ratio of the shortest paths in G that include N to the total number of shortest paths in G.
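As an illustration, betweenness centrality can be computed with NetworkX on a toy graph (the edges here are hypothetical):

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")])

# For each node: ratio of shortest paths that pass through it to all shortest paths.
print(nx.betweenness_centrality(G))
```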
Bias is one of two sources of error in supervised learning problems, computed as the difference between an estimator’s predicted value and the true value. High bias indicates that the estimator’s predictions deviate from the correct answers by a significant amount.
Canonicalization is one of three primary tasks involved in entity resolution, which entails converting data with more than one possible representation into a standard form.
In a network graph, centrality is a measure of the relative importance of a node. Important nodes are connected directly or indirectly to the most nodes and thus have higher centrality.
A chatbot is a program that participates in turn-taking conversations and whose aim is to interpret input text or speech and to output appropriate, useful responses.
Classification is a type of supervised machine learning that attempts to learn patterns between instances composed of independent variables and their relationship to a given categorical target variable. A classifier can be trained to minimize error between predicted and actual categories in the training data, and once fit, can be deployed to assign categorical labels to new instances based on the patterns detected during training.
The classification report shows a representation of the main classification metrics (precision, recall, and F1 score) on a per-class basis.
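A minimal sketch using Scikit-Learn's classification_report with hypothetical true and predicted labels:

```python
from sklearn.metrics import classification_report

y_true = ["ham", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "spam", "spam"]

# Precision, recall, and F1 reported per class.
print(classification_report(y_true, y_pred))
```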
Closeness centrality computes the average path distance from a node N in a graph G to all other nodes, normalized by the size of the graph. Closeness centrality describes how fast information originating at N will spread throughout G.
Unsupervised learning or clustering is a way of discovering hidden structure in unlabeled data. Clustering algorithms aim to discover latent patterns in unlabeled data using features to organize instances into meaningfully dissimilar groups.
A confusion matrix is one method for evaluating the accuracy of a classifier. After the classifier has been fit, a confusion matrix is a report of how individual test values for each of the predicted classes compare to their actual classes.
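A minimal sketch using Scikit-Learn's confusion_matrix on the same kind of hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

y_true = ["ham", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "spam", "spam"]

# Rows correspond to actual classes, columns to predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["ham", "spam"]))
```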
A connectionist model of language argues that units of language interact with each other in meaningful ways that are not necessarily encoded by sequential context, but can be learned with a neural network approach.
A corpus is a collection of related documents or utterances that contain natural language.
A corpus reader is a programmatic interface to read, seek, stream, and filter documents, and furthermore to expose data wrangling techniques like encoding and preprocessing for code that requires access to data within a corpus.
Cross-validation, or k-fold cross-validation, is the process of independently fitting a supervised learning model on k slices (training and test splits) of a dataset, which allows us to compare models and estimate in advance which will be most performant with unseen data. Cross-validation helps to balance the bias/variance trade-off.
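A 5-fold cross-validation sketch with Scikit-Learn, assuming a vectorized feature matrix X and label vector y already exist:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Fit and score the model independently on each of the 5 train/test splits.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())
```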
Data products are software applications that derive value from data and in turn generate new data.
Deduplication is one of three primary tasks involved in entity resolution that entails eliminating duplicate (exact or virtual) copies of repeated data.
Deep learning broadly describes the large family of neural network architectures that contain multiple, interacting hidden layers.
The degree of a node N of a graph G is the number of edges of G that touch N.
Degree centrality measures the neighborhood size (degree) of each node in a graph G and normalizes by the total number of nodes in G.
In the context of a chatbot, a dialog system is an internal component that interprets input, maintains internal state, and produces responses.
The diameter of a graph G is the length (in edges) of the shortest path between the two most distant nodes of G.
Discourse is written or formally spoken communication and is generally more structured than informal written or spoken communication.
A distributed representation is a method of encoding text along a continuous scale. This means that the resulting document vector is not a simple mapping from token position to token score, but instead a feature space embedded to represent word similarity.
Divisive clustering is a type of hierarchical clustering that produces clusters by gradually dividing data, beginning with a cluster containing all instances and finishing with clusters containing single instances.
Doc2vec (an extension of word2vec) is an unsupervised algorithm that learns fixed-length feature representations from variable length documents.
In the context of text analytics, a document is a single instance of discourse. Corpora are composed of many documents.
In the context of neural networks, a dropout layer is designed to help prevent overfitting by randomly setting a fraction of the input units to 0 during each training cycle.
An edge E between nodes N and V in a graph G represents a connection between N and V.
Eigenvector centrality measures the centrality of a node N in a graph G by the degree of the nodes to which N is connected. Even if N has a small number of neighbors, if those neighbors have a very high degree, N may outrank some of its neighbors in eigenvector centrality. Eigenvector centrality is the basis of several variants such as Katz centrality and PageRank.
The elbow method visualizes multiple k-means clustering models with different values for k. Model selection is based on whether or not there is an “elbow” in the curve: if the curve looks like an arm with a clear change in angle from one part of the curve to another, that bend (the “elbow”) marks the optimal value for k.
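A sketch of the elbow method with Scikit-Learn's KMeans, assuming a document matrix X; the within-cluster sum of squares (inertia) is plotted for a range of k values:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(2, 10)
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()  # look for the "elbow" where the slope of the curve changes sharply
```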
An entity is a unique thing (e.g., person, organization, product) with a set of attributes that describe it (e.g., name, address, shape, title, price, etc.). An entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites.
Entity resolution is the task of disambiguating records that correspond to real-world entities across and within datasets.
Entropy measures the uncertainty or surprisal of a language model’s probability distribution.
In the context of the Scikit-Learn API, an estimator is any object that can learn from data. For instance, an estimator can be a machine learning model form, a vectorizer, or a transformer.
The F1 score is a weighted harmonic mean of precision and recall, where the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.
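A minimal illustration of F1 as the harmonic mean of precision and recall:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * (precision * recall) / (precision + recall)

print(f1(0.75, 0.60))  # ~0.667
```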
In machine learning, data is represented as a numeric feature space, where each property of the vector representation is a feature.
In the context of text analytics pipelines, feature extraction is the process of transforming documents into vector representations such that machine learning methods can be applied.
A feature union allows multiple data transformations to be performed independently and then concatenated into a composite vector.
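A minimal FeatureUnion sketch with Scikit-Learn: two vectorizers run independently and their outputs are concatenated into one composite vector (the corpus is an assumed list of documents):

```python
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

union = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("bigrams", CountVectorizer(ngram_range=(2, 2))),
])

# Each document vector is the concatenation of its TF-IDF and bigram-count features.
X = union.fit_transform(corpus)
```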
A frequency distribution displays the relative frequency of outcomes (e.g., tokens, keyphrases, entities) in a given sample.
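For example, NLTK's FreqDist builds a frequency distribution over a list of tokens (the token list is assumed):

```python
from nltk import FreqDist

fdist = FreqDist(tokens)
print(fdist.most_common(10))  # the ten most frequent outcomes in the sample
```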
A generalizable model balances bias and variance to make meaningful predictions on unseen data.
A grammar is a set of rules that specify the components of well-structured sentences in a language.
A network graph is a data structure made of nodes connected by edges and can be used to model complex relationships, including textual and intertextual relationships.
A graph analytics approach to text analysis leverages the structure of graphs and the computational measures of graph theory to understand relationships between entities or other textual elements.
A hapax is a term that only appears once in a corpus.
In a neural network, a hidden layer consists of neurons and synapses that connect the input layer to the output layer. Synapses transmit signals between neurons, whose activation functions buffer incoming signals; the synapse weights are adjusted as the model is trained.
Hierarchical clustering is a type of unsupervised learning that produces clusters with a predetermined ordering in a tree-structure so that a variable number of clusters exist at each level. Hierarchical models can be either agglomerative (bottom up) or divisive (top down).
In machine learning, hyperparameters are the parameters that define how the model operates; they are not directly learned during fit but are defined on instantiation. Examples include the alpha (penalty) for regularization, the kernel function in a support vector machine, the number of leaves or depth of a decision tree, the number of neighbors used in a nearest neighbor classifier, and the number of clusters in a k-means clustering.
In the context of data science, ingestion is the process by which we collect and store data.
In machine learning, instances are the points on which algorithms operate. In the context of text analytics, an instance is an entire document or complete utterance.
A language model attempts to take as input an incomplete phrase and infer the following words that most likely complete the utterance.
Latent Dirichlet Allocation is a topic discovery technique, in which topics are represented as the probability that each of a given set of terms will occur. Documents can in turn be represented in terms of a mixture of these topics.
Latent Semantic Analysis is a vector-based approach that can be used as a topic modeling technique. It decomposes a sparse term-document matrix in order to find groups of documents that use the same words.
In the context of text analysis, a lexicon is a set of all of the unique vocabulary words from a corpus. Lexical resources often include mappings from this set to other utilities such as word senses, synonym sets, or phonetic representations.
A long tail, or Zipfian distribution, displays a large number of occurrences far from the central part of the frequency distribution.
Machine learning describes a broad set of methods for extracting meaningful patterns from existing data and applying those patterns to make decisions or predictions on future data.
The model selection triple describes a general machine learning workflow that involves repeated iteration through feature engineering, model selection, and hyperparameter tuning to arrive at the most accurate, generalizable model.
Morphology is the form of things, such as individual words or tokens. Morphological analysis describes the process of understanding how words are constructed and how word forms influence their part-of-speech.
Multiprocessing refers to the use of more than one central processing unit (CPU) at a time, and to the ability of a system to support or allocate tasks between more than one processor at a time.
An n-gram is an ordered sequence of either characters or words of length N.
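A sketch of extracting word n-grams with NLTK (assumes the punkt tokenizer data has been downloaded):

```python
from nltk import ngrams, word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(list(ngrams(tokens, 3)))  # trigrams: every ordered window of three tokens
```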
Natural language processing refers to a suite of computational techniques for mapping between formal and natural languages.
Natural language understanding is a subtopic within natural language processing that refers to the computational techniques used to approximate the interpretation of natural language.
In the context of a network graph G and a given node N, the neighborhood of N is the subgraph F of G that contains all of the nodes adjacent (i.e., connected via an edge) to N.
A network is a data structure made of nodes connected by edges and can be used to model complex relationships, including textual and intertextual relationships. See also “graph.”
Neural networks refer to a family of models that are defined by an input layer (a vectorized representation of input data), a hidden layer that consists of neurons and synapses, and an output layer with the predicted values. Within the hidden layer, synapses transmit signals between neurons, which rely on an activation function to buffer incoming signals. The synapses apply weights to incoming values, and the activation function determines if the weighted inputs are sufficiently high to activate the neuron and pass the values on to the next layer of the network.
In the context of a graph data structure, a node is the fundamental unit of data. Nodes are connected by edges to form networks.
One-hot encoding is a boolean vector encoding method that marks a particular vector index with a value of true if the token exists in the document and false if it does not.
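One way to produce one-hot document vectors is Scikit-Learn's CountVectorizer with binary=True (hypothetical corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the cat sat on the mat"]
onehot = CountVectorizer(binary=True)

# 1 marks tokens present in the document, 0 marks tokens that are absent.
print(onehot.fit_transform(corpus).toarray())
```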
An ontology is a data structure that encodes meaning by specifying the properties and relationships of concepts and categories in a particular domain of discourse.
In the context of a network graph G, the order of G is defined as the number of nodes in G.
In the context of supervised learning, overfitting a model means that the model has memorized the training data and is completely accurate on data it has seen before, but varies widely on unseen data.
A paragraph vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length documents, which enables us to extend word2vec to document-length instances.
Parallelism refers to multiprocessing computation and includes task parallelism (where different, independent operations run simultaneously on the same data) and data parallelism (where the same operation is being applied to many different inputs simultaneously).
In the context of text analytics, parsing is the process of breaking utterances down into composite pieces (e.g., documents into paragraphs, paragraphs into sentences, sentences into tokens), then building them into syntactic or semantic structures that can be computed upon.
Parts-of-speech are the classes assigned to parsed text that indicate how tokens are functioning in the context of a sentence. Example parts-of-speech include nouns, verbs, adjectives, and adverbs.
In the context of text analytics, partitive clustering methods partition documents into groups that are represented by a central vector (the centroid) or described by a density of documents per cluster. Centroids represent an aggregated value (e.g., mean or median) of all member documents and are a convenient way to describe documents in that cluster.
Perplexity is a measure of how predictable the text is by evaluating the entropy (the level of uncertainty or surprisal) of the language model’s probability distribution.
In the context of text analytics, a model pipeline is a method for chaining together a series of transformers that combine (for instance) normalization, vectorization, and feature analysis into a single, well-defined mechanism.
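A sketch of a Scikit-Learn Pipeline that chains vectorization and classification into a single estimator; the training and test documents and labels are assumed to exist:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier()),
])

model.fit(train_docs, train_labels)
predictions = model.predict(test_docs)
```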
Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class, it is defined as the ratio of true positives to the sum of true and false positives. Said another way, “For all instances classified as positive, what percent was correct?”
Principal Component Analysis is a method for transforming features into a new coordinate system that captures as much of the variability in the data as possible. PCA is often used as a dimensionality reduction technique for dense data.
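A PCA sketch with Scikit-Learn, assuming a dense feature matrix X:

```python
from sklearn.decomposition import PCA

# Project the features onto the two directions of greatest variance.
reduced = PCA(n_components=2).fit_transform(X)
```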
In the context of a network graph, a property graph embeds information into the graph by allowing for labels and weights to be stored as additional information on graph nodes and edges.
Recall is the ability of a classifier to find all positive instances. For each class, it is defined as the ratio of true positives to the sum of true positives and false negatives. Said another way, “For all instances that were actually positive, what percent was classified correctly?”
Record linkage is one of three primary tasks involved in entity resolution, which entails identifying records that reference the same entity across different sources.
Regression is a supervised learning technique that attempts to learn patterns between instances composed of independent variables and their relationship to a continuous target variable. A regressor can be trained to minimize error between predicted and actual values in the training data, and once fit, can be deployed to assign predicted target values to new instances based on the patterns detected during training.
RSS is a category of web-based feeds that publish updates to online content in a standardized, computer-readable format.
Scraping refers to the process (whether automated, semiautomated, or manual) of gathering and copying information from the web to a data store.
In the context of text analytics, segmentation refers to the process of breaking paragraphs down into sentences to arrive at more granular units of discourse.
Semantics refer to the meaning of language (e.g., the meaning of a document or sentence).
Sentence boundaries such as capitalized words and certain punctuation marks indicate the beginning and ending of sentences. Most automated parsing and part-of-speech tagging tools rely on the existence of sentence boundaries.
Sentiment analysis refers to the process of computationally identifying and categorizing emotional polarity expressed in an utterance—e.g., to determine the relative negativity or positivity of the writer or speaker’s feelings.
Given a network graph G that contains nodes N and V, the shortest path between N and V is the one that contains the fewest edges.
A silhouette score is a method for quantifying the density and separation of clusters produced by a centroidal clustering model. The score is calculated by averaging the silhouette coefficient (density) for each sample, computed as the difference between the average intracluster distance and the mean nearest-cluster distance for each sample, normalized by the maximum value.
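A minimal sketch computing the mean silhouette coefficient for a k-means model with Scikit-Learn (X is an assumed document matrix):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

labels = KMeans(n_clusters=6).fit_predict(X)

# Scores near 1.0 indicate dense, well-separated clusters.
print(silhouette_score(X, labels))
```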
Singular Value Decomposition is a matrix factorization technique that transforms an original feature space into three matrices, including a diagonal matrix of singular values that describe a subspace. Singular Value Decomposition is a popular dimensionality reduction technique for sparse data and is used in Latent Semantic Analysis (LSA).
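A dimensionality reduction sketch using Scikit-Learn's TruncatedSVD on an assumed sparse term-document matrix X (this is the decomposition underlying LSA):

```python
from sklearn.decomposition import TruncatedSVD

# Approximate the sparse term-document matrix with 100 latent components.
lsa = TruncatedSVD(n_components=100)
X_reduced = lsa.fit_transform(X)
```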
In a graph G, the size of G is defined as the number of edges it contains.
Steering is the process of guiding the machine learning process—e.g., by visually evaluating a series of different classification report heat maps to determine which fitted model is most performant, or inspecting the trade-off between bias and variance along different values of a certain hyperparameter.
Stopwords are words that are manually excluded from a text model, often because they occur very frequently in all documents in a corpus.
Symbolic language models treat text as discrete sequences of tokens with probabilities of occurrence.
The synset for a word W is a collection of cognitive synonyms that express distinct concepts related to W.
Syntax describes the sentence formation rules defined by grammar.
T-distributed stochastic neighbor embedding is a nonlinear dimensionality reduction method. t-SNE can be used to cluster similar documents by decomposing high-dimensional document vectors into two dimensions using probability distributions from both the original dimensionality and the decomposed dimensionality.
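A t-SNE sketch with Scikit-Learn, assuming a dense matrix of high-dimensional document vectors X:

```python
from sklearn.manifold import TSNE

# Embed the document vectors into two dimensions for plotting.
coords = TSNE(n_components=2).fit_transform(X)
```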
Term frequency–inverse document frequency is an encoding method that normalizes the frequency of tokens in a document with respect to the rest of the corpus. TF–IDF measures the relevance of a token to a document by the scaled frequency of the appearance of the term in the document, normalized by the inverse of the scaled frequency of the term in the entire corpus.
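A minimal TF-IDF encoding sketch with Scikit-Learn (the corpus is an assumed list of documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

# Rows are documents; columns are term weights scaled by inverse document frequency.
X = tfidf.fit_transform(corpus)
```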
Tokens are the atomic unit of data in text analysis. They are strings of encoded bytes that represent semantic information, but do not contain any other information (such as a word sense).
Tokenization is the process of breaking down sentences by isolating tokens.
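A sketch of sentence and word tokenization with NLTK (assumes the punkt tokenizer models have been downloaded):

```python
from nltk import sent_tokenize, word_tokenize

text = "Tokenization breaks text down. It isolates individual tokens."
for sentence in sent_tokenize(text):
    print(word_tokenize(sentence))
```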
Topic modeling is an unsupervised machine learning technique for abstracting topics from collections of documents. See also “clustering.”
In supervised machine learning, data is divided into training and test splits on which models can be fit independently in order to compare (cross-validate) models and estimate in advance which will be most performant with unseen data. Dividing data into train and test splits is generally used to ensure that the model does not become overfit and is generalizable with respect to data the model was not trained on.
A transformer is a special type of estimator that creates a new dataset from an old one based on rules that it has learned from the fitting process.
In a network graph, transitivity is a measure of the likelihood that two nodes with a common connection are neighbors.
In the context of a graph, traversal is the process of traveling between nodes along edges.
Underfitting a model generally describes the scenario where a fitted model makes the same predictions every time (i.e., has low variance), but deviates from the correct answer by a significant amount (i.e., has high bias). Underfitting is symptomatic of not having enough data points, or not training a complex enough model.
Unsupervised learning or clustering is a way of discovering hidden structures in unlabeled data. Clustering algorithms aim to discover latent patterns in unlabeled data using features to organize instances into meaningfully dissimilar groups.
Utterances are short, self-contained chains of spoken or written speech. In speech analysis, utterances are usually bound by clear pauses. In text analysis, utterances are typically bound by punctuation meant to convey pauses.
Variance is one of two sources of error in supervised learning problems, computed as the average of the squared distances from each point to the mean. Low variance is an indication of an underfit model, which generally makes the same predictions every time regardless of the features. High variance is an indication of overfit, when the estimator has memorized the training data and may generalize poorly on unseen data.
Vectorization is the process of transforming non-numeric data (e.g., text, images, etc.) into vector representations on which machine learning methods can be applied.
A visualizer is a visual diagnostic tool that extends estimators to allow human steering of the feature analysis, model selection, and hyperparameter tuning processes (i.e., the model selection triple).
Word sense refers to the intended meaning of a particular word, given a context and assuming that many words have multiple connotations, interpretations, and usages.
The word2vec algorithm implements a word embedding model that produces distributed representations of text, such that words are embedded in space along with similar words based on their context.
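A word2vec sketch using Gensim; the tokenized corpus is hypothetical and the parameter names (e.g., vector_size) follow Gensim 4.x:

```python
from gensim.models import Word2Vec

sentences = [
    ["natural", "language", "processing"],
    ["language", "models", "embed", "words"],
]

model = Word2Vec(sentences, vector_size=50, min_count=1)

# Words that appear in similar contexts end up near each other in the embedding space.
print(model.wv.most_similar("language"))
```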
Write-once read-many (or WORM) storage refers to the practice of persisting a version of the original data that is not modified during the extraction, transformation, or modeling phases.