Chapter 18. Naive Bayes

18.0 Introduction

Bayes’ theorem is the premier method for understanding the probability of some event, P(A | B), given some new information, P(B | A), and a prior belief in the probability of the event, P(A):

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}

The Bayesian method’s popularity has skyrocketed in the last decade, increasingly rivaling traditional frequentist approaches in academia, government, and business. In machine learning, one application of Bayes’ theorem to classification comes in the form of the naive Bayes classifier. Naive Bayes classifiers combine several desirable qualities of practical machine learning into a single classifier. These include:

  1. An intuitive approach

  2. The ability to work with small data

  3. Low computation costs for training and prediction

  4. Often solid results in a variety of settings

Specifically, a naive Bayes classifier is based on:

P(y \mid x_1, \ldots, x_j) = \frac{P(x_1, \ldots, x_j \mid y) \, P(y)}{P(x_1, \ldots, x_j)}

where:

  • P(y | x1, ⋯, xj) is called the posterior and is the probability that an observation is class y given the observation’s values for the j features, x1, ⋯, xj.

  • P(x1, ⋯, xj | y) is called the likelihood and is the likelihood of an observation’s values for the features, x1, ⋯, xj, given its class, y.

  • P(y) is called the prior and is our belief for the probability of class y before looking at the data.

  • P(x1, ⋯, xj) is called the marginal probability.

In naive Bayes, we compare an observation’s posterior values for each possible class. Specifically, because the marginal probability is constant across these comparisons, we compare the numerators of the posterior for each class. For each observation, the class with the greatest posterior numerator becomes the predicted class, ŷ.
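
To make the comparison concrete, here is a toy sketch for a single observation and three candidate classes; the priors and likelihoods are made-up numbers, not output from scikit-learn:

# Load library
import numpy as np

# Hypothetical priors P(y) and likelihoods P(x1, ..., xj | y) for three classes
priors = np.array([0.2, 0.5, 0.3])
likelihoods = np.array([0.10, 0.02, 0.08])

# Posterior numerators; the marginal probability cancels out of the comparison
numerators = priors * likelihoods

# The predicted class is the one with the largest numerator (here, class 2)
predicted_class = np.argmax(numerators)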

There are two important things to note about naive Bayes classifiers. First, for each feature in the data, we have to assume the statistical distribution of the likelihood, P(xj | y). The common distributions are the normal (Gaussian), multinomial, and Bernoulli distributions. The distribution chosen is often determined by the nature of features (continuous, binary, etc.). Second, naive Bayes gets its name because we assume that each feature, and its resulting likelihood, is independent. This “naive” assumption is frequently wrong, yet in practice does little to prevent building high-quality classifiers.

In this chapter we will cover using scikit-learn to train three types of naive Bayes classifiers using three different likelihood distributions.

18.1 Training a Classifier for Continuous Features

Problem

You have only continuous features and you want to train a naive Bayes classifier.

Solution

Use a Gaussian naive Bayes classifier in scikit-learn:

# Load libraries
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create Gaussian Naive Bayes object
classifer = GaussianNB()

# Train model
model = classifer.fit(features, target)

Discussion

The most common type of naive Bayes classifier is the Gaussian naive Bayes. In Gaussian naive Bayes, we assume that the likelihood of the feature values, x, given that an observation is of class y, follows a normal distribution:

p(x_j \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \, e^{-\frac{(x_j - \mu_y)^2}{2\sigma_y^2}}

where σy² and μy are the variance and mean values of feature xj for class y. Because of the assumption of the normal distribution, Gaussian naive Bayes is best used in cases when all our features are continuous.
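
To connect the formula to numbers, the following sketch evaluates the Gaussian likelihood directly; the feature value, class mean, and class variance are made up for illustration:

# Load library
import numpy as np

# Hypothetical feature value, class mean, and class variance
x_j = 5.0
mu_y = 4.5
sigma2_y = 0.64

# Evaluate p(x_j | y) under the normal distribution assumption
likelihood = (1 / np.sqrt(2 * np.pi * sigma2_y)) * \
             np.exp(-(x_j - mu_y)**2 / (2 * sigma2_y))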

In scikit-learn, we train a Gaussian naive Bayes like any other model using fit, and in turn can then make predictions about the class of an observation:

# Create new observation
new_observation = [[ 4,  4,  4,  0.4]]

# Predict class
model.predict(new_observation)
array([1])

One of the interesting aspects of naive Bayes classifiers is that they allow us to assign a prior belief over the respective target classes. We can do this using GaussianNB’s priors parameter, which takes in a list of the probabilities assigned to each class of the target vector:

# Create Gaussian Naive Bayes object with prior probabilities of each class
clf = GaussianNB(priors=[0.25, 0.25, 0.5])

# Train model
model = clf.fit(features, target)

If we do not add any argument to the priors parameter, the prior is learned from the class frequencies in the data.
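
For example, after fitting a GaussianNB without a priors argument we can inspect the priors it learned from the class frequencies; they are stored in the class_prior_ attribute. Because each of the three iris classes contains 50 observations, each learned prior is 1/3:

# View the priors learned from the data
GaussianNB().fit(features, target).class_prior_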

Finally, note that the raw predicted probabilities from Gaussian naive Bayes (output using predict_proba) are not calibrated. That is, they should not be believed. If we want to create useful predicted probabilities, we will need to calibrate them using isotonic regression or a related method (see Recipe 18.4).

18.2 Training a Classifier for Discrete and Count Features

Problem

Given discrete or count data, you need to train a naive Bayes classifier.

Solution

Use a multinomial naive Bayes classifier:

# Load libraries
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Brazil is best',
                      'Germany beats both'])

# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

# Create feature matrix
features = bag_of_words.toarray()

# Create target vector
target = np.array([0,0,1])

# Create multinomial naive Bayes object with prior probabilities of each class
classifer = MultinomialNB(class_prior=[0.25, 0.5])

# Train model
model = classifer.fit(features, target)

Discussion

Multinomial naive Bayes works similarly to Gaussian naive Bayes, but the features are assumed to be multinomially distributed. In practice, this means that this classifier is commonly used when we have discrete data (e.g., movie ratings ranging from 1 to 5). One of the most common uses of multinomial naive Bayes is text classification using bags of words or tf-idf approaches (see Recipes 6.8 and 6.9).

In our solution, we created a toy text dataset of three observations, and converted the text strings into a bag-of-words feature matrix and an accompanying target vector. We then used MultinomialNB to train a model while defining the prior probabilities for the two classes (pro-brazil and pro-germany).
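
If we want to know which column of the feature matrix corresponds to which word (useful for reading the new observation below), the vectorizer can list its vocabulary; get_feature_names_out is the method name in recent scikit-learn versions (older versions use get_feature_names):

# View the word behind each feature column
count.get_feature_names_out()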

MultinomialNB works similarly to GaussianNB; models are trained using fit, and observations can be predicted using predict:

# Create new observation
new_observation = [[0, 0, 0, 1, 0, 1, 0]]

# Predict new observation's class
model.predict(new_observation)
array([0])

If class_prior is not specified, prior probabilities are learned using the data. However, if we want a uniform distribution to be used as the prior, we can set fit_prior=False.
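
For example, a one-line sketch of a multinomial naive Bayes with a uniform prior (the object is created but not trained here):

# Use a uniform prior rather than learning the prior from the data
multinomial_uniform_prior = MultinomialNB(class_prior=None, fit_prior=False)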

Finally, MultinomialNB contains an additive smoothing hyperparameter, alpha, that should be tuned. The default value is 1.0, with 0.0 meaning no smoothing takes place.
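
Tuning alpha is a model selection task; a quick sketch using a grid search might look like the following, where the parameter grid and the randomly generated count data are purely illustrative:

# Load libraries
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Generate a hypothetical count dataset large enough for cross-validation
features_counts = np.random.randint(5, size=(100, 7))
target_counts = np.random.randint(2, size=100)

# Candidate smoothing values (an illustrative grid, not a recommendation)
param_grid = {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]}

# Search over alpha using 5-fold cross-validation
grid = GridSearchCV(MultinomialNB(), param_grid, cv=5)
best_model = grid.fit(features_counts, target_counts)

# View the best smoothing value found on this (random) data
grid.best_params_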

18.3 Training a Naive Bayes Classifier for Binary Features

Problem

You have binary feature data and need to train a naive Bayes classifier.

Solution

Use a Bernoulli naive Bayes classifier:

# Load libraries
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Create three binary features
features = np.random.randint(2, size=(100, 3))

# Create a binary target vector
target = np.random.randint(2, size=(100, 1)).ravel()

# Create Bernoulli Naive Bayes object with prior probabilities of each class
classifer = BernoulliNB(class_prior=[0.25, 0.5])

# Train model
model = classifer.fit(features, target)

Discussion

The Bernoulli naive Bayes classifier assumes that all our features are binary such that they take only two values (e.g., a nominal categorical feature that has been one-hot encoded). Like its multinomial cousin, Bernoulli naive Bayes is often used in text classification, when our feature matrix is simply the presence or absence of a word in a document. Furthermore, like MultinomialNB, BernoulliNB has an additive smoothing hyperparameter, alpha, which we will want to tune using model selection techniques. Finally, if we want to use priors we can use the class_prior parameter with a list containing the prior probabilities for each class. If we want to specify a uniform prior instead, we can set fit_prior=False:

model_uniform_prior = BernoulliNB(class_prior=None, fit_prior=False)
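
As with the other naive Bayes classifiers, we can then make predictions using predict; because the features and target above were generated randomly, the predicted class will vary from run to run:

# Create a new binary observation
new_observation = [[1, 0, 1]]

# Predict the new observation's class (result depends on the random training data)
model.predict(new_observation)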

18.4 Calibrating Predicted Probabilities

Problem

You want to calibrate the predicted probabilities from naive Bayes classifiers so they are interpretable.

Solution

Use CalibratedClassifierCV:

# Load libraries
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create Gaussian Naive Bayes object
classifer = GaussianNB()

# Create calibrated cross-validation with sigmoid calibration
classifer_sigmoid = CalibratedClassifierCV(classifer, cv=2, method='sigmoid')

# Calibrate probabilities
classifer_sigmoid.fit(features, target)

# Create new observation
new_observation = [[ 2.6,  2.6,  2.6,  0.4]]

# View calibrated probabilities
classifer_sigmoid.predict_proba(new_observation)
array([[ 0.31859969,  0.63663466,  0.04476565]])

Discussion

Class probabilities are a common and useful part of machine learning models. In scikit-learn, most learning algorithms allow us to see the predicted probabilities of class membership using predict_proba. This can be extremely useful if, for instance, we want to predict a certain class only when the model estimates its probability to be over 90%. However, some models, including naive Bayes classifiers, output probabilities that are not based on the real world. That is, predict_proba might predict an observation has a 0.70 chance of being a certain class, when the reality is that it is 0.10 or 0.99. Specifically in naive Bayes, while the ranking of predicted probabilities for the different target classes is valid, the raw predicted probabilities tend to take on extreme values close to 0 and 1.
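
For instance, a small sketch of acting only on predictions the model is at least 90% sure about might look like this; the 0.9 threshold and the variable names are just for illustration:

# Load library
import numpy as np

# Predicted class probabilities for the training observations
probabilities = classifer_sigmoid.predict_proba(features)

# Keep only the observations whose top class probability exceeds 0.9
confident = probabilities.max(axis=1) > 0.9

# Map the winning column back to a class label for those observations
confident_classes = classifer_sigmoid.classes_[np.argmax(probabilities, axis=1)][confident]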

To obtain meaningful predicted probabilities we need to conduct what is called calibration. In scikit-learn we can use the CalibratedClassifierCV class to create well-calibrated predicted probabilities using k-fold cross-validation. In CalibratedClassifierCV the training sets are used to train the model and the test sets are used to calibrate the predicted probabilities. The returned predicted probabilities are the average of the k folds.

Using our solution we can see the difference between raw and well-calibrated predicted probabilities. In our solution, we created a Gaussian naive Bayes classifier. If we train that classifier and then predict the class probabilities for a new observation, we can see very extreme probability estimates:

# Train a Gaussian naive Bayes then predict class probabilities
classifer.fit(features, target).predict_proba(new_observation)
array([[  2.58229098e-04,   9.99741447e-01,   3.23523643e-07]])

However, after we calibrate the predicted probabilities (which we did in our solution), we get very different results:

# View calibrated probabilities
classifer_sigmoid.predict_proba(new_observation)
array([[ 0.31859969,  0.63663466,  0.04476565]])

CalibratedClassifierCV offers two calibration methods, Platt’s sigmoid model and isotonic regression, defined by the method parameter. While we don’t have the space to go into the specifics, because isotonic regression is nonparametric it tends to overfit when sample sizes are very small (e.g., 100 observations). In our solution we used the Iris dataset with 150 observations and therefore used Platt’s sigmoid model.
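
If we had a larger dataset or simply preferred the nonparametric option, switching to isotonic regression is only a change to the method parameter; this sketch reuses the classifier and data from the solution:

# Create calibrated cross-validation with isotonic calibration
classifer_isotonic = CalibratedClassifierCV(classifer, cv=2, method='isotonic')

# Calibrate probabilities
classifer_isotonic.fit(features, target)

# View calibrated probabilities for the new observation
classifer_isotonic.predict_proba(new_observation)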