Bayes’ theorem is the premier method for understanding the probability of some event, P(A | B), given some new information, P(B | A), and a prior belief in the probability of the event, P(A):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
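As a quick sketch of the arithmetic, using made-up numbers for the prior and the two conditional probabilities (these values are purely illustrative):

# Hypothetical numbers for illustration only
p_a = 0.01              # P(A): prior belief in the event
p_b_given_a = 0.90      # P(B | A): probability of the new information given the event
p_b_given_not_a = 0.05  # P(B | not A): probability of the new information otherwise

# P(B) by the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A | B)
p_a_given_b = (p_b_given_a * p_a) / p_b  # ~0.154, because the prior is low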
The Bayesian method’s popularity has skyrocketed in the last decade, increasingly rivaling traditional frequentist methods in academia, government, and business. In machine learning, one application of Bayes’ theorem to classification comes in the form of the naive Bayes classifier. Naive Bayes classifiers combine a number of desirable qualities in practical machine learning into a single classifier. These include:
An intuitive approach
The ability to work with small data
Low computation costs for training and prediction
Often solid results in a variety of settings
Specifically, a naive Bayes classifier is based on:

$$P(y \mid x_1, \dots, x_j) = \frac{P(x_1, \dots, x_j \mid y)\,P(y)}{P(x_1, \dots, x_j)}$$

where:
P(y | x1, ..., xj) is called the posterior and is the probability that an observation is class y given the observation’s values for the j features, x1, ..., xj.
P(x1, ..., xj | y) is called the likelihood and is the likelihood of an observation’s values for the features, x1, ..., xj, given their class, y.
P(y) is called the prior and is our belief for the probability of class y before looking at the data.
P(x1, ..., xj) is called the marginal probability.
In naive Bayes, we compare an observation’s posterior values for each possible class. Specifically, because the marginal probability is constant across these comparisons, we compare the numerators of the posterior for each class. For each observation, the class with the greatest posterior numerator becomes the predicted class, ŷ.
There are two important things to note about naive Bayes classifiers. First, for each feature in the data, we have to assume the statistical distribution of the likelihood, P(xj | y). The common distributions are the normal (Gaussian), multinomial, and Bernoulli distributions. The distribution chosen is often determined by the nature of features (continuous, binary, etc.). Second, naive Bayes gets its name because we assume that each feature, and its resulting likelihood, is independent. This “naive” assumption is frequently wrong, yet in practice does little to prevent building high-quality classifiers.
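A minimal sketch of this decision rule, assuming Gaussian likelihoods and made-up priors, means, and standard deviations for two hypothetical classes (none of these values come from a real dataset):

# Compare posterior numerators P(y) * prod_j P(x_j | y) for each class
import numpy as np
from scipy.stats import norm

observation = np.array([5.0, 3.4])

# Hypothetical priors and per-feature Gaussian parameters for two classes
classes = {
    0: {"prior": 0.5, "means": [5.0, 3.4], "stds": [0.4, 0.4]},
    1: {"prior": 0.5, "means": [6.3, 2.9], "stds": [0.5, 0.3]},
}

# The "naive" independence assumption lets us multiply per-feature likelihoods
numerators = {}
for y, params in classes.items():
    likelihood = np.prod(norm.pdf(observation, params["means"], params["stds"]))
    numerators[y] = params["prior"] * likelihood

# The predicted class is the one with the greatest posterior numerator
predicted_class = max(numerators, key=numerators.get)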
In this chapter we will cover using scikit-learn to train three types of naive Bayes classifiers using three different likelihood distributions.
Use a Gaussian naive Bayes classifier in scikit-learn:
# Load libraries
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create Gaussian Naive Bayes object
classifer = GaussianNB()

# Train model
model = classifer.fit(features, target)
The most common type of naive Bayes classifier is the Gaussian naive Bayes. In Gaussian naive Bayes, we assume that the likelihood of a feature’s value, x_j, given that an observation is of class y, follows a normal distribution:

$$P(x_j \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x_j - \mu_y)^2}{2\sigma_y^2}\right)$$

where σ_y^2 and μ_y are the variance and mean of feature x_j for class y. Because of the assumption of the normal distribution, Gaussian naive Bayes is best used in cases when all our features are continuous.
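These fitted parameters can be inspected on the trained model from the solution: the per-class means are stored in the theta_ attribute and the variances in var_ (named sigma_ in older scikit-learn versions). A quick sketch recomputing one likelihood by hand:

# Recompute P(x_j | y) by hand from the fitted parameters of the model above
# (use model.sigma_ instead of model.var_ on older scikit-learn versions)
import numpy as np

class_index = 1    # class y
feature_index = 2  # feature x_j

mean = model.theta_[class_index, feature_index]
variance = model.var_[class_index, feature_index]

x_j = 4.0
likelihood = (1 / np.sqrt(2 * np.pi * variance)) * np.exp(
    -(x_j - mean) ** 2 / (2 * variance))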
In scikit-learn, we train a Gaussian naive Bayes like any other model
using fit, and in turn can then make predictions about the class of an
observation:
# Create new observation
new_observation = [[4, 4, 4, 0.4]]

# Predict class
model.predict(new_observation)
array([1])
One of the interesting aspects of naive Bayes classifiers is that they
allow us to assign a prior belief over the respective target classes.
We can do this using GaussianNB’s priors parameter, which takes in
a list of the probabilities assigned to each class of the target vector:
# Create Gaussian Naive Bayes object with prior probabilities of each class
classifer = GaussianNB(priors=[0.25, 0.25, 0.5])

# Train model
model = classifer.fit(features, target)
If we do not add any argument to the priors parameter, the prior is
adjusted based on the data.
Finally, note that the raw predicted probabilities from Gaussian naive
Bayes (outputted using predict_proba) are not calibrated. That is,
they should not be believed. If we want to create useful predicted
probabilities, we will need to calibrate them using an isotonic
regression or a related method.
Use a multinomial naive Bayes classifier:
# Load libraries
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Brazil is best',
                      'Germany beats both'])

# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

# Create feature matrix
features = bag_of_words.toarray()

# Create target vector
target = np.array([0, 0, 1])

# Create multinomial naive Bayes object with prior probabilities of each class
classifer = MultinomialNB(class_prior=[0.25, 0.5])

# Train model
model = classifer.fit(features, target)
Multinomial naive Bayes works similarly to Gaussian naive Bayes, but the features are assumed to be multinomially distributed. In practice, this means that this classifier is commonly used when we have discrete data (e.g., movie ratings ranging from 1 to 5). One of the most common uses of multinomial naive Bayes is text classification using bags of words or tf-idf approaches (see Recipes 6.8 and 6.9).
In our solution, we created a toy text dataset of three observations,
and converted the text strings into a bag-of-words feature matrix and an
accompanying target vector. We then used MultinomialNB to train a
model while defining the prior probabilities for the two classes
(pro-Brazil and pro-Germany).
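To see which column of the bag-of-words matrix corresponds to which word (useful when constructing a new observation by hand, as in the prediction below), we can inspect the vectorizer’s vocabulary; note that this method is named get_feature_names in older scikit-learn versions:

# View the words behind the feature matrix columns
# (seven lowercase words: beats, best, both, brazil, germany, is, love)
count.get_feature_names_out()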
MultinomialNB works similarly to GaussianNB; models are trained using
fit, and observations can be predicted using predict:
# Create new observation
new_observation = [[0, 0, 0, 1, 0, 1, 0]]

# Predict new observation's class
model.predict(new_observation)
array([0])
If class_prior is not specified, prior probabilities are learned using
the data. However, if we want a uniform distribution to be used as
the prior, we can set fit_prior=False.
Finally, MultinomialNB contains an additive smoothing hyperparameter,
alpha, that should be tuned. The default value is 1.0, with 0.0
meaning no smoothing takes place.
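A sketch of one way to tune alpha with cross-validated grid search; the count data here is synthetic and the candidate values are arbitrary:

# Tune the additive smoothing hyperparameter alpha via grid search
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic nonnegative count features and a binary target (illustration only)
rng = np.random.default_rng(0)
count_features = rng.integers(0, 5, size=(100, 10))
count_target = rng.integers(0, 2, size=100)

# Search over a handful of arbitrary alpha values
search = GridSearchCV(MultinomialNB(),
                      param_grid={"alpha": [0.01, 0.1, 0.5, 1.0, 2.0]},
                      cv=5)
search.fit(count_features, count_target)
search.best_params_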
Use a Bernoulli naive Bayes classifier:
# Load libraries
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Create three binary features
features = np.random.randint(2, size=(100, 3))

# Create a binary target vector
target = np.random.randint(2, size=(100, 1)).ravel()

# Create Bernoulli Naive Bayes object with prior probabilities of each class
classifer = BernoulliNB(class_prior=[0.25, 0.5])

# Train model
model = classifer.fit(features, target)
The Bernoulli naive Bayes classifier assumes that all our features are
binary such that they take only two values (e.g., a nominal categorical
feature that has been one-hot encoded). Like its multinomial cousin,
Bernoulli naive Bayes is often used in text classification, when our
feature matrix is simply the presence or absence of a word in a
document. Furthermore, like MultinomialNB, BernoulliNB has an
additive smoothing hyperparameter, alpha, which we will want to tune
using model selection techniques. Finally, if we want to use priors we can use
the class_prior parameter with a list containing the prior
probabilities for each class. If we want to specify a uniform prior, we
can set fit_prior=False:
model_uniform_prior = BernoulliNB(class_prior=None, fit_prior=False)
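As with the other naive Bayes variants, we can then make predictions with the trained model from the solution; a quick sketch using an arbitrary binary observation:

# Predict the class of a new observation with three binary features
new_observation = [[1, 0, 1]]
model.predict(new_observation)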
Use CalibratedClassifierCV:
# Load libraries
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create Gaussian Naive Bayes object
classifer = GaussianNB()

# Create calibrated cross-validation with sigmoid calibration
classifer_sigmoid = CalibratedClassifierCV(classifer, cv=2, method='sigmoid')

# Calibrate probabilities
classifer_sigmoid.fit(features, target)

# Create new observation
new_observation = [[2.6, 2.6, 2.6, 0.4]]

# View calibrated probabilities
classifer_sigmoid.predict_proba(new_observation)
array([[ 0.31859969, 0.63663466, 0.04476565]])
Class probabilities are a common and useful part of machine learning
models. In scikit-learn, most learning algorithms allow us to see the
predicted probabilities of class membership using predict_proba. This
can be extremely useful if, for instance, we only want to predict a
certain class when the model estimates its probability to be over
90%. However, some models, including naive Bayes
classifiers, output probabilities that are not based on the real world.
That is, predict_proba might predict an observation has a 0.70 chance
of being a certain class, when the reality is that it is 0.10 or 0.99.
Specifically in naive Bayes, while the ranking of predicted
probabilities for the different target classes is valid, the raw
predicted probabilities tend to take on extreme values close to 0 and 1.
To obtain meaningful predicted probabilities we need to conduct what is
called calibration. In scikit-learn we can use the
CalibratedClassifierCV class to create well-calibrated predicted
probabilities using k-fold cross-validation. In CalibratedClassifierCV
the training folds are used to train the model and the held-out fold is
used to calibrate the predicted probabilities. The returned predicted
probabilities are the average of the k folds.
Using our solution we can see the difference between raw and well-calibrated predicted probabilities. In our solution, we created a Gaussian naive Bayes classifier. If we train that classifier and then predict the class probabilities for a new observation, we can see very extreme probability estimates:
# Train a Gaussian naive Bayes then predict class probabilities
classifer.fit(features, target).predict_proba(new_observation)
array([[ 2.58229098e-04, 9.99741447e-01, 3.23523643e-07]])
However, after we calibrate the predicted probabilities (which we did in our solution), we get very different results:
# View calibrated probabilities
classifer_sigmoid.predict_proba(new_observation)
array([[ 0.31859969, 0.63663466, 0.04476565]])
CalibratedClassifierCV offers two calibration methods, Platt’s
sigmoid model and isotonic regression, selected via the method
parameter. While we don’t have the space to go into the specifics,
because isotonic regression is nonparametric it tends to overfit when
sample sizes are very small (e.g., 100 observations). In our solution we
used the Iris dataset with 150 observations and therefore used
Platt’s sigmoid model.
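If we had a larger dataset, we could switch to isotonic regression simply by changing the method argument; a sketch reusing the objects from our solution:

# Create calibrated cross-validation with isotonic calibration
classifer_isotonic = CalibratedClassifierCV(GaussianNB(), cv=2, method='isotonic')

# Calibrate probabilities
classifer_isotonic.fit(features, target)

# View calibrated probabilities
classifer_isotonic.predict_proba(new_observation)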