The multinomial Naive Bayes classifier is suitable for classification with discrete features (for example, word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as TF-IDF may also work:
pipeline_parts = [
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
]
simple_pipeline = Pipeline(pipeline_parts)
A simple pipeline with Naive Bayes and the CountVectorizer is created as shown previously.
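As noted earlier, fractional counts such as TF-IDF often work in practice even though the multinomial model is defined over integer counts. A minimal sketch of the same pipeline with TfidfVectorizer swapped in (the corpus and labels here are made up purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same two-step pipeline, but with TF-IDF weights instead of raw counts.
tfidf_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])

# Tiny illustrative corpus: 1 = written by the author, 0 = someone else.
texts = ["coffee first, then code", "just shipped a new release",
         "buy cheap followers now", "click here for free prizes"]
labels = [1, 1, 0, 0]

tfidf_pipeline.fit(texts, labels)
print(tfidf_pipeline.predict(["shipped more code after coffee"]))
```

Because the two pipelines share the same step names, the same grid search parameters can be reused against either one.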
Import GridSearchCV as shown here:
from sklearn.model_selection import GridSearchCV
GridSearchCV performs an exhaustive search over specified parameter values for an estimator, and is therefore useful for hyperparameter tuning. It implements a fit and a score method. It also implements predict, predict_proba, decision_function, transform, and inverse_transform if they are implemented in the estimator used.
The parameters of the estimator used to apply these methods are optimized by a cross-validated grid search over a parameter grid as follows:
simple_grid_search_params = {
    "vectorizer__ngram_range": [(1, 1), (1, 3), (1, 5)],
    "vectorizer__analyzer": ["word", "char", "char_wb"],
}
grid_search = GridSearchCV(simple_pipeline, simple_grid_search_params)
grid_search.fit(X, y)
We set the grid search parameters and fit the grid search. The output is shown as follows:
Out[97]: GridSearchCV(cv=None, error_score='raise',
estimator=Pipeline(memory=None,
steps=[('vectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), pre...one, vocabulary=None)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
fit_params=None, iid=True, n_jobs=1,
param_grid={'vectorizer__ngram_range': [(1, 1), (1, 3), (1, 5)], 'vectorizer__analyzer': ['word', 'char', 'char_wb']},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=0)
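Besides the fitted repr shown above, GridSearchCV exposes the winning parameter combination via best_params_ and a per-combination score table via cv_results_. A short self-contained sketch on a toy grid (the texts, labels, and the reduced grid here are illustrative, and cv=2 is used only because the toy set is tiny):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vectorizer', CountVectorizer()),
                     ('classifier', MultinomialNB())])

# A reduced grid: just two n-gram ranges to compare.
params = {'vectorizer__ngram_range': [(1, 1), (1, 2)]}

texts = ["good morning world", "hello good people", "free money now",
         "win cash fast", "good day everyone", "claim your prize now"]
labels = [1, 1, 0, 0, 1, 0]

grid_search = GridSearchCV(pipeline, params, cv=2)
grid_search.fit(texts, labels)

print(grid_search.best_params_)                    # winning combination
print(grid_search.cv_results_['mean_test_score'])  # one score per combination
```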
We obtain the best cross-validated accuracy as follows:
grid_search.best_score_  # best cross-validated accuracy
model = grid_search.best_estimator_
model.predict_proba([my_tweets[0]])  # % False, % True

Finally, we test the accuracy of the model using the accuracy_score function available in the sklearn.metrics package:
from sklearn.metrics import accuracy_score
accuracy_score(validation_set['is_mine'],
               model.predict(validation_set['text']))  # accuracy on the validation set. Very good!
The model achieves an accuracy of more than 90%:
0.9210526315789473
This model can now be used to monitor timelines to spot whether an author's style has changed or they are being hacked.
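A sketch of that monitoring idea follows. The flag_suspect_tweets helper and the 0.5 threshold are hypothetical choices for illustration, and the demo model below stands in for the fitted grid search pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def flag_suspect_tweets(model, tweets, threshold=0.5):
    """Return (tweet, probability) pairs the model doubts are the author's."""
    flagged = []
    for tweet in tweets:
        # With labels False/True, column 1 of predict_proba is P(is_mine=True).
        p_mine = model.predict_proba([tweet])[0][1]
        if p_mine < threshold:
            flagged.append((tweet, p_mine))
    return flagged

# Illustrative stand-in for the fitted grid-search model.
demo_model = Pipeline([('vectorizer', CountVectorizer()),
                       ('classifier', MultinomialNB())])
demo_model.fit(["good morning all", "coffee then code",
                "free money click here", "win a prize now"],
               [True, True, False, False])

suspicious = flag_suspect_tweets(demo_model, ["click here to win free money"])
print(suspicious)
```

Run periodically over a timeline, a sustained rise in flagged tweets would suggest a change in writing style or a compromised account.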