The validation dataset is a portion of the data that is held out and not used to train the model. It is used later to tune hyperparameters and estimate how well the model generalizes.
The validation dataset is not the same as the test dataset (another portion of the data that is also held out during the training phase). The difference is that the validation dataset is used for model selection and tuning, while the test dataset is used only for the final evaluation, after the model has been completely tuned.
However, there are cases where a single validation split is not enough to tune the hyperparameters reliably. In such cases, k-fold cross-validation is performed on the model, as sketched below.
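A minimal sketch of 5-fold cross-validation with sklearn follows; the synthetic data and the logistic-regression model are placeholder assumptions for illustration, not the chapter's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the chapter's feature matrix and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 5-fold cross-validation: the data is split into 5 parts, and each part
# takes a turn as the validation fold while the other 4 train the model.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```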
The two splits can be inspected as follows:
validation_set.shape, training_set.shape
The output reports the number of rows and columns in each set.
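The exact shapes depend on the data at hand, but here is a minimal sketch of how such a split is typically produced with train_test_split; the synthetic data and the 60/20/20 proportions are placeholder assumptions, not the chapter's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder data; the chapter uses its own corpus.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First carve off a held-out test set, then split the remainder
# into training and validation sets.
X_rest, test_set, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
training_set, validation_set, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(validation_set.shape, training_set.shape)  # (200, 20) (600, 20)
```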
Once we have sorted the data into different sets, we import the modules that we will use to model the data. We will use sklearn for this. If you do not have sklearn installed yet, install it with pip:
pip install scikit-learn
We start with CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
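As a quick illustration of what the class does with its default settings (the toy documents below are our own, not the book's corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.", "The cat hacked the mat!"]

vectorizer = CountVectorizer()            # defaults: lowercase, word tokens
counts = vectorizer.fit_transform(docs)   # sparse document-term matrix

# On scikit-learn < 1.0, use get_feature_names() instead.
print(vectorizer.get_feature_names_out()) # vocabulary learned from docs
print(counts.toarray())                   # token counts per document
```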
CountVectorizer is a widely used class that converts a collection of text documents to a matrix of token counts. CountVectorizer can transform uppercase to lowercase and can strip punctuation marks. However, CountVectorizer cannot be used to stem words. Stemming refers to cutting off the beginning or the end of a word to account for prefixes and suffixes.
Basically, the idea of stemming is to reduce derived words to their stem word.
Here is an example:
| Stem word | Before stemming |
| --- | --- |
| Hack | Hacked |
| Hack | Hacking |
| Cat | Catty |
| Cat | Catlike |
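Because CountVectorizer cannot stem on its own, a stemmer has to be plugged in from outside, for example through the tokenizer parameter. A minimal sketch using NLTK's PorterStemmer (assumes nltk is installed; the whitespace split is a deliberate simplification of real tokenization):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stemming_tokenizer(text):
    # Crude whitespace split for illustration; real code would tokenize
    # properly before stemming each token.
    return [stemmer.stem(token) for token in text.split()]

vectorizer = CountVectorizer(tokenizer=stemming_tokenizer)
counts = vectorizer.fit_transform(["hacking hacked hack"])
print(vectorizer.get_feature_names_out())  # all three forms collapse to 'hack'
```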
Similar to stemming, CountVectorizer cannot perform lemmatization of the source text on its own either. Lemmatization refers to the morphological analysis of words: it removes inflectional endings and returns the root word, which is called the lemma. Thus, in a way, lemmatization and stemming are closely related to each other:
| Root word | Un-lemmatized word |
| --- | --- |
| good | better |
| good | best |
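As with stemming, lemmatization has to come from an external library if it is needed; NLTK's WordNetLemmatizer is one common choice. A minimal sketch (requires the WordNet corpus, which the first line downloads):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
# pos="a" tells WordNet to treat the word as an adjective.
print(lemmatizer.lemmatize("better", pos="a"))  # expected: good
print(lemmatizer.lemmatize("best", pos="a"))    # expected: good
```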
CountVectorizer creates bag-of-words features when the n-gram range is set to (1, 1). Depending on the range we provide, it can also generate bigrams, trigrams, and so on, as sketched below.
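A minimal sketch of the effect of ngram_range (the single toy sentence is our own):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]

# ngram_range=(1, 1) yields a plain bag of words (unigrams only).
bow = CountVectorizer(ngram_range=(1, 1)).fit(docs)
print(bow.get_feature_names_out())

# ngram_range=(1, 2) adds bigrams alongside the unigrams.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(bigrams.get_feature_names_out())
```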
Next, we import sklearn's pipeline utilities, which let us chain these steps together:
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
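A minimal sketch of how these utilities fit together, chaining the vectorizer into a classifier; the MultinomialNB model and the two-document corpus are placeholder assumptions, not necessarily the chapter's choices:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# make_pipeline chains the vectorizer and the classifier so that raw
# text goes in one end and predictions come out the other.
model = make_pipeline(CountVectorizer(), MultinomialNB())

train_texts = ["free money now", "meeting at noon"]  # placeholder corpus
train_labels = [1, 0]                                # placeholder labels

model.fit(train_texts, train_labels)
print(model.predict(["free meeting"]))
```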