The validation dataset is a portion of the data that is held out and not used to train the model. It is used later to tune hyperparameters and estimate how well the model generalizes.
The validation dataset is not the same as the test dataset (another portion of the data that is also held out during the training phase). The difference is that the validation dataset is used for model selection and tuning, while the test dataset is used only for the final evaluation, after the model has been completely tuned.
However, there are cases where a single validation split is not enough to tune the hyperparameters reliably. In such cases, k-fold cross-validation is performed on the model, as sketched below.
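A minimal sketch of 5-fold cross-validation with sklearn follows; the synthetic data and the logistic-regression model are placeholder assumptions for illustration, not the chapter's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the chapter's feature matrix and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 5-fold cross-validation: the data is split into 5 parts, and each part
# takes a turn as the validation fold while the other 4 train the model.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```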
The two splits can be inspected as follows:
validation_set.shape, training_set.shape
The output reports the number of rows and columns in each set.
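The exact shapes depend on the data at hand, but here is a minimal sketch of how such a split is typically produced with train_test_split; the synthetic data and the 60/20/20 proportions are placeholder assumptions, not the chapter's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder data; the chapter uses its own corpus.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First carve off a held-out test set, then split the remainder
# into training and validation sets.
X_rest, test_set, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
training_set, validation_set, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(validation_set.shape, training_set.shape)  # (200, 20) (600, 20)
```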
Once we have sorted the data into different sets, we import the modules that we will use to model the data. We will use sklearn for this. If you do not have sklearn installed yet, install it with pip:
pip install scikit-learn
We start with CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
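As a quick illustration of what the class does with its default settings (the toy documents below are our own, not the book's corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.", "The cat hacked the mat!"]

vectorizer = CountVectorizer()            # defaults: lowercase, word tokens
counts = vectorizer.fit_transform(docs)   # sparse document-term matrix

# On scikit-learn < 1.0, use get_feature_names() instead.
print(vectorizer.get_feature_names_out()) # vocabulary learned from docs
print(counts.toarray())                   # token counts per document
```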
CountVectorizer is a widely used class that converts a collection of text documents to a matrix of token counts. CountVectorizer can transform uppercase to lowercase and can strip punctuation marks. However, CountVectorizer cannot be used to stem words. Stemming refers to cutting off the beginning or the end of a word to account for prefixes and suffixes.
Basically, the idea of stemming is to reduce derived words to their stem word.
Here is an example:
| Stem word | Before stemming |
| --- | --- |
| Hack | Hacked |
| Hack | Hacking |
| Cat | Catty |
| Cat | Catlike |
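Because CountVectorizer cannot stem on its own, a stemmer has to be plugged in from outside, for example through the tokenizer parameter. A minimal sketch using NLTK's PorterStemmer (assumes nltk is installed; the whitespace split is a deliberate simplification of real tokenization):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stemming_tokenizer(text):
    # Crude whitespace split for illustration; real code would tokenize
    # properly before stemming each token.
    return [stemmer.stem(token) for token in text.split()]

vectorizer = CountVectorizer(tokenizer=stemming_tokenizer)
counts = vectorizer.fit_transform(["hacking hacked hack"])
print(vectorizer.get_feature_names_out())  # all three forms collapse to 'hack'
```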
Similar to stemming, CountVectorizer cannot perform lemmatization of the source text on its own either. Lemmatization refers to the morphological analysis of words: it removes inflectional endings and returns the root word, which is called the lemma. Thus, in a way, lemmatization and stemming are closely related to each other:
| Root word | Un-lemmatized word |
| --- | --- |
| good | better |
| good | best |
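As with stemming, lemmatization has to come from an external library if it is needed; NLTK's WordNetLemmatizer is one common choice. A minimal sketch (requires the WordNet corpus, which the first line downloads):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
# pos="a" tells WordNet to treat the word as an adjective.
print(lemmatizer.lemmatize("better", pos="a"))  # expected: good
print(lemmatizer.lemmatize("best", pos="a"))    # expected: good
```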
CountVectorizer creates bag-of-words features when the n-gram range is set to (1, 1). Depending on the range we provide, it can also generate bigrams, trigrams, and so on, as sketched below.
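A minimal sketch of the effect of ngram_range (the single toy sentence is our own):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]

# ngram_range=(1, 1) yields a plain bag of words (unigrams only).
bow = CountVectorizer(ngram_range=(1, 1)).fit(docs)
print(bow.get_feature_names_out())

# ngram_range=(1, 2) adds bigrams alongside the unigrams.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(bigrams.get_feature_names_out())
```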
Next, we import sklearn's pipeline utilities, which let us chain these steps together:
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
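A minimal sketch of how these utilities fit together, chaining the vectorizer into a classifier; the MultinomialNB model and the two-document corpus are placeholder assumptions, not necessarily the chapter's choices:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# make_pipeline chains the vectorizer and the classifier so that raw
# text goes in one end and predictions come out the other.
model = make_pipeline(CountVectorizer(), MultinomialNB())

train_texts = ["free money now", "meeting at noon"]  # placeholder corpus
train_labels = [1, 0]                                # placeholder labels

model.fit(train_texts, train_labels)
print(model.predict(["free meeting"]))
```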