Unstructured text data, like the contents of a book or a tweet, is both one of the most interesting sources of features and one of the most complex to handle. In this chapter, we will cover strategies for transforming text into information-rich features. This is not to say that the recipes covered here are comprehensive. There exist entire academic disciplines focused on handling this and similar types of data, and the contents of all their techniques would fill a small library. Despite this, there are some commonly used techniques, and a knowledge of these will add valuable tools to our preprocessing toolbox.
Most basic text cleaning operations require only Python's core string operations, in particular strip, replace, and split:
```python
# Create text
text_data = [" Interrobang. By Aishwarya Henriette ",
             "Parking And Going. By Karl Gautier",
             " Today Is The night. By Jarek Prakash "]

# Strip whitespaces
strip_whitespace = [string.strip() for string in text_data]

# Show text
strip_whitespace
```
['Interrobang. By Aishwarya Henriette', 'Parking And Going. By Karl Gautier', 'Today Is The night. By Jarek Prakash']
```python
# Remove periods
remove_periods = [string.replace(".", "") for string in strip_whitespace]

# Show text
remove_periods
```
['Interrobang By Aishwarya Henriette', 'Parking And Going By Karl Gautier', 'Today Is The night By Jarek Prakash']
We also create and apply a custom transformation function:
```python
# Create function
def capitalizer(string: str) -> str:
    return string.upper()

# Apply function
[capitalizer(string) for string in remove_periods]
```
['INTERROBANG BY AISHWARYA HENRIETTE', 'PARKING AND GOING BY KARL GAUTIER', 'TODAY IS THE NIGHT BY JAREK PRAKASH']
Finally, we can use regular expressions to perform powerful string operations:
```python
# Import library
import re

# Create function
def replace_letters_with_X(string: str) -> str:
    return re.sub(r"[a-zA-Z]", "X", string)

# Apply function
[replace_letters_with_X(string) for string in remove_periods]
```
['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX', 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX', 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']
Most text data will need to be cleaned before we can use it to build
features. Most basic text cleaning can be completed using Python’s
standard string operations. In the real world we will most likely define
a custom cleaning function (e.g., capitalizer) combining some
cleaning tasks and apply that to the text data.
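For instance, a minimal sketch of such a combined cleaner (the function name and the exact cleaning steps are illustrative assumptions, reusing the text_data list from above) might look like this:

```python
# Hypothetical combined cleaning function: strip whitespace,
# remove periods, and uppercase in a single pass
def clean_string(string: str) -> str:
    string = string.strip()            # remove leading/trailing whitespace
    string = string.replace(".", "")   # remove periods
    return string.upper()              # capitalize everything

# Apply the cleaner to the raw text data
[clean_string(string) for string in text_data]
```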
Use Beautiful Soup’s extensive set of options to parse and extract from HTML:
```python
# Load library
from bs4 import BeautifulSoup

# Create some HTML code
html = """<div class='full_name'><span style='font-weight:bold'>Masego</span> Azra</div>"""

# Parse html
soup = BeautifulSoup(html, "lxml")

# Find the div with the class "full_name", show text
soup.find("div", {"class": "full_name"}).text
```
'Masego Azra'
Despite the strange name, Beautiful Soup is a powerful Python library designed for scraping HTML. Typically Beautiful Soup is used to scrape live websites, but we can just as easily use it to extract text data embedded in HTML. The full range of Beautiful Soup operations is beyond the scope of this book, but even the few methods used in our solution show how easily we can parse HTML code to extract the data we want.
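As one more small illustration, Beautiful Soup also supports CSS selectors through its select method; this sketch reuses the soup object from above (the selector itself is just an assumed example):

```python
# Use a CSS selector to find the bolded span inside the div
# with class "full_name" and show its text
soup.select("div.full_name span")[0].text
```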
Define a function that uses translate with a dictionary of punctuation
characters:
```python
# Load libraries
import unicodedata
import sys

# Create text
text_data = ['Hi!!!! I. Love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right?!?!']

# Create a dictionary of punctuation characters
punctuation = dict.fromkeys(i for i in range(sys.maxunicode)
                            if unicodedata.category(chr(i)).startswith('P'))

# For each string, remove any punctuation characters
[string.translate(punctuation) for string in text_data]
```
['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
translate is a Python method popular due to its blazing speed. In our
solution, first we created a dictionary, punctuation, with all
punctuation characters according to Unicode as its keys and None as
its values. Next we translated all characters in the string that are in
punctuation into None, effectively removing them. There are more
readable ways to remove punctuation, but this somewhat hacky
solution has the advantage of being far faster than alternatives.
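For comparison, one of those more readable (but generally slower) alternatives is a regular expression. This is only a sketch, and the pattern below is not an exact equivalent of the Unicode punctuation categories used above:

```python
# Load library
import re

# Remove anything that is not a word character or whitespace
[re.sub(r"[^\w\s]", "", string) for string in text_data]
```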
It is important to be conscious of the fact that punctuation contains information (e.g., “Right?” versus “Right!”). Removing punctuation is often a necessary evil to create features; however, if the punctuation is important we should make sure to take that into account.
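One hedged, purely illustrative way to take punctuation into account is to capture it as its own feature before stripping it, for example by counting exclamation marks:

```python
# Hypothetical example: record the number of exclamation marks
# in each string before punctuation is removed
exclamation_counts = [string.count("!") for string in text_data]
exclamation_counts
```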
Python's Natural Language Toolkit (NLTK) has a powerful set of text manipulation operations, including word tokenizing:
```python
# Load library
from nltk.tokenize import word_tokenize

# Create text
string = "The science of today is the technology of tomorrow"

# Tokenize words
word_tokenize(string)
```
['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']
We can also tokenize into sentences:
```python
# Load library
from nltk.tokenize import sent_tokenize

# Create text
string = "The science of today is the technology of tomorrow. Tomorrow is today."

# Tokenize sentences
sent_tokenize(string)
```
['The science of today is the technology of tomorrow.', 'Tomorrow is today.']
Tokenization, especially word tokenization, is a common task after cleaning text data because it is the first step in the process of turning the text into data we will use to construct useful features.
Use NLTK’s stopwords:
```python
# Load library
from nltk.corpus import stopwords

# You will have to download the set of stop words the first time
# import nltk
# nltk.download('stopwords')

# Create word tokens
tokenized_words = ['i', 'am', 'going', 'to', 'go', 'to', 'the', 'store', 'and', 'park']

# Load stop words
stop_words = stopwords.words('english')

# Remove stop words
[word for word in tokenized_words if word not in stop_words]
```
['going', 'go', 'store', 'park']
While “stop words” can refer to any set of words we want to remove before processing, frequently the term refers to extremely common words that themselves contain little information value. NLTK has a list of common stop words that we can use to find and remove stop words in our tokenized words:
```python
# Show stop words
stop_words[:5]
```
['i', 'me', 'my', 'myself', 'we']
Note that NLTK’s stopwords assumes the tokenized words are all
lowercased.
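Because of that, a safe pattern (a sketch assuming the tokens may contain uppercase letters, reusing stop_words from above) is to lowercase each token before filtering:

```python
# Lowercase tokens before comparing against the stop word list
tokenized_words = ['I', 'am', 'going', 'to', 'go', 'to', 'the', 'store', 'and', 'park']
[word for word in tokenized_words if word.lower() not in stop_words]
```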
Use NLTK's PorterStemmer:

```python
# Load library
from nltk.stem.porter import PorterStemmer

# Create word tokens
tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

# Create stemmer
porter = PorterStemmer()

# Apply stemmer
[porter.stem(word) for word in tokenized_words]
```
['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']
Stemming reduces a word to its stem by identifying and removing affixes
(e.g., gerunds) while keeping the root meaning of the word. For example,
both “tradition” and “traditional” have “tradit” as their stem, indicating that while they are different words they represent the same
general concept. By stemming our text data, we transform it to something
less readable, but closer to its base meaning and thus more suitable for
comparison across observations. NLTK’s PorterStemmer implements the
widely used Porter stemming algorithm to remove or replace common
suffixes to produce the word stem.
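To see the claim about "tradition" and "traditional" directly, we can stem both words (reusing the porter object from above):

```python
# Both words reduce to the same stem, "tradit"
porter.stem("tradition"), porter.stem("traditional")
```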
Use NLTK’s pre-trained parts-of-speech tagger:
```python
# Load libraries
from nltk import pos_tag
from nltk import word_tokenize

# Create text
text_data = "Chris loved outdoor running"

# Use pre-trained part of speech tagger
text_tagged = pos_tag(word_tokenize(text_data))

# Show parts of speech
text_tagged
```
[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]
The output is a list of tuples with the word and the tag of the part of speech. NLTK uses the Penn Treebank parts of speech tags. Some examples of the Penn Treebank tags are:
| Tag | Part of speech |
|---|---|
| NNP | Proper noun, singular |
| NN | Noun, singular or mass |
| RB | Adverb |
| VBD | Verb, past tense |
| VBG | Verb, gerund or present participle |
| JJ | Adjective |
| PRP | Personal pronoun |
Once the text has been tagged, we can use the tags to find certain parts of speech. For example, here are all nouns:
```python
# Filter words
[word for word, tag in text_tagged if tag in ['NN', 'NNS', 'NNP', 'NNPS']]
```
['Chris']
A more realistic situation would be that we have data where every
observation contains a tweet and we want to convert those sentences into
features for individual parts of speech (e.g., a feature with 1 if a proper noun is present, and 0 otherwise):
```python
# Load libraries
from nltk import pos_tag
from nltk import word_tokenize
from sklearn.preprocessing import MultiLabelBinarizer

# Create text
tweets = ["I am eating a burrito for breakfast",
          "Political science is an amazing field",
          "San Francisco is an awesome city"]

# Create list
tagged_tweets = []

# Tag each word and each tweet
for tweet in tweets:
    tweet_tag = pos_tag(word_tokenize(tweet))
    tagged_tweets.append([tag for word, tag in tweet_tag])

# Use one-hot encoding to convert the tags into features
one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)
```
array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
[1, 0, 1, 1, 0, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 0, 0, 0, 1]])
Using classes_ we can see that each feature is a part-of-speech tag:
```python
# Show feature names
one_hot_multi.classes_
```
array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'], dtype=object)
If our text is English and not on a specialized topic (e.g., medicine), the simplest solution is to use NLTK's pre-trained parts-of-speech
tagger. However, if pos_tag is not very accurate, NLTK also gives us the ability to train our own tagger. The major downside of training a tagger is that we need a large corpus of text where the tag of each word is known. Constructing this tagged corpus is obviously labor intensive and is probably going to be a last resort.
All that said, if we had a tagged corpus and wanted to train a tagger,
the following is an example of how we could do it. The corpus we are using is the Brown Corpus, one of the most popular sources of tagged text. Here we use a backoff n-gram tagger, where n is the number of previous words we take into account when predicting a word’s part-of-speech tag. First we take into account the previous two words using TrigramTagger; if two words are not present, we “back off” and take into account the tag of the previous one word using BigramTagger, and finally if that fails we only look at the word itself using UnigramTagger. To examine the accuracy of our tagger, we split our text data into two parts, train our tagger on one part, and test how well it predicts the tags of the second part:
```python
# Load library
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

# Get some text from the Brown Corpus, broken into sentences
sentences = brown.tagged_sents(categories='news')

# Split into 4000 sentences for training and 623 for testing
train = sentences[:4000]
test = sentences[4000:]

# Create backoff tagger
unigram = UnigramTagger(train)
bigram = BigramTagger(train, backoff=unigram)
trigram = TrigramTagger(train, backoff=bigram)

# Show accuracy
trigram.evaluate(test)
```
0.8179229731754832
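Once trained, the backoff tagger can be applied to new tokenized text much like pos_tag. A brief sketch (the sentence is an arbitrary example; note that the Brown Corpus uses its own tagset rather than the Penn Treebank tags):

```python
# Load library
from nltk import word_tokenize

# Tag a new sentence with the trained backoff tagger
trigram.tag(word_tokenize("The economy is improving"))
```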
Use scikit-learn’s CountVectorizer:
```python
# Load library
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])

# Create the bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

# Show feature matrix
bag_of_words
```
<3x8 sparse matrix of type '<class 'numpy.int64'>'
with 8 stored elements in Compressed Sparse Row format>
This output is a sparse array, which is often necessary when we have a
large amount of text. However, in our toy example we can use toarray
to view a matrix of word counts for each observation:
```python
bag_of_words.toarray()
```
array([[0, 0, 0, 2, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)
We can use the get_feature_names method to view the word associated with each feature:
```python
# Show feature names
count.get_feature_names()
```
['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']
This might be confusing, so for the sake of clarity here is what the feature matrix looks like with the words as column names (each row is one observation):
| beats | best | both | brazil | germany | is | love | sweden |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
One of the most common methods of transforming text into features is by
using a bag-of-words model. Bag-of-words models output a feature for
every unique word in text data, with each feature containing a count of
occurrences in observations. For example, in our solution the sentence
I love Brazil. Brazil! has a value of 2 in the “brazil” feature
because the word brazil appears two times.
The text data in our solution was purposely small. In the real world, a single observation of text data could be the contents of an entire book! Since our bag-of-words model creates a feature for every unique word in the data, the resulting matrix can contain thousands of features. This means that the size of the matrix can sometimes become very large in memory. However, luckily we can exploit a common characteristic of bag-of-words feature matrices to reduce the amount of data we need to store.
Most words likely do not occur in most observations, and therefore bag-of-words feature matrices will contain mostly 0s as values. We call these types of matrices “sparse.” Instead of storing all values of the matrix, we can only store nonzero values and then assume all other values are 0. This will save us memory when we have large feature matrices. One of the nice features of CountVectorizer is that the output is a sparse matrix by default.
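As a quick illustration, we can compare the number of explicitly stored nonzero entries to the total number of cells, using standard scipy sparse-matrix attributes on the bag_of_words matrix from above (in this toy example the saving is small, but for realistic corpora most cells are 0):

```python
# Number of explicitly stored (nonzero) values
print(bag_of_words.nnz)

# Total number of cells in the dense equivalent
total_cells = bag_of_words.shape[0] * bag_of_words.shape[1]
print(total_cells)

# Fraction of cells that are nonzero
print(bag_of_words.nnz / total_cells)
```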
CountVectorizer comes with a number of useful parameters to make
creating bag-of-words feature matrices easy. First, while by default
every feature is a word, that does not have to be the case. Instead we
can set every feature to be the combination of two words (called a
2-gram) or even three words (3-gram). ngram_range sets the minimum and
maximum size of our n-grams. For example, (2,3) will return all 2-grams and 3-grams. Second, we can easily remove low-information filler words using stop_words either with a built-in list or a custom list. Finally, we can restrict the words or phrases we want to consider to a certain list of words using vocabulary. For example, we could create a bag-of-words feature matrix for only occurrences of country names:
```python
# Create feature matrix with arguments
count_2gram = CountVectorizer(ngram_range=(1, 2),
                              stop_words="english",
                              vocabulary=['brazil'])
bag = count_2gram.fit_transform(text_data)

# View feature matrix
bag.toarray()
```
array([[2],
[0],
[0]])
```python
# View the 1-grams and 2-grams
count_2gram.vocabulary_
```
{'brazil': 0}
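Because vocabulary was restricted to 'brazil', only that single 1-gram appears above. A sketch without the vocabulary restriction shows the 2-grams that ngram_range=(1, 2) adds (the exact features depend on the stop-word filtering):

```python
# Same n-gram settings, but without restricting the vocabulary
count_ngram = CountVectorizer(ngram_range=(1, 2), stop_words="english")
count_ngram.fit_transform(text_data)

# View the learned 1-grams and 2-grams
count_ngram.vocabulary_
```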
Compare the frequency of the word in a document (a tweet, movie
review, speech transcript, etc.) with the frequency of the word in all
other documents using term frequency-inverse document frequency
(tf-idf). scikit-learn makes this easy with TfidfVectorizer:
```python
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])

# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

# Show tf-idf feature matrix
feature_matrix
```
<3x8 sparse matrix of type '<class 'numpy.float64'>'
with 8 stored elements in Compressed Sparse Row format>
Just as in Recipe 6.8, the output is a sparse matrix. However, if we want to view the output as a dense matrix, we can use .toarray:
```python
# Show tf-idf feature matrix as dense matrix
feature_matrix.toarray()
```
array([[ 0. , 0. , 0. , 0.89442719, 0. ,
0. , 0.4472136 , 0. ],
[ 0. , 0.57735027, 0. , 0. , 0. ,
0.57735027, 0. , 0.57735027],
[ 0.57735027, 0. , 0.57735027, 0. , 0.57735027,
0. , 0. , 0. ]])
vocabulary_ shows us the word associated with each feature:
```python
# Show feature names
tfidf.vocabulary_
```
{'beats': 0,
'best': 1,
'both': 2,
'brazil': 3,
'germany': 4,
'is': 5,
'love': 6,
'sweden': 7}
The more a word appears in a document, the more likely it is important to that document. For example, if the word economy appears frequently, it is evidence that the document might be about economics. We call this term frequency (tf).
In contrast, if a word appears in many documents, it is likely less important to any individual document. For example, if every document in some text data contains the word after then it is probably an unimportant word. We call this document frequency (df).
By combining these two statistics, we can assign a score to every word representing how important that word is in a document. Specifically, we multiply tf by the inverse document frequency (idf):

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

where $t$ is a word and $d$ is a document. There are a number of variations in how tf and idf are calculated. In scikit-learn, tf is simply the number of times a word appears in the document, and idf is calculated as:

$$\text{idf}(t) = \log{\frac{1 + n_d}{1 + \text{df}(d, t)}} + 1$$

where $n_d$ is the number of documents and $\text{df}(d, t)$ is term $t$'s document frequency (i.e., the number of documents in which the term appears).
By default, scikit-learn then normalizes the tf-idf vectors using the Euclidean norm (L2 norm). The higher the resulting value, the more important the word is to a document.
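As a sanity check on these formulas, a short sketch reproduces the 0.894 value for "brazil" in the first document by hand, using the idf values that TfidfVectorizer exposes through its idf_ attribute (and the fitted tfidf object from above):

```python
# Load library
import numpy as np

# Raw term counts for the first document: "brazil" appears twice, "love" once
tf = np.array([2, 1])

# idf values learned by the vectorizer for "brazil" and "love"
idf = tfidf.idf_[[tfidf.vocabulary_['brazil'], tfidf.vocabulary_['love']]]

# Multiply tf by idf, then apply the default L2 normalization
tfidf_raw = tf * idf
tfidf_raw / np.linalg.norm(tfidf_raw)
```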