We have seen the power of scikit-learn in this book, and this chapter will be no different. Let's import the CountVectorizer module to quickly count the occurrences of phrases in our text:
# The CountVectorizer is from sklearn's text feature extraction module
# The feature extraction module as a whole contains many tools built for extracting features from data
# Earlier, we manually extracted features by applying functions such as num_caps, special_characters, and so on
# The CountVectorizer specifically is built to quickly count occurrences of phrases within pieces of text
from sklearn.feature_extraction.text import CountVectorizer
We will start by simply creating an instance of CountVectorizer with two specific parameters. We will set the analyzer to char so that we count phrases of characters rather than words. ngram_range will be set to (1, 1) to grab only single-character occurrences:
one_cv = CountVectorizer(ngram_range=(1, 1), analyzer='char')
# We call the fit_transform method to learn the vocabulary and then
# transform our text series into a matrix, which we will call one_char
# Previously we created a matrix of quantitative data by applying our own functions; now we are creating numerical matrices using sklearn
one_char = one_cv.fit_transform(text)
# Note it is a sparse matrix
# there are 70 unique chars (number of columns)
<1048485x70 sparse matrix of type '<type 'numpy.int64'>' with 6935190 stored elements in Compressed Sparse Row format>
Note the number of rows reflects the number of passwords we have been working with, and the 70 columns reflect the 70 different and unique characters found in the corpus:
# we can peek into the learned vocabulary of the CountVectorizer by calling its vocabulary_ attribute
# the keys are the learned phrases while the values represent a unique index used by the CV to keep track of the vocab
one_cv.vocabulary_
{u'\r': 0, u' ': 1, u'!': 2, u'"': 3, u'#': 4, u'$': 5, u'%': 6, u'&': 7, u"'": 8, u'(': 9, u')': 10, u'*': 11, u'+': 12, u',': 13, u'-': 14, u'.': 15, u'/': 16, u'0': 17, u'1': 18, u'2': 19, u'3': 20, u'4': 21, u'5': 22, u'6': 23, u'7': 24, u'8': 25, u'9': 26, u':': 27, u';': 28, u'<': 29, u'=': 30, ...
# Note that it automatically lowercases!
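As a quick check (a sketch, not from the original text), we can confirm the vocabulary size and the lowercasing behavior directly:
# 70 unique characters, and only the lowercase form of each letter is present
print len(one_cv.vocabulary_)    # == 70
print 'a' in one_cv.vocabulary_  # True
print 'A' in one_cv.vocabulary_  # False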
We have all of these characters including letters, punctuation, and more. We should also note that there are no capital letters found anywhere in this vocabulary; this is due to the CountVectorizer auto-lowercase feature. Let's follow the same procedure, but this time, let's turn off the auto-lowercase feature that comes with CountVectorizer:
# now with lowercase=False, this way we will not force the lowercasing of characters
one_cv = CountVectorizer(ngram_range=(1, 1), analyzer='char', lowercase=False)
one_char = one_cv.fit_transform(text)
one_char
# there are now 96 unique chars (number of columns): 26 more letters :)
<1048485x96 sparse matrix of type '<type 'numpy.int64'>' with 6955519 stored elements in Compressed Sparse Row format>
We get the following output:
one_cv.vocabulary_
{u'\r': 0, u' ': 1, u'!': 2, u'"': 3, u'#': 4, u'$': 5, u'%': 6, u'&': 7, u"'": 8, u'(': 9, u')': 10, u'*': 11, u'+': 12, u',': 13, u'-': 14, u'.': 15, u'/': 16, u'0': 17, u'1': 18, u'2': 19, u'3': 20,
.....
Our capital letters are now included, which is evident from the 26 additional entries (70 up to 96) in our vocabulary attribute. We can use our fitted vectorizer to transform new pieces of text, as shown:
# transforming a new password
pd.DataFrame(one_cv.transform(['qwerty123!!!']).toarray(), columns=one_cv.get_feature_names())
# a fitted vectorizer cannot learn new vocabulary; if we introduce a new character, it simply won't matter
The following shows the output:
|   | \r | (space) | ! | " | # | $ | % | & | ' | ( | ... | u | v | w | x | y | z | { | \| | } | ~ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
It is important to remember that once a vectorizer is fit, it cannot learn new vocabulary; for example:
print "~" in one_cv.vocabulary_
True
print "D" in one_cv.vocabulary_
True
print "\t" in one_cv.vocabulary_
False
# transforming a new password (adding \t [the tab character] into the mix)
pd.DataFrame(one_cv.transform(['qw\terty123!!!']).toarray(), columns=one_cv.get_feature_names())
We get the following output:
|   | \r | (space) | ! | " | # | $ | % | & | ' | ( | ... | u | v | w | x | y | z | { | \| | } | ~ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
We end up with the same matrix even though the second password had a new character in it. Let's expand our universe by allowing for up to five-character phrases. This will count occurrences of unique one-, two-, three-, four-, and five-character phrases now. We should expect to see our vocabulary explode:
# now let's count all 1, 2, 3, 4, and 5 character phrases
five_cv = CountVectorizer(ngram_range=(1, 5), analyzer='char')
five_char = five_cv.fit_transform(text)
five_char
# there are 2,570,934 unique phrases of up to 5 consecutive characters
<1048485x2570934 sparse matrix of type '<type 'numpy.int64'>' with 31053193 stored elements in Compressed Sparse Row format>
We went from 70 columns (we left auto-lowercasing on) to 2,570,934 columns:
# much larger vocabulary!
five_cv.vocabulary_
{u'uer24': 2269299, u'uer23': 2269298, u'uer21': 2269297, u'uer20': 2269296, u'a4uz5': 640686, u'rotai': 2047903, u'hd20m': 1257873, u'i7n5': 1317982, u'fkhb8': 1146472, u'juy9f': 1460014, u'xodu': 2443742, u'xodt': 2443740,
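To see exactly which phrases a single password contributes, we can borrow the vectorizer's analyzer (a quick sketch, not from the original text):
# build_analyzer returns the function the vectorizer uses to split text into character n-grams
analyze = five_cv.build_analyzer()
print analyze('abc12')
# every contiguous slice of 1 to 5 characters: 'a', 'b', ..., 'ab', 'bc', ..., 'abc12'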
We will turn off lowercasing to see how many unique phrases we get:
# the same 1, 2, 3, 4, and 5 character phrases, this time without lowercasing
five_cv_lower = CountVectorizer(ngram_range=(1, 5), analyzer='char', lowercase=False)
five_char_lower = five_cv_lower.fit_transform(text)
five_char_lower
# there are 2,922,297 unique phrases of up to 5 consecutive characters
<1048485x2922297 sparse matrix of type '<type 'numpy.int64'>' with 31080917 stored elements in Compressed Sparse Row format>
With lowercasing off, our vocabulary grows to 2,922,297 items. We will use this data to extract the most common phrases of up to five characters in our corpus. Note that this is different from our value_counts from before: previously, we were counting the most common whole passwords, whereas now we are counting the most common phrases that occur within the passwords:
# let's grab the most common five char "phrases"
# we will accomplish this by using numpy to do some quick math
import numpy as np
# first we will sum across the rows of our data to get the total count of phrases
summed_features = np.sum(five_char, axis=0)
print summed_features.shape  # == (1, 2570934)
# we will then sort the summed_features variable and grab the 20 most common phrases' indices in the CV's vocabulary
top_20 = np.argsort(summed_features)[:,-20:]
top_20  # shape (1, 20): the indices of the 20 phrases with the largest totals
matrix([[1619465, 2166552, 1530799, 1981845, 2073035, 297134, 457130, 406411, 1792848, 352276, 1696853, 562360, 508193, 236639, 1308517, 994777, 36326, 171634, 629003, 100177]])
This gives us the indices (from 0 to 2570933) of the most-commonly occurring phrases that are up to five characters. To see the actual phrases, let's plug them into the get_feature_names method of our CountVectorizer, as shown:
# plug these into the features of the CV.
# argsort sorts in ascending order, so the last entry, '1', is the most common phrase, followed by 'a'
np.array(five_cv.get_feature_names())[top_20]
array([[u'm', u't', u'l', u'r', u's', u'4', u'7', u'6', u'o', u'5', u'n', u'9', u'8', u'3', u'i', u'e', u'0', u'2', u'a', u'1']], dtype='<U5')
Unsurprisingly, the most common one- to five-character phrases are single characters (letters and numbers). Let's expand to see the most common 50 phrases:
# top 50 phrases
np.array(five_cv.get_feature_names())[np.argsort(summed_features)[:,-50:]]
array([[u'13', u'98', u'ng', u'21', u'01', u'er', u'in', u'20', u'10', u'x', u'11', u'v', u'23', u'00', u'19', u'z', u'an', u'j', u'w', u'f', u'12', u'p', u'y', u'b', u'k', u'g', u'h', u'c', u'd', u'u', u'm', u't', u'l', u'r', u's', u'4', u'7', u'6', u'o', u'5', u'n', u'9', u'8', u'3', u'i', u'e', u'0', u'2', u'a', u'1']], dtype='<U5')
Now we start to see two-character phrases. Let's expand even more to the top 100 phrases:
# top 100 phrases
np.array(five_cv.get_feature_names())[np.argsort(summed_features)[:,-100:]]
array([[u'61', u'33', u'50', u'07', u'18', u'41', u'198', u'09', u'el', u'80', u'lo', u'05', u're', u'ch', u'ia', u'03', u'90', u'89', u'91', u'08', u'32', u'56', u'81', u'16', u'25', u'la', u'le', u'51', u'as', u'34', u'al', u'45', u'ra', u'30', u'14', u'15', u'02', u'ha', u'99', u'52', u'li', u'88', u'31', u'22', u'on', u'123', u'ma', u'en', u'ar', u'q', u'13', u'98', u'ng', u'21', u'01', u'er', u'in', u'20', u'10', u'x', u'11', u'v', u'23', u'00', u'19', u'z', u'an', u'j', u'w', u'f', u'12', u'p', u'y', u'b', u'k', u'g', u'h', u'c', u'd', u'u', u'm', u't', u'l', u'r', u's', u'4', u'7', u'6', u'o', u'5', u'n', u'9', u'8', u'3', u'i', u'e', u'0', u'2', u'a', u'1']], dtype='<U5')
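If we also want the raw counts alongside the 20 most common phrases, we can index the summed totals directly (a quick sketch, not from the original text):
# pair each of the top 20 phrases with its total count across the corpus
counts = np.asarray(summed_features).ravel()
top_idx = np.asarray(top_20).ravel()
for phrase, count in zip(np.array(five_cv.get_feature_names())[top_idx], counts[top_idx]):
    print phrase, count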
To get a more sensible view of the phrases used in passwords, let's make a new vectorizer with lowercase set to False and ngram_range set to (4, 7). This avoids single-character phrases and gives us more context into the kinds of themes that occur in the most common passwords:
seven_cv = CountVectorizer(ngram_range=(4, 7), analyzer='char', lowercase=False)
seven_char = seven_cv.fit_transform(text)
seven_char
<1048485x7309977 sparse matrix of type '<type 'numpy.int64'>' with 16293052 stored elements in Compressed Sparse Row format>
With our vectorizer built and fit, let's use it to grab the 100 most common four- to seven-character phrases:
summed_features = np.sum(seven_char, axis=0)
# top 100 tokens of length 4-7
np.array(seven_cv.get_feature_names())[np.argsort(summed_features)[:,-100:]]
array([[u'1011', u'star', u'56789', u'g123', u'ming', u'long', u'ang1', u'2002', u'3123', u'ing1', u'201314', u'2003', u'1992', u'2004', u'1122', u'ling', u'2001', u'20131', u'woai', u'lian', u'feng', u'2345678', u'1212', u'1101', u'01314', u'o123', u'345678', u'ever', u's123', u'uang', u'1010', u'1980', u'huan', u'i123', u'king', u'mari', u'2005', u'hong', u'6789', u'1981', u'00000', u'45678', u'2013', u'11111', u'1991', u'1231', u'ilove', u'admin', u'ilov', u'ange', u'2006', u'0131', u'admi', u'heng', u'1234567', u'5201', u'e123', u'234567', u'dmin', u'pass', u'8888', u'34567', u'zhang', u'jian', u'2007', u'5678', u'1982', u'2000', u'zhan', u'yang', u'n123', u'1983', u'4567', u'1984', u'1990', u'a123', u'2009', u'ster', u'1985', u'iang', u'2008', u'2010', u'xiao', u'chen', u'hang', u'wang', u'1986', u'1111', u'1989', u'0000', u'1988', u'1987', u'1314', u'love', u'123456', u'23456', u'3456', u'12345', u'2345', u'1234']], dtype='<U7')
Words and numbers stick out immediately, such as the following:
- pass, 1234, 56789 (easy phrases to remember)
- 1980, 1991, 1992, 2003, 2004, and so on (likely years of birth)
- ilove, love
- yang, zhan, hong (names)
To get an even better sense of interesting phrases, let's use the TF-IDF vectorizer in scikit-learn to isolate rare phrases that are interesting and, therefore, likely better to use in passwords:
# Term Frequency-Inverse Document Frequency (TF-IDF)
# What: Computes "relative frequency" of a word that appears in a document compared to its frequency across all documents
# Why: More useful than "term frequency" for identifying "important" words/phrases in each document (high frequency in that document, low frequency in other documents)
from sklearn.feature_extraction.text import TfidfVectorizer
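As a rough illustration of the weighting (a minimal sketch, not from the original text, assuming scikit-learn's default smoothed IDF and L2 row normalization), a phrase's score grows with its count in the password and shrinks with the number of passwords that contain it:
# tf-idf sketch: term frequency scaled by a smoothed inverse document frequency
#   idf(t) = ln((1 + n_documents) / (1 + df(t))) + 1
#   tfidf(t, d) = count(t, d) * idf(t), then each row is L2-normalized
import numpy as np
toy_corpus = ['abc123', 'abc', '123']        # three tiny example "passwords"
term = '123'
tf = toy_corpus[0].count(term)               # how often the phrase appears in the first password
df = sum(term in doc for doc in toy_corpus)  # how many passwords contain the phrase
idf = np.log((1. + len(toy_corpus)) / (1. + df)) + 1.
print tf * idf  # the un-normalized tf-idf weight of '123' in 'abc123'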
We will begin by creating a vectorizer similar to the CountVectorizer we made earlier. ngram_range will be set to (1, 1) and the analyzer will be char:
one_tv = TfidfVectorizer(ngram_range=(1, 1), analyzer='char')
# once we instantiate the vectorizer, we call the fit_transform method to learn the vocabulary and then
# transform our text series into a brand new matrix called one_char_tf
# Previously we created a matrix of quantitative data by applying our own functions; now we are creating
# numerical matrices using sklearn
one_char_tf = one_tv.fit_transform(text)
# same shape as CountVectorizer
one_char_tf
<1048485x70 sparse matrix of type '<type 'numpy.float64'>' with 6935190 stored elements in Compressed Sparse Row format>
Let's use this new vectorizer to transform qwerty123:
# transforming a new password
pd.DataFrame(one_tv.transform(['qwerty123']).toarray(), columns=one_tv.get_feature_names())
We get the following output:
|   | \r | (space) | ! | " | # | $ | % | & | ' | ( | ... | u | v | w | x | y | z | { | \| | } | ~ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.408704 | 0.0 | 0.369502 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
The values in the table are no longer counts; they are calculations based on relative frequency. Higher values indicate one or both of the following:
- Used frequently in this password
- Used infrequently throughout the corpus of passwords
Let's build a more complex vectorizer with phrases learned up to five characters:
# make a five-char TfidfVectorizer
five_tv = TfidfVectorizer(ngram_range=(1, 5), analyzer='char')
five_char_tf = five_tv.fit_transform(text)
# same shape as CountVectorizer
five_char_tf
<1048485x2570934 sparse matrix of type '<type 'numpy.float64'>' with 31053193 stored elements in Compressed Sparse Row format>
Let's use this new vectorizer to transform the simple abc123 password:
# Let's see some tfidf values of passwords
# store the feature names as a numpy array
features = np.array(five_tv.get_feature_names())
# transform a very simple password
abc_transformed = five_tv.transform(['abc123'])
# grab the non-zero features, that is, the n-grams that actually appear in the password
features[abc_transformed.nonzero()[1]]
array([u'c123', u'c12', u'c1', u'c', u'bc123', u'bc12', u'bc1', u'bc', u'b', u'abc12', u'abc1', u'abc', u'ab', u'a', u'3', u'23', u'2', u'123', u'12', u'1'], dtype='<U5')
We will look at the non-zero tfidf scores, as shown:
# grab the non zero tfidf scores of the features
abc_transformed[abc_transformed.nonzero()]
matrix([[0.28865293, 0.27817216, 0.23180301, 0.10303378, 0.33609531, 0.33285593, 0.31079987, 0.23023187, 0.11165455, 0.33695385, 0.31813905, 0.25043863, 0.18481603, 0.07089031, 0.08285116, 0.13324432, 0.07449711, 0.15211427, 0.12089443, 0.06747844]])
# put them together in a DataFrame
pd.DataFrame(abc_transformed[abc_transformed.nonzero()],
columns=features[abc_transformed.nonzero()[1]])
Running the preceding code yields a table in which the phrase 1 has a TF-IDF score of 0.067478 while bc123 has a score of 0.336095, implying that bc123 is more interesting than 1, which makes sense.
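To make that ranking explicit, we could sort the scores from most to least interesting (a quick sketch, not from the original text):
# transpose to one row per phrase, then sort by the tf-idf score
abc_scores = pd.DataFrame(abc_transformed[abc_transformed.nonzero()],
                          columns=features[abc_transformed.nonzero()[1]])
print abc_scores.T.sort_values(by=0, ascending=False).head()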
# Let's repeat the process with a slightly better password
password_transformed = five_tv.transform(['sdf%ERF'])
# grab the non zero features
features[password_transformed.nonzero()[1]]
# grab the non zero tfidf scores of the features
password_transformed[password_transformed.nonzero()]
# put them together in a DataFrame
pd.DataFrame(password_transformed[password_transformed.nonzero()], columns=features[password_transformed.nonzero()[1]])
Running the preceding code yields a table in which the TF-IDF value of %er (0.453607) is larger than the value that 123 received in the previous password (0.152114). This implies that %er is more interesting because it occurs less often across the entire corpus. Also note that the TF-IDF value of %er is larger than any value found for abc123, implying that this phrase alone is more interesting than anything found in abc123.
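Before moving on, one crude way to summarize this idea is to score a password by the largest TF-IDF value among its phrases (an illustrative sketch, not the book's method; the max_tfidf_score helper is hypothetical):
# hypothetical helper: a higher max tf-idf suggests the password contains rarer phrases
def max_tfidf_score(password, vectorizer=five_tv):
    transformed = vectorizer.transform([password])
    return transformed.max() if transformed.nnz else 0.0

print max_tfidf_score('abc123')   # lower, built from very common phrases
print max_tfidf_score('sdf%ERF')  # higher, built from rarer phrases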
Let's take all of this a step further and introduce a mathematical function called the cosine similarity to judge the strength of new passwords that haven't been seen before.