Table of Contents for
Hands-On Machine Learning for Cybersecurity


Hands-On Machine Learning for Cybersecurity by Sinan Ozdemir, published by Packt Publishing, 2018
  1. Hands-on Machine Learning for Cybersecurity
  2. Title Page
  3. Copyright and Credits
  4. Hands-On Machine Learning for Cybersecurity
  5. About Packt
  6. Why subscribe?
  7. Packt.com
  8. Contributors
  9. About the authors
  10. About the reviewers
  11. Packt is searching for authors like you
  12. Table of Contents
  13. Preface
  14. Who this book is for
  15. What this book covers
  16. To get the most out of this book
  17. Download the example code files
  18. Download the color images
  19. Conventions used
  20. Get in touch
  21. Reviews
  22. Basics of Machine Learning in Cybersecurity
  23. What is machine learning?
  24. Problems that machine learning solves
  25. Why use machine learning in cybersecurity?
  26. Current cybersecurity solutions
  27. Data in machine learning
  28. Structured versus unstructured data
  29. Labelled versus unlabelled data
  30. Machine learning phases
  31. Inconsistencies in data
  32. Overfitting
  33. Underfitting
  34. Different types of machine learning algorithm
  35. Supervised learning algorithms
  36. Unsupervised learning algorithms 
  37. Reinforcement learning
  38. Another categorization of machine learning
  39. Classification problems
  40. Clustering problems
  41. Regression problems
  42. Dimensionality reduction problems
  43. Density estimation problems
  44. Deep learning
  45. Algorithms in machine learning
  46. Support vector machines
  47. Bayesian networks
  48. Decision trees
  49. Random forests
  50. Hierarchical algorithms
  51. Genetic algorithms
  52. Similarity algorithms
  53. ANNs
  54. The machine learning architecture
  55. Data ingestion
  56. Data store
  57. The model engine
  58. Data preparation 
  59. Feature generation
  60. Training
  61. Testing
  62. Performance tuning
  63. Mean squared error
  64. Mean absolute error
  65. Precision, recall, and accuracy
  66. How can model performance be improved?
  67. Fetching the data to improve performance
  68. Switching machine learning algorithms
  69. Ensemble learning to improve performance
  70. Hands-on machine learning
  71. Python for machine learning
  72. Comparing Python 2.x with 3.x 
  73. Python installation 
  74. Python interactive development environment
  75. Jupyter Notebook installation
  76. Python packages
  77. NumPy
  78. SciPy
  79. Scikit-learn 
  80. pandas
  81. Matplotlib
  82. MongoDB with Python
  83. Installing MongoDB
  84. PyMongo
  85. Setting up the development and testing environment
  86. Use case
  87. Data
  88. Code
  89. Summary
  90. Time Series Analysis and Ensemble Modeling
  91. What is a time series?
  92. Time series analysis
  93. Stationarity of time series models
  94. Strictly stationary process
  95. Correlation in time series
  96. Autocorrelation
  97. Partial autocorrelation function
  98. Classes of time series models
  99. Stochastic time series model
  100. Artificial neural network time series model
  101.  Support vector time series models
  102. Time series components
  103. Systematic models
  104. Non-systematic models
  105. Time series decomposition
  106. Level 
  107. Trend 
  108. Seasonality 
  109. Noise 
  110. Use cases for time series
  111. Signal processing
  112. Stock market predictions
  113. Weather forecasting
  114. Reconnaissance detection
  115. Time series analysis in cybersecurity
  116. Time series trends and seasonal spikes
  117. Detecting distributed denial of service with time series
  118. Dealing with the time element in time series
  119. Tackling the use case
  120. Importing packages
  121. Importing data in pandas
  122. Data cleansing and transformation
  123. Feature computation
  124. Predicting DDoS attacks
  125. ARMA
  126. ARIMA
  127. ARFIMA
  128. Ensemble learning methods
  129. Types of ensembling
  130. Averaging
  131. Majority vote
  132. Weighted average
  133. Types of ensemble algorithm
  134. Bagging
  135. Boosting
  136. Stacking
  137. Bayesian parameter averaging
  138. Bayesian model combination
  139. Bucket of models
  140. Cybersecurity with ensemble techniques
  141. Voting ensemble method to detect cyber attacks
  142. Summary
  143. Segregating Legitimate and Lousy URLs
  144. Introduction to the types of abnormalities in URLs
  145. URL blacklisting
  146. Drive-by download URLs
  147. Command and control URLs
  148. Phishing URLs
  149. Using heuristics to detect malicious pages
  150. Data for the analysis
  151. Feature extraction
  152. Lexical features
  153. Web-content-based features
  154. Host-based features
  155. Site-popularity features
  156. Using machine learning to detect malicious URLs 
  157. Logistic regression to detect malicious URLs
  158. Dataset
  159. Model
  160. TF-IDF
  161. SVM to detect malicious URLs
  162. Multiclass classification for URL classification
  163. One-versus-rest
  164. Summary
  165. Knocking Down CAPTCHAs
  166. Characteristics of CAPTCHA
  167. Using artificial intelligence to crack CAPTCHA
  168. Types of CAPTCHA
  169. reCAPTCHA
  170. No CAPTCHA reCAPTCHA
  171. Breaking a CAPTCHA
  172. Solving CAPTCHAs with a neural network
  173. Dataset 
  174. Packages
  175. Theory of CNN
  176. Model
  177. Code
  178. Training the model
  179. Testing the model 
  180. Summary
  181. Using Data Science to Catch Email Fraud and Spam
  182. Email spoofing 
  183. Bogus offers
  184. Requests for help
  185. Types of spam emails
  186. Deceptive emails
  187. CEO fraud
  188. Pharming 
  189. Dropbox phishing
  190. Google Docs phishing
  191. Spam detection
  192. Types of mail servers 
  193. Data collection from mail servers
  194. Using the Naive Bayes theorem to detect spam
  195. Laplace smoothing
  196. Featurization techniques that convert text-based emails into numeric values
  197. Log-space
  198. TF-IDF
  199. N-grams
  200. Tokenization
  201. Logistic regression spam filters
  202. Logistic regression
  203. Dataset
  204. Python
  205. Results
  206. Summary
  207. Efficient Network Anomaly Detection Using k-means
  208. Stages of a network attack
  209. Phase 1 – Reconnaissance 
  210. Phase 2 – Initial compromise 
  211. Phase 3 – Command and control 
  212. Phase 4 – Lateral movement
  213. Phase 5 – Target attainment 
  214. Phase 6 – Ex-filtration, corruption, and disruption 
  215. Dealing with lateral movement in networks
  216. Using Windows event logs to detect network anomalies
  217. Logon/Logoff events 
  218. Account logon events
  219. Object access events
  220. Account management events
  221. Active directory events
  222. Ingesting active directory data
  223. Data parsing
  224. Modeling
  225. Detecting anomalies in a network with k-means
  226. Network intrusion data
  227. Coding the network intrusion attack
  228. Model evaluation 
  229. Sum of squared errors
  230. Choosing k for k-means
  231. Normalizing features
  232. Manual verification
  233. Summary
  234. Decision Tree and Context-Based Malicious Event Detection
  235. Adware
  236. Bots
  237. Bugs
  238. Ransomware
  239. Rootkit
  240. Spyware
  241. Trojan horses
  242. Viruses
  243. Worms
  244. Malicious data injection within databases
  245. Malicious injections in wireless sensors
  246. Use case
  247. The dataset
  248. Importing packages 
  249. Features of the data
  250. Model
  251. Decision tree 
  252. Types of decision trees
  253. Categorical variable decision tree
  254. Continuous variable decision tree
  255. Gini coefficient
  256. Random forest
  257. Anomaly detection
  258. Isolation forest
  259. Supervised and outlier detection with Knowledge Discovery Databases (KDD)
  260. Revisiting malicious URL detection with decision trees
  261. Summary
  262. Catching Impersonators and Hackers Red Handed
  263. Understanding impersonation
  264. Different types of impersonation fraud 
  265. Impersonators gathering information
  266. How an impersonation attack is constructed
  267. Using data science to detect domains that are impersonations
  268. Levenshtein distance
  269. Finding domain similarity between malicious URLs
  270. Authorship attribution
  271. AA detection for tweets
  272. Difference between test and validation datasets
  273. Sklearn pipeline
  274. Naive Bayes classifier for multinomial models
  275. Identifying impersonation as a means of intrusion detection 
  276. Summary
  277. Changing the Game with TensorFlow
  278. Introduction to TensorFlow
  279. Installation of TensorFlow
  280. TensorFlow for Windows users
  281. Hello world in TensorFlow
  282. Importing the MNIST dataset
  283. Computation graphs
  284. What is a computation graph?
  285. Tensor processing unit
  286. Using TensorFlow for intrusion detection
  287. Summary
  288. Financial Fraud and How Deep Learning Can Mitigate It
  289. Machine learning to detect financial fraud
  290. Imbalanced data
  291. Handling imbalanced datasets
  292. Random under-sampling
  293. Random oversampling
  294. Cluster-based oversampling
  295. Synthetic minority oversampling technique
  296. Modified synthetic minority oversampling technique
  297. Detecting credit card fraud
  298. Logistic regression
  299. Loading the dataset
  300. Approach
  301. Logistic regression classifier – under-sampled data
  302. Tuning hyperparameters 
  303. Detailed classification reports
  304. Predictions on test sets and plotting a confusion matrix
  305. Logistic regression classifier – skewed data
  306. Investigating precision-recall curve and area
  307. Deep learning time
  308. Adam gradient optimizer
  309. Summary
  310. Case Studies
  311. Introduction to our password dataset
  312. Text feature extraction
  313. Feature extraction with scikit-learn
  314. Using the cosine similarity to quantify bad passwords
  315. Putting it all together
  316. Summary
  317. Other Books You May Enjoy
  318. Leave a review - let other readers know what you think

Feature extraction with scikit-learn

We have seen the power of scikit-learn in this book, and this chapter will be no different. Let's import the CountVectorizer class to quickly count the occurrences of phrases in our text:

# CountVectorizer comes from sklearn's text feature extraction module;
# the feature extraction module as a whole contains many tools built for extracting features from data.
# Earlier, we manually extracted features by applying functions such as num_caps, special_characters, and so on

# The CountVectorizer class specifically is built to quickly count occurrences of phrases within pieces of text
from sklearn.feature_extraction.text import CountVectorizer

We will start by simply creating an instance of CountVectorizer with two specific parameters. We will set the analyzer to char so that we count phrases of characters rather than words. ngram_range will be set to (1, 1) to grab only single-character occurrences:

one_cv = CountVectorizer(ngram_range=(1, 1), analyzer='char')

# We call the fit_transform method to learn the vocabulary and then
# transform our text series into a matrix, which we will call one_char.
# Previously we created a matrix of quantitative data by applying our own functions; now we are creating numerical matrices using sklearn

one_char = one_cv.fit_transform(text)

one_char
# Note it is a sparse matrix
# there are 70 unique chars (number of columns)
<1048485x70 sparse matrix of type '<type 'numpy.int64'>' with 6935190 stored elements in Compressed Sparse Row format>

Note that the number of rows matches the number of passwords we have been working with, and the 70 columns correspond to the 70 unique characters found in the corpus:

# we can peek into the learned vocabulary of the CountVectorizer by calling the vocabulary_ attribute of the CV

# the keys are the learned phrases while the values represent a unique index used by the CV to keep track of the vocab
one_cv.vocabulary_

{u'\r': 0, u' ': 1, u'!': 2, u'"': 3, u'#': 4, u'$': 5, u'%': 6, u'&': 7, u"'": 8, u'(': 9, u')': 10, u'*': 11, u'+': 12, u',': 13, u'-': 14, u'.': 15, u'/': 16, u'0': 17, u'1': 18, u'2': 19, u'3': 20, u'4': 21, u'5': 22, u'6': 23, u'7': 24, u'8': 25, u'9': 26, u':': 27, u';': 28, u'<': 29, u'=': 30, ...
# Note that it auto-lowercases by default!

We have all of these characters, including letters, punctuation, and more. Note, however, that there are no capital letters anywhere in this vocabulary; this is due to CountVectorizer's auto-lowercasing. Let's follow the same procedure, but this time turn off the auto-lowercase feature that comes with CountVectorizer:

# now with lowercase=False, so that we do not force characters to lowercase
one_cv = CountVectorizer(ngram_range=(1, 1), analyzer='char', lowercase=False)


one_char = one_cv.fit_transform(text)

one_char

# there are now 96 unique chars (number of columns), 26 more thanks to the uppercase letters

<1048485x96 sparse matrix of type '<type 'numpy.int64'>' with 6955519 stored elements in Compressed Sparse Row format>

We get the following output:

one_cv.vocabulary_ 

{u'\r': 0, u' ': 1, u'!': 2, u'"': 3, u'#': 4, u'$': 5, u'%': 6, u'&': 7, u"'": 8, u'(': 9, u')': 10, u'*': 11, u'+': 12, u',': 13, u'-': 14, u'.': 15, u'/': 16, u'0': 17, u'1': 18, u'2': 19, u'3': 20,
.....

Our capital letters are now included in the vocabulary, which is evident from the 26 additional columns (70 to 96). We can now use the fitted vectorizer to transform new pieces of text, as shown:

# import pandas for DataFrame display (if not already imported earlier in the chapter)
import pandas as pd

# transforming a new password
pd.DataFrame(one_cv.transform(['qwerty123!!!']).toarray(), columns=one_cv.get_feature_names())

# a fitted vectorizer cannot learn new vocab; if we introduce a new character, it simply won't matter

The following shows the output:

   !  "  #  $  %  &  '  (  ...  u  v  w  x  y  z  {  |  }  ~
0  3  0  0  0  0  0  0  0  ...  0  0  1  0  1  0  0  0  0  0

[1 rows x 96 columns]

It is important to remember that once a vectorizer is fit, it cannot learn new vocabulary; for example:

print "~" in one_cv.vocabulary_
True

print "D" in one_cv.vocabulary_
True

print "\t" in one_cv.vocabulary_
False

# transforming a new password (adding \t [the tab character] into the mix)
pd.DataFrame(one_cv.transform(['qw\terty123!!!']).toarray(), columns=one_cv.get_feature_names())

We get the following output:

   !  "  #  $  %  &  '  (  ...  u  v  w  x  y  z  {  |  }  ~
0  3  0  0  0  0  0  0  0  ...  0  0  1  0  1  0  0  0  0  0

[1 rows x 96 columns]

We end up with the same matrix even though the second password had a new character in it. Let's expand our universe by allowing for up to five-character phrases. This will count occurrences of unique one-, two-, three-, four-, and five-character phrases now. We should expect to see our vocabulary explode:

# now let's count all 1, 2, 3, 4, and 5 character phrases
five_cv = CountVectorizer(ngram_range=(1, 5), analyzer='char')

five_char = five_cv.fit_transform(text)

five_char
# there are 2,570,934 unique phrases of up to 5 consecutive characters

<1048485x2570934 sparse matrix of type '<type 'numpy.int64'>' with 31053193 stored elements in Compressed Sparse Row format>

We went from 70 columns to 2,570,934 (note that we did not turn off auto-lowercasing here):

# much larger vocabulary!

five_cv.vocabulary_

{u'uer24': 2269299, u'uer23': 2269298, u'uer21': 2269297, u'uer20': 2269296, u'a4uz5': 640686, u'rotai': 2047903, u'hd20m': 1257873, u'i7n5': 1317982, u'fkhb8': 1146472, u'juy9f': 1460014, u'xodu': 2443742, u'xodt': 2443740,

We will now turn off lowercasing to see how many unique phrases we get:

# same as before, but without lowercasing: count all 1, 2, 3, 4, and 5 character phrases
five_cv_lower = CountVectorizer(ngram_range=(1, 5), analyzer='char', lowercase=False)

five_char_lower = five_cv_lower.fit_transform(text)

five_char_lower
# there are now 2,922,297 unique phrases of up to 5 consecutive characters

<1048485x2922297 sparse matrix of type '<type 'numpy.int64'>' with 31080917 stored elements in Compressed Sparse Row format>

With lowercasing off, our vocabulary grows to 2,922,297 items. We will use this data to extract the most common phrases of up to five characters in our corpus. Note that this is different from our value_counts before. Previously, we were counting the most common whole passwords, whereas now we are counting the most common phrases that occur within the passwords:

# let's grab the most common "phrases" of up to five chars
# we will accomplish this by using numpy to do some quick math
import numpy as np

# first we will sum over the rows of our data to get the total count of each phrase
summed_features = np.sum(five_char, axis=0)

print summed_features.shape  # == (1, 2570934)

# we will then sort the summed_features variable and grab the 20 most common phrases' indices in the CV's vocabulary
top_20 = np.argsort(summed_features)[:,-20:]

top_20  # shape == (1, 20)

matrix([[1619465, 2166552, 1530799, 1981845, 2073035, 297134, 457130, 406411, 1792848, 352276, 1696853, 562360, 508193, 236639, 1308517, 994777, 36326, 171634, 629003, 100177]])

This gives us the indices (from 0 to 2570933) of the most commonly occurring phrases of up to five characters. To see the actual phrases, let's use these indices to index into the get_feature_names output of our CountVectorizer, as shown:

# plug these indices into the features of the CV

# argsort returns indices in ascending order, so the last entry, '1', is the most common phrase, followed by 'a'
np.array(five_cv.get_feature_names())[top_20]


array([[u'm', u't', u'l', u'r', u's', u'4', u'7', u'6', u'o', u'5', u'n', u'9', u'8', u'3', u'i', u'e', u'0', u'2', u'a', u'1']], dtype='<U5')

Unsurprisingly, the most common one- to five-character phrases are single characters (letters and numbers). Let's expand to see the most common 50 phrases:

# top 50 phrases
np.array(five_cv.get_feature_names())[np.argsort(summed_features)[:,-50:]]

array([[u'13', u'98', u'ng', u'21', u'01', u'er', u'in', u'20', u'10', u'x', u'11', u'v', u'23', u'00', u'19', u'z', u'an', u'j', u'w', u'f', u'12', u'p', u'y', u'b', u'k', u'g', u'h', u'c', u'd', u'u', u'm', u't', u'l', u'r', u's', u'4', u'7', u'6', u'o', u'5', u'n', u'9', u'8', u'3', u'i', u'e', u'0', u'2', u'a', u'1']], dtype='<U5')

Now we start to see two-character phrases. Let's expand even more to the top 100 phrases:

# top 100 phrases
np.array(five_cv.get_feature_names())[np.argsort(summed_features)[:,-100:]]

array([[u'61', u'33', u'50', u'07', u'18', u'41', u'198', u'09', u'el', u'80', u'lo', u'05', u're', u'ch', u'ia', u'03', u'90', u'89', u'91', u'08', u'32', u'56', u'81', u'16', u'25', u'la', u'le', u'51', u'as', u'34', u'al', u'45', u'ra', u'30', u'14', u'15', u'02', u'ha', u'99', u'52', u'li', u'88', u'31', u'22', u'on', u'123', u'ma', u'en', u'ar', u'q', u'13', u'98', u'ng', u'21', u'01', u'er', u'in', u'20', u'10', u'x', u'11', u'v', u'23', u'00', u'19', u'z', u'an', u'j', u'w', u'f', u'12', u'p', u'y', u'b', u'k', u'g', u'h', u'c', u'd', u'u', u'm', u't', u'l', u'r', u's', u'4', u'7', u'6', u'o', u'5', u'n', u'9', u'8', u'3', u'i', u'e', u'0', u'2', u'a', u'1']], dtype='<U5')

To get a better sense of the phrases used in passwords, let's make a new vectorizer with lowercase set to False and ngram_range set to (4, 7). This avoids single-character phrases and gives us more context into the kinds of themes that occur in the most common passwords:

seven_cv = CountVectorizer(ngram_range=(4, 7), analyzer='char', lowercase=False)

seven_char = seven_cv.fit_transform(text)

seven_char

<1048485x7309977 sparse matrix of type '<type 'numpy.int64'>' with 16293052 stored elements in Compressed Sparse Row format>

With our vectorizer built and fit, let's use it to grab the 100 most common four- to seven-character phrases:

summed_features = np.sum(seven_char, axis=0)

# top 100 tokens of length 4-7
np.array(seven_cv.get_feature_names())[np.argsort(summed_features)[:,-100:]]


array([[u'1011', u'star', u'56789', u'g123', u'ming', u'long', u'ang1', u'2002', u'3123', u'ing1', u'201314', u'2003', u'1992', u'2004', u'1122', u'ling', u'2001', u'20131', u'woai', u'lian', u'feng', u'2345678', u'1212', u'1101', u'01314', u'o123', u'345678', u'ever', u's123', u'uang', u'1010', u'1980', u'huan', u'i123', u'king', u'mari', u'2005', u'hong', u'6789', u'1981', u'00000', u'45678', u'2013', u'11111', u'1991', u'1231', u'ilove', u'admin', u'ilov', u'ange', u'2006', u'0131', u'admi', u'heng', u'1234567', u'5201', u'e123', u'234567', u'dmin', u'pass', u'8888', u'34567', u'zhang', u'jian', u'2007', u'5678', u'1982', u'2000', u'zhan', u'yang', u'n123', u'1983', u'4567', u'1984', u'1990', u'a123', u'2009', u'ster', u'1985', u'iang', u'2008', u'2010', u'xiao', u'chen', u'hang', u'wang', u'1986', u'1111', u'1989', u'0000', u'1988', u'1987', u'1314', u'love', u'123456', u'23456', u'3456', u'12345', u'2345', u'1234']], dtype='<U7')

Words and numbers stick out immediately, such as the following:

  • pass, 1234, 56789 (easy phrases to remember)
  • 1980, 1991, 1992, 2003, 2004, and so on (likely years of birth)
  • ilove, love
  • yang, zhan, hong (names)
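
We can spot-check how often any one of these phrases occurs across the corpus by looking up its column in the vectorizer's vocabulary and reading the total out of summed_features. Here is a minimal sketch of that lookup; the phrase 'love' is chosen only because it appears in the output above:

# hypothetical spot check: total occurrences of one of the phrases from the top-100 output
phrase = 'love'

if phrase in seven_cv.vocabulary_:
    # column index assigned to the phrase by the vectorizer
    idx = seven_cv.vocabulary_[phrase]
    # summed_features was computed above as np.sum(seven_char, axis=0)
    total = summed_features[0, idx]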

To get an even better sense of interesting phrases, let's use the TF-IDF vectorizer in scikit-learn to isolate rare phrases that are interesting and, therefore, likely better to use in passwords:

# Term Frequency-Inverse Document Frequency (TF-IDF)

# What: Computes "relative frequency" of a word that appears in a document compared to its frequency across all documents

# Why: More useful than "term frequency" for identifying "important" words/phrases in each document (high frequency in that document, low frequency in other documents)

from sklearn.feature_extraction.text import TfidfVectorizer

TF-IDF is commonly used for search-engine scoring, text summarization, and document clustering.
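
Under its default settings (smooth_idf=True, norm='l2'), TfidfVectorizer scores a phrase by multiplying its raw count in a password by a smoothed inverse document frequency, then L2-normalizes each row. The following is a minimal sketch of that calculation for a single phrase, using made-up counts rather than values from our corpus:

import numpy as np

# made-up numbers purely for illustration (not from our password corpus)
n_docs = 1000000   # total number of passwords
df = 250000        # number of passwords containing the phrase
tf = 2             # times the phrase appears in this particular password

# scikit-learn's smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1

raw_score = tf * idf
# each row of the tf-idf matrix is then divided by its Euclidean (L2) norm,
# so the final value also depends on every other phrase in the same password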

We will begin by creating a vectorizer similar to the CountVectorizer we made earlier. ngram_range will be set to (1, 1) and the analyzer will be char:

one_tv = TfidfVectorizer(ngram_range=(1, 1), analyzer='char')

# once we instantiate the vectorizer, we call the fit_transform method to learn the vocabulary and then
# transform our text series into a brand new matrix called one_char_tf
# Previously we created a matrix of quantitative data by applying our own functions; now we are creating numerical
# matrices using sklearn
one_char_tf = one_tv.fit_transform(text)

# same shape as CountVectorizer
one_char_tf

<1048485x70 sparse matrix of type '<type 'numpy.float64'>' with 6935190 stored elements in Compressed Sparse Row format>

Let's use this new vectorizer to transform qwerty123:

# transforming a new password
pd.DataFrame(one_tv.transform(['qwerty123']).toarray(), columns=one_tv.get_feature_names())

We get the following output:

     !    "    #    $    %    &    '    (  ...    u    v         w    x         y    z    {    |    }    ~
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.408704  0.0  0.369502  0.0  0.0  0.0  0.0  0.0

[1 rows x 70 columns]

The values in the table are no longer counts; they are calculations involving relative frequency. Higher values indicate that the phrase is one or both of the following:

  • Used frequently in this password
  • Used infrequently throughout the corpus of passwords
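
To read that row without scanning all 70 columns, we can pull out just the non-zero entries, mirroring the pattern we use for abc123 below. A minimal sketch reusing the one_tv vectorizer along with numpy and pandas:

# transform the password and keep only the characters that actually occur in it
qwerty_tf = one_tv.transform(['qwerty123'])
char_features = np.array(one_tv.get_feature_names())

# non-zero tf-idf weights, labelled by character; rarer characters in the corpus get larger weights
pd.DataFrame(qwerty_tf[qwerty_tf.nonzero()],
             columns=char_features[qwerty_tf.nonzero()[1]])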

Let's build a more complex vectorizer with phrases learned up to five characters:

# make a five-char TfidfVectorizer
five_tv = TfidfVectorizer(ngram_range=(1, 5), analyzer='char')

five_char_tf = five_tv.fit_transform(text)

# same shape as CountVectorizer
five_char_tf

<1048485x2570934 sparse matrix of type '<type 'numpy.float64'>' with 31053193 stored elements in Compressed Sparse Row format>

Let's use this new vectorizer to transform the simple abc123 password:

# Let's see some tfidf values of passwords

# store the feature names as a numpy array
features = np.array(five_tv.get_feature_names())

# transform a very simple password
abc_transformed = five_tv.transform(['abc123'])

# grab the non-zero features, that is, the ngrams that actually exist in this password
features[abc_transformed.nonzero()[1]]


array([u'c123', u'c12', u'c1', u'c', u'bc123', u'bc12', u'bc1', u'bc', u'b', u'abc12', u'abc1', u'abc', u'ab', u'a', u'3', u'23', u'2', u'123', u'12', u'1'], dtype='<U5')

We will look at the non-zero tfidf scores, as shown:

# grab the non zero tfidf scores of the features
abc_transformed[abc_transformed.nonzero()]


matrix([[0.28865293, 0.27817216, 0.23180301, 0.10303378, 0.33609531, 0.33285593, 0.31079987, 0.23023187, 0.11165455, 0.33695385, 0.31813905, 0.25043863, 0.18481603, 0.07089031, 0.08285116, 0.13324432, 0.07449711, 0.15211427, 0.12089443, 0.06747844]])

# put them together in a DataFrame
pd.DataFrame(abc_transformed[abc_transformed.nonzero()],
             columns=features[abc_transformed.nonzero()[1]])

Running the preceding code yields a table in which the phrase 1 has a TF-IDF score of 0.067478 while bc123 has a score of 0.336095, implying that bc123 is more interesting than 1, which makes sense:

# Let's repeat the process with a slightly better password
password_transformed = five_tv.transform(['sdf%ERF'])

# grab the non zero features
features[password_transformed.nonzero()[1]]

# grab the non zero tfidf scores of the features
password_transformed[password_transformed.nonzero()]

# put them together in a DataFrame
pd.DataFrame(password_transformed[password_transformed.nonzero()], columns=features[password_transformed.nonzero()[1]])

Running the preceding code yields a table in which %er has a larger TF-IDF value than 123 (0.453607 versus 0.152114). This implies that %er is more interesting and occurs less often across the entire corpus. Also note that the TF-IDF value of %er is larger than any value found for abc123, implying that this phrase alone is more interesting than anything found in abc123.

Let's take all of this a step further and introduce a mathematical function called the cosine similarity to judge the strength of new passwords that haven't been seen before.
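
As a minimal preview of that idea (the next section develops it properly), here is one way such a comparison might look, reusing the five_tv vectorizer and scikit-learn's cosine_similarity helper; the two example passwords are purely illustrative:

from sklearn.metrics.pairwise import cosine_similarity

# illustrative only: compare a new, unseen password against one we already consider weak
candidate = five_tv.transform(['abc1234'])
known_bad = five_tv.transform(['abc123'])

# returns a 1x1 array; values close to 1 mean the new password is built from
# the same kinds of phrases as the weak one, values close to 0 mean it shares very little with it
cosine_similarity(candidate, known_bad)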