In this section, we will start to manually create some features in order to quantify our textual passwords. Let's first create a new column in the data DataFrame called length, which will represent the length of the password:

# 1. the length of the password

# on the left of the equal sign, note we are defining a new column called 'length'. We want this column to hold the 
# length of the password. 

# on the right of the equal sign, we use the apply method of pandas Series/DFs. We will apply a function (len in this case)
# to every element in the column 'text'

data['length'] = data['text'].apply(len)

# see our changes take effect
data.head()

Here is the output:

	text	length
0	7606374520	10
1	piontekendre	12
2	rambo144	8
3	primoz123	9
4	sal1387	7
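
As a small aside, the same length feature can be computed with the pandas string accessor. The sketch below uses a toy DataFrame (the name toy and its three rows are illustrative stand-ins, not the real dataset) to show that apply(len) and .str.len() agree on plain strings:

```python
import pandas as pd

# Toy stand-in for the password data; the column name 'text' follows this section
toy = pd.DataFrame({'text': ['7606374520', 'piontekendre', 'rambo144']})

# apply(len) calls Python's built-in len on every element of the column
toy['length'] = toy['text'].apply(len)

# .str.len() is the vectorized string accessor equivalent
toy['length_alt'] = toy['text'].str.len()

print(toy)
```

One practical difference: .str.len() returns NaN for missing values, whereas apply(len) raises a TypeError on them, so the accessor may be the safer choice if the column could contain gaps.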

Let's use this new column to see the most common passwords of five or more characters:

# top passwords of length 5 or more
data[data.length > 4]["text"].value_counts()[:10]

123456       8
123456789    7
12345        6
43162        5
7758521      5
11111        5
5201314      5
111111       4
123321       4
102030       4
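
The filter-then-count pattern used above can be broken into explicit steps. This is a minimal sketch on made-up data (toy, mask, and top are illustrative names), assuming the same text and length columns:

```python
import pandas as pd

# Made-up passwords; '1234' is frequent but too short to pass the filter
toy = pd.DataFrame({'text': ['123456', '123456', '1234', '1234', '1234', 'hunter2']})
toy['length'] = toy['text'].apply(len)

# Boolean mask: True where the password has five or more characters
mask = toy['length'] > 4

# value_counts sorts by frequency in descending order; head(10) keeps the top ten
top = toy.loc[mask, 'text'].value_counts().head(10)
print(top)
```

Bracket access (toy['length']) is used here instead of attribute access (toy.length) because attribute access silently fails for column names that collide with existing DataFrame attributes, such as size or count.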

These seem more like what we expected; we even see 111111, which was on the list we saw at the beginning of this chapter. We continue now by adding another column, num_caps, that will count the number of capital letters in the password. This will eventually give us some insight into the strength of a password:

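# count the capital letters in a string
# (this helper is assumed here: the section calls caps without showing its definition)
def caps(text):
    return sum(1 for char in text if char.isupper())

# store a new column
data['num_caps'] = data['text'].apply(caps)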

# see our changes take effect
data.head(10)

We can now see our two new columns, both of which give us a quantifiable means of assessing password strength. Longer passwords and passwords with more capital letters tend to be stronger, but of course this is not the whole picture:

	text	length	num_caps
0	`7606374520`	10	0
1	`piontekendre`	12	0
2	`rambo144`	8	0
3	`primoz123`	9	0
4	`sal1387`	7	0
5	`EVASLRDG`	8	8
6	`Detroit84`	9	1
7	`dlbd090505`	10	0
8	`snoesje12`	9	0
9	`56412197`	8	0
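
A vectorized way to build a similar column is to count regular-expression matches with the pandas string accessor. The sketch below is an alternative under stated assumptions, not the method used in this section: it counts only the ASCII letters A to Z, and the toy DataFrame is illustrative:

```python
import pandas as pd

# Three passwords taken from the table above
toy = pd.DataFrame({'text': ['EVASLRDG', 'Detroit84', 'rambo144']})

# str.count returns the number of regex matches per element;
# the pattern [A-Z] matches ASCII capital letters only
toy['num_caps'] = toy['text'].str.count(r'[A-Z]')

print(toy)
```

If passwords may contain non-ASCII capitals (for example, accented letters), a per-character check with str.isupper would count them, while this pattern would not.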

We can plot a histogram of num_caps to see how the number of capital letters is distributed across passwords, which gives us a better sense of how often capitals are used:

data['num_caps'].hist() # most passwords do not have any caps in them

Running this code yields a histogram with a strong right skew: most passwords contain few or no capital letters.
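
The right skew can also be checked numerically rather than visually. The following sketch uses a small made-up Series (the values are illustrative, not drawn from the dataset) and the pandas skew method, where a positive result indicates a longer right tail:

```python
import pandas as pd

# Made-up counts of capital letters: mostly zero, with a couple of larger values
num_caps = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 8], name='num_caps')

# Sample skewness; values above zero indicate a right-skewed distribution
skewness = num_caps.skew()

# Fraction of passwords with no capital letters at all
share_without_caps = (num_caps == 0).mean()

print(skewness, share_without_caps)
```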

Calling the describe method of the DataFrame will reveal some high-level descriptive statistics about our data:

# grab some basic descriptive statistics
data.describe()

Here is the output:

	length	num_caps
count	1.048485e+06	1.048485e+06
mean	8.390173e+00	2.575392e-01
std	2.269470e+01	1.205588e+00
min	1.000000e+00	0.000000e+00
25%	7.000000e+00	0.000000e+00
50%	8.000000e+00	0.000000e+00
75%	9.000000e+00	0.000000e+00
max	8.192000e+03	2.690000e+02
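
Scientific notation can make a summary like this harder to scan. One option, sketched here on a toy DataFrame with one deliberately extreme row (the data is illustrative), is to set a temporary display format with pd.option_context:

```python
import pandas as pd

# Toy data with a single extreme outlier in both columns
toy = pd.DataFrame({'length': [7, 8, 9, 8192], 'num_caps': [0, 0, 1, 242]})

summary = toy.describe()

# Temporarily print floats with two decimals and thousands separators
with pd.option_context('display.float_format', '{:,.2f}'.format):
    print(summary)

# Comparing the median with the max makes the outlier stand out
print(summary.loc[['50%', 'max'], 'length'])
```

The display option only affects printing; the values stored in summary are unchanged.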

The max row of the length attribute is telling us that we have some massive passwords (over 8,000 characters). We will isolate the passwords that are over 100 characters:

# let's see our long passwords
data[data.length > 100]

The long passwords can be seen here:

	text	length	num_caps
38830	`><script>alert(1)</script>\r123Lenda#\rhallibu...`	8192	242
387398	`\r251885394\rmello2\rmaitre1123\rfk6Ehruu\rthi...`	8192	176
451793	`39<0Y~c.;A1Bj\r3ddd4t\r516ks516\rag0931266\rac...`	8192	223
517600	`12345\rhdjcb100\r060571\rkaalimaa\rrelaxmax\rd...`	8192	184
580134	`or1=1--\r13817676085\r594112\rmactools\r880148...`	8192	216
752693	`pass\rmbmb266888\r1988luolin\r15877487956\rcri...`	8192	180
841857	`==)!)(=\raviral\rrimmir33\rhutcheson\rrr801201...`	8192	269
1013991	`AAj6H\rweebeth\rmonitor222\rem1981\ralexs123\r...`	8192	269

Eight rows of our DataFrame are clearly malformed: each one holds many passwords run together, separated by carriage returns (\r). We could sanitize this data, but because this case study focuses on deeper insights, let's simply use pandas to drop these eight problematic rows:

print(data[data.length > 100].shape)  # only 8 rows became malformed
(8, 3)

# to make this easy, let's just drop those problematic rows
# we will drop passwords that are way too long
data.drop(data[data.length > 100].index, axis=0, inplace=True)

# 1,048,485 - 8 == 1,048,477 makes sense
print(data.shape)
(1048477, 3)


data.describe()

The following table is the output of the preceding code:

	length	num_caps
count	1.048477e+06	1.048477e+06
mean	8.327732e+00	2.558635e-01
std	2.012173e+00	1.037190e+00
min	1.000000e+00	0.000000e+00
25%	7.000000e+00	0.000000e+00
50%	8.000000e+00	0.000000e+00
75%	9.000000e+00	0.000000e+00
max	2.900000e+01	2.800000e+01
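
For readers who would rather sanitize the malformed rows than drop them, one possible approach is to split each entry on carriage returns and give every fragment its own row. This is only a sketch on toy data: it assumes each fragment is a complete password, which may not hold for the real rows (the 8,192-character entries may have been cut off mid-password), and the names toy, split, and clean are illustrative:

```python
import pandas as pd

# Toy data: the middle entry mimics a malformed row with embedded '\r' separators
toy = pd.DataFrame({'text': ['rambo144', 'pass\rmbmb266888\r1988luolin', 'sal1387']})

# Split on carriage returns; well-formed rows become one-element lists
split = toy['text'].str.split('\r')

# explode gives every list element its own row (requires pandas 0.25 or later)
clean = split.explode().reset_index(drop=True).to_frame('text')

# Drop empty strings, which appear when an entry starts or ends with '\r'
clean = clean[clean['text'] != ''].reset_index(drop=True)

# Recompute the length feature on the cleaned rows
clean['length'] = clean['text'].str.len()
print(clean)
```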

We will now turn to scikit-learn to add some automatic feature extraction.

Table of Contents for
Hands-On Machine Learning for Cybersecurity
