We will use the AWID data to identify impersonation attacks. AWID is a family of datasets focused on intrusion detection, distributed in both full and reduced sizes; the reduced sets are not subsets of the full ones.
Each version has a training set (denoted Trn) and a test set (denoted Tst); the test set was not derived from the corresponding training set.
Finally, there is a version whose labels correspond to the individual attacks (ATK), as well as a version where the attack labels are grouped into three major classes (CLS). The two versions differ only in their labels:
| Name | Classes | Size | Type | Records | Hours |
|---|---|---|---|---|---|
| AWID-ATK-F-Trn | 10 | Full | Train | 162,375,247 | 96 |
| AWID-ATK-F-Tst | 17 | Full | Test | 48,524,866 | 12 |
| AWID-CLS-F-Trn | 4 | Full | Train | 162,375,247 | 96 |
| AWID-CLS-F-Tst | 4 | Full | Test | 48,524,866 | 12 |
| AWID-ATK-R-Trn | 10 | Reduced | Train | 1,795,575 | 1 |
| AWID-ATK-R-Tst | 15 | Reduced | Test | 575,643 | 1/3 |
| AWID-CLS-R-Trn | 4 | Reduced | Train | 1,795,575 | 1 |
| AWID-CLS-R-Tst | 4 | Reduced | Test | 530,643 | 1/3 |
This dataset has 155 attributes.
| Field name | Description | Type | Versions |
|---|---|---|---|
| comment | Comment | Character string | 1.8.0 to 1.8.15 |
| frame.cap_len | Frame length stored into the capture file | Unsigned integer, 4 bytes | 1.0.0 to 2.6.4 |
| frame.coloring_rule.name | Coloring Rule Name | Character string | 1.0.0 to 2.6.4 |
| frame.coloring_rule.string | Coloring Rule String | Character string | 1.0.0 to 2.6.4 |
| frame.comment | Comment | Character string | 1.10.0 to 2.6.4 |
| frame.comment.expert | Formatted comment | Label | 1.12.0 to 2.6.4 |
| frame.dlt | WTAP_ENCAP | Signed integer, 2 bytes | 1.8.0 to 1.8.15 |
| frame.encap_type | Encapsulation type | Signed integer, 2 bytes | 1.10.0 to 2.6.4 |
| frame.file_off | File Offset | Signed integer, 8 bytes | 1.0.0 to 2.6.4 |
| frame.ignored | Frame is ignored | Boolean | 1.4.0 to 2.6.4 |
| frame.incomplete | Incomplete dissector | Label | 2.0.0 to 2.6.4 |
| frame.interface_description | Interface description | Character string | 2.4.0 to 2.6.4 |
| frame.interface_id | Interface id | Unsigned integer, 4 bytes | 1.8.0 to 2.6.4 |
| frame.interface_name | Interface name | Character string | 2.4.0 to 2.6.4 |
| frame.len | Frame length on the wire | Unsigned integer, 4 bytes | 1.0.0 to 2.6.4 |
| frame.link_nr | Link Number | Unsigned integer, 2 bytes | 1.0.0 to 2.6.4 |
| frame.marked | Frame is marked | Boolean | 1.0.0 to 2.6.4 |
| frame.md5_hash | Frame MD5 Hash | Character string | 1.2.0 to 2.6.4 |
| frame.number | Frame Number | Unsigned integer, 4 bytes | 1.0.0 to 2.6.4 |
| frame.offset_shift | Time shift for this packet | Time offset | 1.8.0 to 2.6.4 |
| frame.p2p_dir | Point-to-Point Direction | Signed integer, 1 byte | 1.0.0 to 2.6.4 |
| frame.p_prot_data | Number of per-protocol-data | Unsigned integer, 4 bytes | 1.10.0 to 1.12.13 |
| frame.packet_flags | Packet flags | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4 |
| frame.packet_flags_crc_error | CRC error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_direction | Direction | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4 |
| frame.packet_flags_fcs_length | FCS length | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4 |
| frame.packet_flags_packet_too_error | Packet too long error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_packet_too_short_error | Packet too short error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_preamble_error | Preamble error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_reception_type | Reception type | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4 |
| frame.packet_flags_reserved | Reserved | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4 |
| frame.packet_flags_start_frame_delimiter_error | Start frame delimiter error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_symbol_error | Symbol error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_unaligned_frame_error | Unaligned frame error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_wrong_inter_frame_gap_error | Wrong interframe gap error | Boolean | 1.10.0 to 2.6.4 |
| frame.pkt_len | Frame length on the wire | Unsigned integer, 4 bytes | 1.0.0 to 1.0.16 |
| frame.protocols | Protocols in frame | Character string | 1.0.0 to 2.6.4 |
| frame.ref_time | This is a Time Reference frame | Label | 1.0.0 to 2.6.4 |
| frame.time | Arrival Time | Date and time | 1.0.0 to 2.6.4 |
| frame.time_delta | Time delta from previous captured frame | Time offset | 1.0.0 to 2.6.4 |
| frame.time_delta_displayed | Time delta from previous displayed frame | Time offset | 1.0.0 to 2.6.4 |
| frame.time_epoch | Epoch Time | Time offset | 1.4.0 to 2.6.4 |
| frame.time_invalid | Arrival Time: Fractional second out of range (0-1000000000) | Label | 1.0.0 to 2.6.4 |
| frame.time_relative | Time since reference or first frame | Time offset | 1.0.0 to 2.6.4 |
The sample dataset is available in the accompanying GitHub repository. The intrusion data is converted into a DataFrame using the Python pandas library:

```python
import pandas as pd
```

The features discussed earlier are imported into the DataFrame:

```python
# get the names of the features
features = ['frame.interface_id', 'frame.dlt', 'frame.offset_shift',
            'frame.time_epoch', 'frame.time_delta', 'frame.time_delta_displayed',
            'frame.time_relative', 'frame.len', 'frame.cap_len', 'frame.marked',
            'frame.ignored', 'radiotap.version', 'radiotap.pad', 'radiotap.length',
            'radiotap.present.tsft', 'radiotap.present.flags',
            'radiotap.present.rate', 'radiotap.present.channel',
            'radiotap.present.fhss', 'radiotap.present.dbm_antsignal',
            'radiotap.present.dbm_antnoise', 'radiotap.present.lock_quality',
            'radiotap.present.tx_attenuation',
            'radiotap.present.db_tx_attenuation',
            'radiotap.present.dbm_tx_power', 'radiotap.present.antenna',
            'radiotap.present.db_antsignal', 'radiotap.present.db_antnoise',
            ........
            'wlan.qos.amsdupresent', 'wlan.qos.buf_state_indicated',
            'wlan.qos.bit4', 'wlan.qos.txop_dur_req',
            'wlan.qos.buf_state_indicated', 'data.len', 'class']
```
Next, we import the training dataset and count the number of rows and columns available in the dataset:
```python
# import a training set
awid = pd.read_csv("../data/AWID-CLS-R-Trn.csv", header=None, names=features)
# see the number of rows/columns
awid.shape
```

The output can be seen as follows:

```
Out[4]: (1795575, 155)
```
The dataset uses ? as a null placeholder. We will eventually have to replace these with None values:

```python
awid.head()
```

The preceding code displays the first 5 rows of the 155-column table. Now we look at the distribution of the response variable:
```python
awid['class'].value_counts(normalize=True)
```

At this point, isna() reports no null values, because the missing entries are encoded as ? strings:

```python
awid.isna().sum()
```
We replace the ? marks with None:

```python
awid.replace({"?": None}, inplace=True)
```
We count how many pieces of data are now missing:

```python
awid.isna().sum()
```

The output now shows nonzero counts for the columns that contained ? values.
The goal here is to remove columns that have over 50% of their data missing:

```python
columns_with_mostly_null_data = awid.columns[awid.isnull().mean() >= 0.5]
```

We see that 72 columns are going to be affected:

```python
columns_with_mostly_null_data.shape
```

The output is as follows:

```
(72,)
```
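The column-selection trick above can be checked by hand on a toy DataFrame (synthetic data, not the AWID set): `isnull().mean()` gives the fraction of missing values per column, and the boolean mask keeps only the columns at or above the threshold.

```python
import pandas as pd

# toy frame (synthetic, not the AWID data): column "b" is two-thirds missing
df = pd.DataFrame({"a": [1, 2, 3], "b": [None, None, 5]})

# fraction of null values per column
null_fraction = df.isnull().mean()

# keep only the columns where at least half the values are missing
mostly_null = df.columns[null_fraction >= 0.5]
print(list(mostly_null))  # ['b']
```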
We drop the columns with over half of their data missing:

```python
awid.drop(columns_with_mostly_null_data, axis=1, inplace=True)
awid.shape
```

The preceding code gives the following output:

```
(1795575, 83)
```
Drop the rows that have missing values:

```python
awid.dropna(inplace=True)  # drop rows with null data
```

We lose 456,169 rows:

```python
awid.shape
```

The following is the output of the preceding code:

```
(1339406, 83)
```
However, dropping doesn't affect our class distribution too much:

```python
# 0.878763 is our null accuracy; our model must beat this number to be a contender
awid['class'].value_counts(normalize=True)
```

The output can be seen as follows:

```
normal           0.878763
injection        0.048812
impersonation    0.036227
flooding         0.036198
Name: class, dtype: float64
```
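The "null accuracy" mentioned in the comment is simply the relative frequency of the majority class: a model that always predicts `normal` would already score 0.878763, so any real classifier must beat that. A minimal sketch on a synthetic label column (not the real AWID distribution):

```python
import pandas as pd

# synthetic label column: 8 of 10 records belong to the majority class
labels = pd.Series(["normal"] * 8 + ["injection", "flooding"])

# null accuracy = share of the majority class; a constant "normal"
# predictor achieves exactly this score
null_accuracy = labels.value_counts(normalize=True).max()
print(null_accuracy)  # 0.8
```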
Now we count how many columns pandas parsed as numeric:

```python
# only select numeric columns for our ML algorithms; there should be more
awid.select_dtypes(['number']).shape
```

The output is as follows:

```
(1339406, 45)
```

Many numeric columns were read in as strings, so we convert them:

```python
# transform all columns into numerical dtypes
for col in awid.columns:
    awid[col] = pd.to_numeric(awid[col], errors='ignore')

# that makes more sense
awid.select_dtypes(['number']).shape
```

The preceding code gives the following output:

```
(1339406, 74)
```
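To see what `pd.to_numeric` does in isolation, here is a small sketch on synthetic Series. Note it uses `errors="coerce"` (unparseable values become NaN) rather than the `errors="ignore"` used above, which instead leaves a column entirely untouched when any value fails to parse:

```python
import pandas as pd

# a column of digit strings parses cleanly to an integer dtype
nums = pd.to_numeric(pd.Series(["1", "2", "3"]))
print(nums.dtype)  # int64

# with errors="coerce", unparseable entries become NaN instead of raising
mixed = pd.to_numeric(pd.Series(["1", "abc"]), errors="coerce")
print(mixed.isna().sum())  # 1
```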
Now execute awid.describe() as shown in the following snippet:

```python
# basic descriptive statistics
awid.describe()
```

The output will display a table of 8 rows × 74 columns.
```python
X, y = awid.select_dtypes(['number']), awid['class']

# do a basic Naive Bayes fitting
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

# fit our model to the data
nb.fit(X, y)

# load the test set
awid_test = pd.read_csv("../data/AWID-CLS-R-Tst.csv", header=None, names=features)

# drop the problematic columns
awid_test.drop(columns_with_mostly_null_data, axis=1, inplace=True)

# replace ? with None
awid_test.replace({"?": None}, inplace=True)

# drop the rows with null data
awid_test.dropna(inplace=True)

# convert columns to numerical values
for col in awid_test.columns:
    awid_test[col] = pd.to_numeric(awid_test[col], errors='ignore')

awid_test.shape
```
The output can be seen as follows:
```python
from sklearn.metrics import accuracy_score

X_test = awid_test.select_dtypes(['number'])
y_test = awid_test['class']

# simple function to test the accuracy of a model fitted on
# training data against our testing data
def get_test_accuracy_of(model):
    y_preds = model.predict(X_test)
    return accuracy_score(y_preds, y_test)

# Naive Bayes does very poorly on its own!
get_test_accuracy_of(nb)
```
The output is seen as follows:
Next, we try logistic regression on the same problem:

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X, y)

# logistic regression does even worse
get_test_accuracy_of(lr)
```
The following is the output:
```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X, y)

# the tree does very well!
get_test_accuracy_of(tree)
```

The output looks like this:
We inspect the Gini importance scores of the decision tree's features:

```python
pd.DataFrame({'feature': awid.select_dtypes(['number']).columns,
              'importance': tree.feature_importances_}) \
  .sort_values('importance', ascending=False).head(10)
```

We will get output like this:
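The `feature_importances_` values reported above are Gini-based: each feature's score reflects how much it reduces impurity across the tree's splits, and the scores sum to 1. A small sketch on synthetic data (the feature names here are placeholders, not AWID columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data, not the AWID feature matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Gini-based importances are non-negative and sum to 1
imp = pd.DataFrame({"feature": [f"f{i}" for i in range(5)],
                    "importance": tree_demo.feature_importances_})
print(imp.sort_values("importance", ascending=False).head(3))
```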
```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()
forest.fit(X, y)

# the random forest does slightly worse
get_test_accuracy_of(forest)
```

The output can be seen as follows:
Create a pipeline that scales the numerical data and then feeds the result into a decision tree:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", DecisionTreeClassifier())
])

# try varying levels of depth
params = {
    "classifier__max_depth": [None, 3, 5, 10],
}

# instantiate a grid search module
grid = GridSearchCV(pipeline, params)

# fit the module
grid.fit(X, y)

# test the best model
get_test_accuracy_of(grid.best_estimator_)
```

The following shows the output:
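After fitting, GridSearchCV also exposes which hyper-parameters won and their cross-validated score via `best_params_` and `best_score_`. A minimal, self-contained sketch on synthetic data (the shapes and names are made up, not the AWID columns):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the AWID feature matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

grid_demo = GridSearchCV(pipe, {"classifier__max_depth": [None, 3, 5]}, cv=3)
grid_demo.fit(X_demo, y_demo)

# the winning hyper-parameters and the cross-validated accuracy
print(grid_demo.best_params_)
print(round(grid_demo.best_score_, 3))
```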
We try the same thing with a random forest:

```python
preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", RandomForestClassifier())
])

# try varying levels of depth
params = {
    "classifier__max_depth": [None, 3, 5, 10],
}

grid = GridSearchCV(pipeline, params)
grid.fit(X, y)

# best accuracy so far!
get_test_accuracy_of(grid.best_estimator_)
```

The final accuracy is as follows: