We will use the AWID data to identify impersonation attacks. AWID is a family of datasets focused on intrusion detection, distributed in both full and reduced sizes; the reduced sets are not subsets of the full ones.
Each version has a training set (denoted Trn) and a test set (denoted Tst); the test set was not derived from the corresponding training set.
Finally, there is a version whose labels correspond to the individual attacks (ATK), as well as a version where the attack labels are grouped into three major classes (CLS). The two versions differ only in their labels:
| Name | Classes | Size | Type | Records | Hours |
|---|---|---|---|---|---|
| AWID-ATK-F-Trn | 10 | Full | Train | 162,375,247 | 96 |
| AWID-ATK-F-Tst | 17 | Full | Test | 48,524,866 | 12 |
| AWID-CLS-F-Trn | 4 | Full | Train | 162,375,247 | 96 |
| AWID-CLS-F-Tst | 4 | Full | Test | 48,524,866 | 12 |
| AWID-ATK-R-Trn | 10 | Reduced | Train | 1,795,575 | 1 |
| AWID-ATK-R-Tst | 15 | Reduced | Test | 575,643 | 1/3 |
| AWID-CLS-R-Trn | 4 | Reduced | Train | 1,795,575 | 1 |
| AWID-CLS-R-Tst | 4 | Reduced | Test | 530,643 | 1/3 |
This dataset has 155 attributes.
| Field name | Description | Type | Versions |
|---|---|---|---|
| comment | Comment | Character string | 1.8.0 to 1.8.15 |
| frame.cap_len | Frame length stored into the capture file | Unsigned integer, 4 bytes | 1.0.0 to 2.6.4 |
| frame.coloring_rule.name | Coloring Rule Name | Character string | 1.0.0 to 2.6.4 |
| frame.coloring_rule.string | Coloring Rule String | Character string | 1.0.0 to 2.6.4 |
| frame.comment | Comment | Character string | 1.10.0 to 2.6.4 |
| frame.comment.expert | Formatted comment | Label | 1.12.0 to 2.6.4 |
| frame.dlt | WTAP_ENCAP | Signed integer, 2 bytes | 1.8.0 to 1.8.15 |
| frame.encap_type | Encapsulation type | Signed integer, 2 bytes | 1.10.0 to 2.6.4 |
| frame.file_off | File Offset | Signed integer, 8 bytes | 1.0.0 to 2.6.4 |
| frame.ignored | Frame is ignored | Boolean | 1.4.0 to 2.6.4 |
| frame.incomplete | Incomplete dissector | Label | 2.0.0 to 2.6.4 |
| frame.interface_description | Interface description | Character string | 2.4.0 to 2.6.4 |
| frame.interface_id | Interface id | Unsigned integer, 4 bytes | 1.8.0 to 2.6.4 |
| frame.interface_name | Interface name | Character string | 2.4.0 to 2.6.4 |
| frame.len | Frame length on the wire | Unsigned integer, 4 bytes | 1.0.0 to 2.6.4 |
| frame.link_nr | Link Number | Unsigned integer, 2 bytes | 1.0.0 to 2.6.4 |
| frame.marked | Frame is marked | Boolean | 1.0.0 to 2.6.4 |
| frame.md5_hash | Frame MD5 Hash | Character string | 1.2.0 to 2.6.4 |
| frame.number | Frame Number | Unsigned integer, 4 bytes | 1.0.0 to 2.6.4 |
| frame.offset_shift | Time shift for this packet | Time offset | 1.8.0 to 2.6.4 |
| frame.p2p_dir | Point-to-Point Direction | Signed integer, 1 byte | 1.0.0 to 2.6.4 |
| frame.p_prot_data | Number of per-protocol-data | Unsigned integer, 4 bytes | 1.10.0 to 1.12.13 |
| frame.packet_flags | Packet flags | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4 |
| frame.packet_flags_crc_error | CRC error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_direction | Direction | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4 |
| frame.packet_flags_fcs_length | FCS length | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4 |
| frame.packet_flags_packet_too_error | Packet too long error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_packet_too_short_error | Packet too short error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_preamble_error | Preamble error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_reception_type | Reception type | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4 |
| frame.packet_flags_reserved | Reserved | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4 |
| frame.packet_flags_start_frame_delimiter_error | Start frame delimiter error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_symbol_error | Symbol error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_unaligned_frame_error | Unaligned frame error | Boolean | 1.10.0 to 2.6.4 |
| frame.packet_flags_wrong_inter_frame_gap_error | Wrong interframe gap error | Boolean | 1.10.0 to 2.6.4 |
| frame.pkt_len | Frame length on the wire | Unsigned integer, 4 bytes | 1.0.0 to 1.0.16 |
| frame.protocols | Protocols in frame | Character string | 1.0.0 to 2.6.4 |
| frame.ref_time | This is a Time Reference frame | Label | 1.0.0 to 2.6.4 |
| frame.time | Arrival Time | Date and time | 1.0.0 to 2.6.4 |
| frame.time_delta | Time delta from previous captured frame | Time offset | 1.0.0 to 2.6.4 |
| frame.time_delta_displayed | Time delta from previous displayed frame | Time offset | 1.0.0 to 2.6.4 |
| frame.time_epoch | Epoch Time | Time offset | 1.4.0 to 2.6.4 |
| frame.time_invalid | Arrival Time: Fractional second out of range (0-1000000000) | Label | 1.0.0 to 2.6.4 |
| frame.time_relative | Time since reference or first frame | Time offset | 1.0.0 to 2.6.4 |
The sample dataset is available in the accompanying GitHub repository. The intrusion data is converted into a DataFrame using the Python pandas library:

```python
import pandas as pd
```

The features discussed earlier are imported into the DataFrame:

```python
# get the names of the features
features = ['frame.interface_id', 'frame.dlt', 'frame.offset_shift',
            'frame.time_epoch', 'frame.time_delta', 'frame.time_delta_displayed',
            'frame.time_relative', 'frame.len', 'frame.cap_len', 'frame.marked',
            'frame.ignored', 'radiotap.version', 'radiotap.pad', 'radiotap.length',
            'radiotap.present.tsft', 'radiotap.present.flags',
            'radiotap.present.rate', 'radiotap.present.channel',
            'radiotap.present.fhss', 'radiotap.present.dbm_antsignal',
            'radiotap.present.dbm_antnoise', 'radiotap.present.lock_quality',
            'radiotap.present.tx_attenuation',
            'radiotap.present.db_tx_attenuation',
            'radiotap.present.dbm_tx_power', 'radiotap.present.antenna',
            'radiotap.present.db_antsignal', 'radiotap.present.db_antnoise',
            ........
            'wlan.qos.amsdupresent', 'wlan.qos.buf_state_indicated',
            'wlan.qos.bit4', 'wlan.qos.txop_dur_req',
            'wlan.qos.buf_state_indicated', 'data.len', 'class']
```
Next, we import the training dataset and count the number of rows and columns available in the dataset:
```python
# import a training set
awid = pd.read_csv("../data/AWID-CLS-R-Trn.csv", header=None, names=features)
# see the number of rows/columns
awid.shape
```

The output can be seen as follows:

```
Out[4]: (1795575, 155)
```
The dataset uses ? as a null placeholder. We will eventually have to replace these with None values:

```python
awid.head()
```

The preceding code displays the first 5 rows of the 155-column table. Now we look at the distribution of the response variable:
```python
awid['class'].value_counts(normalize=True)
```

At this point, isna() reports no null values, because the missing entries are encoded as ? strings:

```python
awid.isna().sum()
```
We replace the ? marks with None:

```python
awid.replace({"?": None}, inplace=True)
```
We count how many pieces of data are now missing:

```python
awid.isna().sum()
```

The output now shows nonzero counts for the columns that contained ? values.
The goal here is to remove columns that have over 50% of their data missing:

```python
columns_with_mostly_null_data = awid.columns[awid.isnull().mean() >= 0.5]
```

We see that 72 columns are going to be affected:

```python
columns_with_mostly_null_data.shape
```

The output is as follows:

```
(72,)
```
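The column-selection trick above can be checked by hand on a toy DataFrame (synthetic data, not the AWID set): `isnull().mean()` gives the fraction of missing values per column, and the boolean mask keeps only the columns at or above the threshold.

```python
import pandas as pd

# toy frame (synthetic, not the AWID data): column "b" is two-thirds missing
df = pd.DataFrame({"a": [1, 2, 3], "b": [None, None, 5]})

# fraction of null values per column
null_fraction = df.isnull().mean()

# keep only the columns where at least half the values are missing
mostly_null = df.columns[null_fraction >= 0.5]
print(list(mostly_null))  # ['b']
```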
We drop the columns with over half of their data missing:

```python
awid.drop(columns_with_mostly_null_data, axis=1, inplace=True)
awid.shape
```

The preceding code gives the following output:

```
(1795575, 83)
```
Drop the rows that have missing values:

```python
awid.dropna(inplace=True)  # drop rows with null data
```

We lose 456,169 rows:

```python
awid.shape
```

The following is the output of the preceding code:

```
(1339406, 83)
```
However, dropping doesn't affect our class distribution too much:

```python
# 0.878763 is our null accuracy; our model must beat this number to be a contender
awid['class'].value_counts(normalize=True)
```

The output can be seen as follows:

```
normal           0.878763
injection        0.048812
impersonation    0.036227
flooding         0.036198
Name: class, dtype: float64
```
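The "null accuracy" mentioned in the comment is simply the relative frequency of the majority class: a model that always predicts `normal` would already score 0.878763, so any real classifier must beat that. A minimal sketch on a synthetic label column (not the real AWID distribution):

```python
import pandas as pd

# synthetic label column: 8 of 10 records belong to the majority class
labels = pd.Series(["normal"] * 8 + ["injection", "flooding"])

# null accuracy = share of the majority class; a constant "normal"
# predictor achieves exactly this score
null_accuracy = labels.value_counts(normalize=True).max()
print(null_accuracy)  # 0.8
```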
Now we count how many columns pandas parsed as numeric:

```python
# only select numeric columns for our ML algorithms; there should be more
awid.select_dtypes(['number']).shape
```

The output is as follows:

```
(1339406, 45)
```

Many numeric columns were read in as strings, so we convert them:

```python
# transform all columns into numerical dtypes
for col in awid.columns:
    awid[col] = pd.to_numeric(awid[col], errors='ignore')

# that makes more sense
awid.select_dtypes(['number']).shape
```

The preceding code gives the following output:

```
(1339406, 74)
```
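To see what `pd.to_numeric` does in isolation, here is a small sketch on synthetic Series. Note it uses `errors="coerce"` (unparseable values become NaN) rather than the `errors="ignore"` used above, which instead leaves a column entirely untouched when any value fails to parse:

```python
import pandas as pd

# a column of digit strings parses cleanly to an integer dtype
nums = pd.to_numeric(pd.Series(["1", "2", "3"]))
print(nums.dtype)  # int64

# with errors="coerce", unparseable entries become NaN instead of raising
mixed = pd.to_numeric(pd.Series(["1", "abc"]), errors="coerce")
print(mixed.isna().sum())  # 1
```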
Now execute awid.describe() as shown in the following snippet:

```python
# basic descriptive statistics
awid.describe()
```

The output will display a table of 8 rows × 74 columns.
```python
X, y = awid.select_dtypes(['number']), awid['class']

# do a basic Naive Bayes fitting
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

# fit our model to the data
nb.fit(X, y)

# load the test set
awid_test = pd.read_csv("../data/AWID-CLS-R-Tst.csv", header=None, names=features)

# drop the problematic columns
awid_test.drop(columns_with_mostly_null_data, axis=1, inplace=True)

# replace ? with None
awid_test.replace({"?": None}, inplace=True)

# drop the rows with null data
awid_test.dropna(inplace=True)

# convert columns to numerical values
for col in awid_test.columns:
    awid_test[col] = pd.to_numeric(awid_test[col], errors='ignore')

awid_test.shape
```
The output can be seen as follows:
```python
from sklearn.metrics import accuracy_score

X_test = awid_test.select_dtypes(['number'])
y_test = awid_test['class']

# simple function to test the accuracy of a model fitted on
# training data against our testing data
def get_test_accuracy_of(model):
    y_preds = model.predict(X_test)
    return accuracy_score(y_preds, y_test)

# Naive Bayes does very poorly on its own!
get_test_accuracy_of(nb)
```
The output is seen as follows:
Next, we try logistic regression on the same problem:

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X, y)

# logistic regression does even worse
get_test_accuracy_of(lr)
```
The following is the output:
```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X, y)

# the tree does very well!
get_test_accuracy_of(tree)
```

The output looks like this:
We inspect the Gini importance scores of the decision tree's features:

```python
pd.DataFrame({'feature': awid.select_dtypes(['number']).columns,
              'importance': tree.feature_importances_}) \
  .sort_values('importance', ascending=False).head(10)
```

We will get output like this:
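The `feature_importances_` values reported above are Gini-based: each feature's score reflects how much it reduces impurity across the tree's splits, and the scores sum to 1. A small sketch on synthetic data (the feature names here are placeholders, not AWID columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data, not the AWID feature matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Gini-based importances are non-negative and sum to 1
imp = pd.DataFrame({"feature": [f"f{i}" for i in range(5)],
                    "importance": tree_demo.feature_importances_})
print(imp.sort_values("importance", ascending=False).head(3))
```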
```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()
forest.fit(X, y)

# the random forest does slightly worse
get_test_accuracy_of(forest)
```

The output can be seen as follows:
Create a pipeline that scales the numerical data and then feeds the result into a decision tree:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", DecisionTreeClassifier())
])

# try varying levels of depth
params = {
    "classifier__max_depth": [None, 3, 5, 10],
}

# instantiate a grid search module
grid = GridSearchCV(pipeline, params)

# fit the module
grid.fit(X, y)

# test the best model
get_test_accuracy_of(grid.best_estimator_)
```

The following shows the output:
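After fitting, GridSearchCV also exposes which hyper-parameters won and their cross-validated score via `best_params_` and `best_score_`. A minimal, self-contained sketch on synthetic data (the shapes and names are made up, not the AWID columns):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the AWID feature matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

grid_demo = GridSearchCV(pipe, {"classifier__max_depth": [None, 3, 5]}, cv=3)
grid_demo.fit(X_demo, y_demo)

# the winning hyper-parameters and the cross-validated accuracy
print(grid_demo.best_params_)
print(round(grid_demo.best_score_, 3))
```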
We try the same thing with a random forest:

```python
preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", RandomForestClassifier())
])

# try varying levels of depth
params = {
    "classifier__max_depth": [None, 3, 5, 10],
}

grid = GridSearchCV(pipeline, params)
grid.fit(X, y)

# best accuracy so far!
get_test_accuracy_of(grid.best_estimator_)
```

The final accuracy is as follows: