Table of Contents for
Hands-On Machine Learning for Cybersecurity


Hands-On Machine Learning for Cybersecurity by Sinan Ozdemir, published by Packt Publishing, 2018
  1. Hands-on Machine Learning for Cybersecurity
  2. Title Page
  3. Copyright and Credits
  4. Hands-On Machine Learning for Cybersecurity
  5. About Packt
  6. Why subscribe?
  7. Packt.com
  8. Contributors
  9. About the authors
  10. About the reviewers
  11. Packt is searching for authors like you
  12. Table of Contents
  13. Preface
  14. Who this book is for
  15. What this book covers
  16. To get the most out of this book
  17. Download the example code files
  18. Download the color images
  19. Conventions used
  20. Get in touch
  21. Reviews
  22. Basics of Machine Learning in Cybersecurity
  23. What is machine learning?
  24. Problems that machine learning solves
  25. Why use machine learning in cybersecurity?
  26. Current cybersecurity solutions
  27. Data in machine learning
  28. Structured versus unstructured data
  29. Labelled versus unlabelled data
  30. Machine learning phases
  31. Inconsistencies in data
  32. Overfitting
  33. Underfitting
  34. Different types of machine learning algorithm
  35. Supervised learning algorithms
  36. Unsupervised learning algorithms 
  37. Reinforcement learning
  38. Another categorization of machine learning
  39. Classification problems
  40. Clustering problems
  41. Regression problems
  42. Dimensionality reduction problems
  43. Density estimation problems
  44. Deep learning
  45. Algorithms in machine learning
  46. Support vector machines
  47. Bayesian networks
  48. Decision trees
  49. Random forests
  50. Hierarchical algorithms
  51. Genetic algorithms
  52. Similarity algorithms
  53. ANNs
  54. The machine learning architecture
  55. Data ingestion
  56. Data store
  57. The model engine
  58. Data preparation 
  59. Feature generation
  60. Training
  61. Testing
  62. Performance tuning
  63. Mean squared error
  64. Mean absolute error
  65. Precision, recall, and accuracy
  66. How can model performance be improved?
  67. Fetching the data to improve performance
  68. Switching machine learning algorithms
  69. Ensemble learning to improve performance
  70. Hands-on machine learning
  71. Python for machine learning
  72. Comparing Python 2.x with 3.x 
  73. Python installation 
  74. Python interactive development environment
  75. Jupyter Notebook installation
  76. Python packages
  77. NumPy
  78. SciPy
  79. Scikit-learn 
  80. pandas
  81. Matplotlib
  82. MongoDB with Python
  83. Installing MongoDB
  84. PyMongo
  85. Setting up the development and testing environment
  86. Use case
  87. Data
  88. Code
  89. Summary
  90. Time Series Analysis and Ensemble Modeling
  91. What is a time series?
  92. Time series analysis
  93. Stationarity of time series models
  94. Strictly stationary process
  95. Correlation in time series
  96. Autocorrelation
  97. Partial autocorrelation function
  98. Classes of time series models
  99. Stochastic time series model
  100. Artificial neural network time series model
  101. Support vector time series models
  102. Time series components
  103. Systematic models
  104. Non-systematic models
  105. Time series decomposition
  106. Level 
  107. Trend 
  108. Seasonality 
  109. Noise 
  110. Use cases for time series
  111. Signal processing
  112. Stock market predictions
  113. Weather forecasting
  114. Reconnaissance detection
  115. Time series analysis in cybersecurity
  116. Time series trends and seasonal spikes
  117. Detecting distributed denial of service with time series
  118. Dealing with the time element in time series
  119. Tackling the use case
  120. Importing packages
  121. Importing data in pandas
  122. Data cleansing and transformation
  123. Feature computation
  124. Predicting DDoS attacks
  125. ARMA
  126. ARIMA
  127. ARFIMA
  128. Ensemble learning methods
  129. Types of ensembling
  130. Averaging
  131. Majority vote
  132. Weighted average
  133. Types of ensemble algorithm
  134. Bagging
  135. Boosting
  136. Stacking
  137. Bayesian parameter averaging
  138. Bayesian model combination
  139. Bucket of models
  140. Cybersecurity with ensemble techniques
  141. Voting ensemble method to detect cyber attacks
  142. Summary
  143. Segregating Legitimate and Lousy URLs
  144. Introduction to the types of abnormalities in URLs
  145. URL blacklisting
  146. Drive-by download URLs
  147. Command and control URLs
  148. Phishing URLs
  149. Using heuristics to detect malicious pages
  150. Data for the analysis
  151. Feature extraction
  152. Lexical features
  153. Web-content-based features
  154. Host-based features
  155. Site-popularity features
  156. Using machine learning to detect malicious URLs 
  157. Logistic regression to detect malicious URLs
  158. Dataset
  159. Model
  160. TF-IDF
  161. SVM to detect malicious URLs
  162. Multiclass classification for URL classification
  163. One-versus-rest
  164. Summary
  165. Knocking Down CAPTCHAs
  166. Characteristics of CAPTCHA
  167. Using artificial intelligence to crack CAPTCHA
  168. Types of CAPTCHA
  169. reCAPTCHA
  170. No CAPTCHA reCAPTCHA
  171. Breaking a CAPTCHA
  172. Solving CAPTCHAs with a neural network
  173. Dataset 
  174. Packages
  175. Theory of CNN
  176. Model
  177. Code
  178. Training the model
  179. Testing the model 
  180. Summary
  181. Using Data Science to Catch Email Fraud and Spam
  182. Email spoofing 
  183. Bogus offers
  184. Requests for help
  185. Types of spam emails
  186. Deceptive emails
  187. CEO fraud
  188. Pharming 
  189. Dropbox phishing
  190. Google Docs phishing
  191. Spam detection
  192. Types of mail servers 
  193. Data collection from mail servers
  194. Using the Naive Bayes theorem to detect spam
  195. Laplace smoothing
  196. Featurization techniques that convert text-based emails into numeric values
  197. Log-space
  198. TF-IDF
  199. N-grams
  200. Tokenization
  201. Logistic regression spam filters
  202. Logistic regression
  203. Dataset
  204. Python
  205. Results
  206. Summary
  207. Efficient Network Anomaly Detection Using k-means
  208. Stages of a network attack
  209. Phase 1 – Reconnaissance 
  210. Phase 2 – Initial compromise 
  211. Phase 3 – Command and control 
  212. Phase 4 – Lateral movement
  213. Phase 5 – Target attainment 
  214. Phase 6 – Ex-filtration, corruption, and disruption 
  215. Dealing with lateral movement in networks
  216. Using Windows event logs to detect network anomalies
  217. Logon/Logoff events 
  218. Account logon events
  219. Object access events
  220. Account management events
  221. Active directory events
  222. Ingesting active directory data
  223. Data parsing
  224. Modeling
  225. Detecting anomalies in a network with k-means
  226. Network intrusion data
  227. Coding the network intrusion attack
  228. Model evaluation 
  229. Sum of squared errors
  230. Choosing k for k-means
  231. Normalizing features
  232. Manual verification
  233. Summary
  234. Decision Tree and Context-Based Malicious Event Detection
  235. Adware
  236. Bots
  237. Bugs
  238. Ransomware
  239. Rootkit
  240. Spyware
  241. Trojan horses
  242. Viruses
  243. Worms
  244. Malicious data injection within databases
  245. Malicious injections in wireless sensors
  246. Use case
  247. The dataset
  248. Importing packages 
  249. Features of the data
  250. Model
  251. Decision tree 
  252. Types of decision trees
  253. Categorical variable decision tree
  254. Continuous variable decision tree
  255. Gini coefficient
  256. Random forest
  257. Anomaly detection
  258. Isolation forest
  259. Supervised and outlier detection with Knowledge Discovery Databases (KDD)
  260. Revisiting malicious URL detection with decision trees
  261. Summary
  262. Catching Impersonators and Hackers Red Handed
  263. Understanding impersonation
  264. Different types of impersonation fraud 
  265. Impersonators gathering information
  266. How an impersonation attack is constructed
  267. Using data science to detect domains that are impersonations
  268. Levenshtein distance
  269. Finding domain similarity between malicious URLs
  270. Authorship attribution
  271. AA detection for tweets
  272. Difference between test and validation datasets
  273. Sklearn pipeline
  274. Naive Bayes classifier for multinomial models
  275. Identifying impersonation as a means of intrusion detection 
  276. Summary
  277. Changing the Game with TensorFlow
  278. Introduction to TensorFlow
  279. Installation of TensorFlow
  280. TensorFlow for Windows users
  281. Hello world in TensorFlow
  282. Importing the MNIST dataset
  283. Computation graphs
  284. What is a computation graph?
  285. Tensor processing unit
  286. Using TensorFlow for intrusion detection
  287. Summary
  288. Financial Fraud and How Deep Learning Can Mitigate It
  289. Machine learning to detect financial fraud
  290. Imbalanced data
  291. Handling imbalanced datasets
  292. Random under-sampling
  293. Random oversampling
  294. Cluster-based oversampling
  295. Synthetic minority oversampling technique
  296. Modified synthetic minority oversampling technique
  297. Detecting credit card fraud
  298. Logistic regression
  299. Loading the dataset
  300. Approach
  301. Logistic regression classifier – under-sampled data
  302. Tuning hyperparameters 
  303. Detailed classification reports
  304. Predictions on test sets and plotting a confusion matrix
  305. Logistic regression classifier – skewed data
  306. Investigating precision-recall curve and area
  307. Deep learning time
  308. Adam gradient optimizer
  309. Summary
  310. Case Studies
  311. Introduction to our password dataset
  312. Text feature extraction
  313. Feature extraction with scikit-learn
  314. Using the cosine similarity to quantify bad passwords
  315. Putting it all together
  316. Summary
  317. Other Books You May Enjoy
  318. Leave a review - let other readers know what you think

Identifying impersonation as a means of intrusion detection 

We will use AWID data to identify impersonation. AWID is a family of datasets focused on intrusion detection. The datasets consist of captured packet records and are released in both large (full) and small (reduced) versions; these versions are not subsets of one another.

See http://icsdweb.aegean.gr/awid for more information.

Each version has a training set (denoted as Trn) and a test set (denoted as Tst). The test set was not produced from the corresponding training set, so the two do not overlap.

Finally, a version is provided whose labels correspond to specific attacks (ATK), as well as a version in which the attack labels are grouped into three major classes (CLS). The two differ only in their labels:

Name             Classes  Size     Type   Records      Hours
AWID-ATK-F-Trn   10       Full     Train  162,375,247  96
AWID-ATK-F-Tst   17       Full     Test   48,524,866   12
AWID-CLS-F-Trn   4        Full     Train  162,375,247  96
AWID-CLS-F-Tst   4        Full     Test   48,524,866   12
AWID-ATK-R-Trn   10       Reduced  Train  1,795,575    1
AWID-ATK-R-Tst   15       Reduced  Test   575,643      1/3
AWID-CLS-R-Trn   4        Reduced  Train  1,795,575    1
AWID-CLS-R-Tst   4        Reduced  Test   530,643      1/3

 

This dataset has 155 attributes.

A detailed description is available at this link: http://icsdweb.aegean.gr/awid/features.html. The frame-level fields are as follows:

FIELD NAME | DESCRIPTION | TYPE | VERSIONS
comment | Comment | Character string | 1.8.0 to 1.8.15
frame.cap_len | Frame length stored into the capture file | Unsigned integer, 4 bytes | 1.0.0 to 2.6.4
frame.coloring_rule.name | Coloring Rule Name | Character string | 1.0.0 to 2.6.4
frame.coloring_rule.string | Coloring Rule String | Character string | 1.0.0 to 2.6.4
frame.comment | Comment | Character string | 1.10.0 to 2.6.4
frame.comment.expert | Formatted comment | Label | 1.12.0 to 2.6.4
frame.dlt | WTAP_ENCAP | Signed integer, 2 bytes | 1.8.0 to 1.8.15
frame.encap_type | Encapsulation type | Signed integer, 2 bytes | 1.10.0 to 2.6.4
frame.file_off | File Offset | Signed integer, 8 bytes | 1.0.0 to 2.6.4
frame.ignored | Frame is ignored | Boolean | 1.4.0 to 2.6.4
frame.incomplete | Incomplete dissector | Label | 2.0.0 to 2.6.4
frame.interface_description | Interface description | Character string | 2.4.0 to 2.6.4
frame.interface_id | Interface id | Unsigned integer, 4 bytes | 1.8.0 to 2.6.4
frame.interface_name | Interface name | Character string | 2.4.0 to 2.6.4
frame.len | Frame length on the wire | Unsigned integer, 4 bytes | 1.0.0 to 2.6.4
frame.link_nr | Link Number | Unsigned integer, 2 bytes | 1.0.0 to 2.6.4
frame.marked | Frame is marked | Boolean | 1.0.0 to 2.6.4
frame.md5_hash | Frame MD5 Hash | Character string | 1.2.0 to 2.6.4
frame.number | Frame Number | Unsigned integer, 4 bytes | 1.0.0 to 2.6.4
frame.offset_shift | Time shift for this packet | Time offset | 1.8.0 to 2.6.4
frame.p2p_dir | Point-to-Point Direction | Signed integer, 1 byte | 1.0.0 to 2.6.4
frame.p_prot_data | Number of per-protocol-data | Unsigned integer, 4 bytes | 1.10.0 to 1.12.13
frame.packet_flags | Packet flags | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4
frame.packet_flags_crc_error | CRC error | Boolean | 1.10.0 to 2.6.4
frame.packet_flags_direction | Direction | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4
frame.packet_flags_fcs_length | FCS length | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4
frame.packet_flags_packet_too_error | Packet too long error | Boolean | 1.10.0 to 2.6.4
frame.packet_flags_packet_too_short_error | Packet too short error | Boolean | 1.10.0 to 2.6.4
frame.packet_flags_preamble_error | Preamble error | Boolean | 1.10.0 to 2.6.4
frame.packet_flags_reception_type | Reception type | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4
frame.packet_flags_reserved | Reserved | Unsigned integer, 4 bytes | 1.10.0 to 2.6.4
frame.packet_flags_start_frame_delimiter_error | Start frame delimiter error | Boolean | 1.10.0 to 2.6.4
frame.packet_flags_symbol_error | Symbol error | Boolean | 1.10.0 to 2.6.4
frame.packet_flags_unaligned_frame_error | Unaligned frame error | Boolean | 1.10.0 to 2.6.4
frame.packet_flags_wrong_inter_frame_gap_error | Wrong interframe gap error | Boolean | 1.10.0 to 2.6.4
frame.pkt_len | Frame length on the wire | Unsigned integer, 4 bytes | 1.0.0 to 1.0.16
frame.protocols | Protocols in frame | Character string | 1.0.0 to 2.6.4
frame.ref_time | This is a Time Reference frame | Label | 1.0.0 to 2.6.4
frame.time | Arrival Time | Date and time | 1.0.0 to 2.6.4
frame.time_delta | Time delta from previous captured frame | Time offset | 1.0.0 to 2.6.4
frame.time_delta_displayed | Time delta from previous displayed frame | Time offset | 1.0.0 to 2.6.4
frame.time_epoch | Epoch Time | Time offset | 1.4.0 to 2.6.4
frame.time_invalid | Arrival Time: Fractional second out of range (0-1000000000) | Label | 1.0.0 to 2.6.4
frame.time_relative | Time since reference or first frame | Time offset | 1.0.0 to 2.6.4

The sample dataset is available in the book's GitHub repository. The intrusion data is converted into a DataFrame using the Python pandas library:

import pandas as pd

The features discussed earlier become the column names of the DataFrame:

# get the names of the features
features = ['frame.interface_id', 'frame.dlt', 'frame.offset_shift',
            'frame.time_epoch', 'frame.time_delta', 'frame.time_delta_displayed',
            'frame.time_relative', 'frame.len', 'frame.cap_len', 'frame.marked',
            'frame.ignored', 'radiotap.version', 'radiotap.pad', 'radiotap.length',
            'radiotap.present.tsft', 'radiotap.present.flags', 'radiotap.present.rate',
            'radiotap.present.channel', 'radiotap.present.fhss',
            'radiotap.present.dbm_antsignal', 'radiotap.present.dbm_antnoise',
            'radiotap.present.lock_quality', 'radiotap.present.tx_attenuation',
            'radiotap.present.db_tx_attenuation', 'radiotap.present.dbm_tx_power',
            'radiotap.present.antenna', 'radiotap.present.db_antsignal',
            'radiotap.present.db_antnoise',
            ........  # remaining feature names elided in the original
            'wlan.qos.amsdupresent', 'wlan.qos.buf_state_indicated', 'wlan.qos.bit4',
            'wlan.qos.txop_dur_req', 'wlan.qos.buf_state_indicated',
            'data.len', 'class']

Next, we import the training dataset and count the number of rows and columns available in the dataset:

# import a training set
awid = pd.read_csv("../data/AWID-CLS-R-Trn.csv", header=None, names=features)
# see the number of rows/columns
awid.shape

The output can be seen as follows:

Out[4]: (1795575, 155)

The dataset uses ? as a placeholder for missing values. We will eventually have to replace these with None so that pandas recognizes them as null. First, we peek at the data:

awid.head()

The preceding code displays the first five rows of the table (5 rows × 155 columns). Now we will look at the distribution of the response variable:

awid['class'].value_counts(normalize=True)

The output is as follows:

normal           0.909564
injection        0.036411
impersonation    0.027023
flooding         0.027002
Name: class, dtype: float64

At first glance, it appears there are no null values, because the ? instances are not yet recognized as missing:

awid.isna().sum()
frame.interface_id                                0
frame.dlt                                         0
frame.offset_shift                                0
frame.time_epoch                                  0
frame.time_delta                                  0
frame.time_delta_displayed                        0
frame.time_relative                               0
frame.len                                         0
frame.cap_len                                     0
frame.marked                                      0
frame.ignored                                     0
radiotap.version                                  0
radiotap.pad                                      0
radiotap.length                                   0
radiotap.present.tsft                             0
radiotap.present.flags                            0
radiotap.present.rate                             0
radiotap.present.channel                          0
radiotap.present.fhss                             0
radiotap.present.dbm_antsignal                    0
radiotap.present.dbm_antnoise                     0
radiotap.present.lock_quality                     0
radiotap.present.tx_attenuation                   0
radiotap.present.db_tx_attenuation                0
radiotap.present.dbm_tx_power                     0
radiotap.present.antenna                          0
radiotap.present.db_antsignal                     0
radiotap.present.db_antnoise                      0
radiotap.present.rxflags                          0
radiotap.present.xchannel                         0
                                                 ..
wlan_mgt.rsn.version                              0
wlan_mgt.rsn.gcs.type                             0
wlan_mgt.rsn.pcs.count                            0
wlan_mgt.rsn.akms.count                           0
wlan_mgt.rsn.akms.type                            0
wlan_mgt.rsn.capabilities.preauth                 0
wlan_mgt.rsn.capabilities.no_pairwise             0
wlan_mgt.rsn.capabilities.ptksa_replay_counter    0
wlan_mgt.rsn.capabilities.gtksa_replay_counter    0
wlan_mgt.rsn.capabilities.mfpr                    0
wlan_mgt.rsn.capabilities.mfpc                    0
wlan_mgt.rsn.capabilities.peerkey                 0
wlan_mgt.tcprep.trsmt_pow                         0
wlan_mgt.tcprep.link_mrg                          0
wlan.wep.iv                                       0
wlan.wep.key                                      0
wlan.wep.icv                                      0
wlan.tkip.extiv                                   0
wlan.ccmp.extiv                                   0
wlan.qos.tid                                      0
wlan.qos.priority                                 0
wlan.qos.eosp                                     0
wlan.qos.ack                                      0
wlan.qos.amsdupresent                             0
wlan.qos.buf_state_indicated                      0
wlan.qos.bit4                                     0
wlan.qos.txop_dur_req                             0
wlan.qos.buf_state_indicated.1                    0
data.len                                          0
class                                             0
Length: 155, dtype: int64

We replace the ? marks with None:

awid.replace({"?": None}, inplace=True)
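To make the effect of this replacement concrete, here is a minimal, self-contained sketch on toy data (the frame and column names are invented for illustration): pandas does not treat the ? strings as missing until they are replaced.

```python
import pandas as pd

# toy frame that uses "?" as a missing-value placeholder, as AWID does
df = pd.DataFrame({"a": ["1", "?", "3"], "b": ["x", "y", "?"]})

print(df.isna().sum().sum())  # -> 0: "?" is an ordinary string, not a null

df.replace({"?": None}, inplace=True)
print(df.isna().sum().sum())  # -> 2: the placeholders now count as missing
```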

We count how many values are now missing in each column:

awid.isna().sum()

The output will be as follows:

frame.interface_id                                      0
frame.dlt                                         1795575
frame.offset_shift                                      0
frame.time_epoch                                        0
frame.time_delta                                        0
frame.time_delta_displayed                              0
frame.time_relative                                     0
frame.len                                               0
frame.cap_len                                           0
frame.marked                                            0
frame.ignored                                           0
radiotap.version                                        0
radiotap.pad                                            0
radiotap.length                                         0
radiotap.present.tsft                                   0
radiotap.present.flags                                  0
radiotap.present.rate                                   0
radiotap.present.channel                                0
radiotap.present.fhss                                   0
radiotap.present.dbm_antsignal                          0
radiotap.present.dbm_antnoise                           0
radiotap.present.lock_quality                           0
radiotap.present.tx_attenuation                         0
radiotap.present.db_tx_attenuation                      0
radiotap.present.dbm_tx_power                           0
radiotap.present.antenna                                0
radiotap.present.db_antsignal                           0
radiotap.present.db_antnoise                            0
radiotap.present.rxflags                                0
radiotap.present.xchannel                               0
                                                   ...   
wlan_mgt.rsn.version                              1718631
wlan_mgt.rsn.gcs.type                             1718631
wlan_mgt.rsn.pcs.count                            1718631
wlan_mgt.rsn.akms.count                           1718633
wlan_mgt.rsn.akms.type                            1718651
wlan_mgt.rsn.capabilities.preauth                 1718633
wlan_mgt.rsn.capabilities.no_pairwise             1718633
wlan_mgt.rsn.capabilities.ptksa_replay_counter    1718633
wlan_mgt.rsn.capabilities.gtksa_replay_counter    1718633
wlan_mgt.rsn.capabilities.mfpr                    1718633
wlan_mgt.rsn.capabilities.mfpc                    1718633
wlan_mgt.rsn.capabilities.peerkey                 1718633
wlan_mgt.tcprep.trsmt_pow                         1795536
wlan_mgt.tcprep.link_mrg                          1795536
wlan.wep.iv                                        944820
wlan.wep.key                                       909831
wlan.wep.icv                                       944820
wlan.tkip.extiv                                   1763655
wlan.ccmp.extiv                                   1792506
wlan.qos.tid                                      1133234
wlan.qos.priority                                 1133234
wlan.qos.eosp                                     1279874
wlan.qos.ack                                      1133234
wlan.qos.amsdupresent                             1134226
wlan.qos.buf_state_indicated                      1795575
wlan.qos.bit4                                     1648935
wlan.qos.txop_dur_req                             1648935
wlan.qos.buf_state_indicated.1                    1279874
data.len                                           903021
class                                                   0
Length: 155, dtype: int64

The goal here is to remove columns that have over 50% of their data missing:

columns_with_mostly_null_data = awid.columns[awid.isnull().mean() >= 0.5]

We see that 72 columns are affected:

columns_with_mostly_null_data.shape

The output is as follows:

(72,)
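The same filter can be demonstrated on a toy frame (invented data; the 0.5 threshold matches the one used above):

```python
import pandas as pd

df = pd.DataFrame({
    "mostly_missing": [None, None, None, 1],  # 75% missing -> dropped
    "mostly_present": [1, 2, None, 4],        # 25% missing -> kept
})

# columns where at least half the values are null
bad_cols = df.columns[df.isnull().mean() >= 0.5]
df.drop(bad_cols, axis=1, inplace=True)
print(list(df.columns))  # -> ['mostly_present']
```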

We drop the columns with over half of their data missing:

awid.drop(columns_with_mostly_null_data, axis=1, inplace=True)
awid.shape

The preceding code gives the following output:

(1795575, 83)

Drop the rows that have missing values:

awid.dropna(inplace=True) # drop rows with null data

We lose 456,169 rows:

awid.shape

The following is the output of the preceding code:

(1339406, 83)

However, dropping doesn't affect our distribution too much:

# 0.878763 is our null accuracy. Our model must beat this number to be a contender
awid['class'].value_counts(normalize=True)

The output can be seen as follows:

normal           0.878763
injection        0.048812
impersonation    0.036227
flooding         0.036198
Name: class, dtype: float64
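Null accuracy is simply the score a model would get by always predicting the majority class; a tiny sketch on invented labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# toy labels dominated by "normal", mimicking the skew in the AWID classes
y = np.array(["normal"] * 9 + ["attack"])
null_preds = np.full_like(y, "normal")  # always predict the majority class
print(accuracy_score(y, null_preds))  # -> 0.9, the null accuracy
```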

Now we execute the following code:

# only select numeric columns for our ML algorithms; there should be more
awid.select_dtypes(['number']).shape

The output is as follows:

(1339406, 45)

Not every column has been parsed as numeric yet, so we convert what we can:

# transform all columns into numerical dtypes
for col in awid.columns:
    awid[col] = pd.to_numeric(awid[col], errors='ignore')
# that makes more sense
awid.select_dtypes(['number']).shape

The preceding code gives the following output:

(1339406, 74)
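The column count jumps from 45 to 74 because pd.to_numeric converts numeric-looking string columns, while errors='ignore' (used in the book's loop, and deprecated in newer pandas releases) leaves genuinely non-numeric columns untouched instead of raising. A toy illustration:

```python
import pandas as pd

# numeric-looking strings convert to a numeric dtype
converted = pd.to_numeric(pd.Series(["1", "2", "3"]))
print(converted.dtype)  # -> int64

# non-numeric values are returned unchanged instead of raising
text = pd.to_numeric(pd.Series(["a", "b", "c"]), errors='ignore')
print(text.dtype)  # -> object
```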

Now execute the awid.describe() code as shown in the following snippet:

# basic descriptive statistics
awid.describe()

The output will display a table of 8 rows × 74 columns.

X, y = awid.select_dtypes(['number']), awid['class']
# do a basic naive bayes fitting
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

# fit our model to the data
nb.fit(X, y)
GaussianNB(priors=None)
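As a sanity check, the same fit/score pattern can be run on data anyone can generate (synthetic data from make_classification, not the AWID frame):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# synthetic stand-in for the numeric AWID features
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

nb_toy = GaussianNB()
nb_toy.fit(X_toy, y_toy)
# training accuracy should comfortably beat chance on this easy toy problem
print(nb_toy.score(X_toy, y_toy) > 0.5)
```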

We read in the test data and apply the same transformations to it, to match the training data:

awid_test = pd.read_csv("../data/AWID-CLS-R-Tst.csv", header=None, names=features)
# drop the problematic columns
awid_test.drop(columns_with_mostly_null_data, axis=1, inplace=True)
# replace ? with None
awid_test.replace({"?": None}, inplace=True)
# drop the rows with null data
awid_test.dropna(inplace=True)  # drop rows with null data
# convert columns to numerical values
for col in awid_test.columns:
    awid_test[col] = pd.to_numeric(awid_test[col], errors='ignore')
awid_test.shape

The output can be seen as follows:

Out[45]: (389185, 83)

To check a basic metric, the accuracy of the model on the test set:

from sklearn.metrics import accuracy_score
X_test = awid_test.select_dtypes(['number'])
y_test = awid_test['class']

# simple function to test the accuracy of a model fitted on training data on our testing data
def get_test_accuracy_of(model):
    y_preds = model.predict(X_test)
    return accuracy_score(y_preds, y_test)
# naive bayes does very poorly on its own!
get_test_accuracy_of(nb)

The output is seen as follows:

0.26535452291326744

Next, we try logistic regression on the same problem:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X, y)

# Logistic Regressions does even worse
get_test_accuracy_of(lr)

The following is the output:

0.015773989233911892

Importing a decision tree classifier, we get the following:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()

tree.fit(X, y)

# Tree does very well!
get_test_accuracy_of(tree)

The output looks like this:

0.9336639387437851

We inspect the decision tree's Gini-based feature importances:

pd.DataFrame({'feature':awid.select_dtypes(['number']).columns, 
              'importance':tree.feature_importances_}).sort_values('importance', ascending=False).head(10)

We will get output like this:

    feature                   importance
6   frame.len                 0.230466
3   frame.time_delta          0.221151
68  wlan.fc.protected         0.145760
70  wlan.duration             0.127612
5   frame.time_relative       0.079571
7   frame.cap_len             0.059702
62  wlan.fc.type              0.040192
72  wlan.seq                  0.026807
65  wlan.fc.retry             0.019807
58  radiotap.dbm_antsignal    0.014195

We fit a random forest for comparison:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()
forest.fit(X, y)

# Random Forest does slightly worse
get_test_accuracy_of(forest)

The output can be seen as follows:

0.9297326464277914

Create a pipeline that will scale the numerical data and then feed the resulting data into a decision tree:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", DecisionTreeClassifier())
])

# try varying levels of depth
params = {
    "classifier__max_depth": [None, 3, 5, 10], 
         }

# instantiate a gridsearch module
grid = GridSearchCV(pipeline, params)
# fit the module
grid.fit(X, y)

# test the best model
get_test_accuracy_of(grid.best_estimator_)

The following shows the output:

0.9254930174595629
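To see which depth the grid search actually selected, one can inspect best_params_ on the fitted grid. The sketch below runs the same pipeline shape on synthetic data, since the AWID frames are not bundled here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the AWID features
X_toy, y_toy = make_classification(n_samples=300, random_state=0)

pipeline = Pipeline([
    ("preprocessing", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
grid = GridSearchCV(pipeline, {"classifier__max_depth": [None, 3, 5, 10]})
grid.fit(X_toy, y_toy)

# the winning hyperparameter setting and its cross-validated score
print(grid.best_params_)
print(round(grid.best_score_, 3))
```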

We try the same thing with a random forest:

preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", RandomForestClassifier())
])

# try varying levels of depth
params = {
    "classifier__max_depth": [None, 3, 5, 10], 
         }

grid = GridSearchCV(pipeline, params)
grid.fit(X, y)
# best accuracy so far!
get_test_accuracy_of(grid.best_estimator_)

The final accuracy is as follows:

0.9348176317175636