Hands-On Machine Learning for Cybersecurity
by Sinan Ozdemir, published by Packt Publishing, 2018

Using TensorFlow for intrusion detection

We will use the intrusion detection problem again to detect anomalies. Initially, we will import pandas, as shown:

import pandas as pd

We get the names of the features from the dataset at this link: http://icsdweb.aegean.gr/awid/features.html.

We define the feature names as a list, as shown here:

features = ['frame.interface_id',
'frame.dlt',
'frame.offset_shift',
'frame.time_epoch',
'frame.time_delta',
'frame.time_delta_displayed',
'frame.time_relative',
'frame.len',
'frame.cap_len',
'frame.marked',
'frame.ignored',
'radiotap.version',
'radiotap.pad',
'radiotap.length',
'radiotap.present.tsft',
'radiotap.present.flags',
'radiotap.present.rate',
'radiotap.present.channel',
'radiotap.present.fhss',
'radiotap.present.dbm_antsignal',
...

The preceding list contains all 155 features in the AWID dataset. We import the training set and see the number of rows and columns:

awid = pd.read_csv("../data/AWID-CLS-R-Trn.csv", header=None, names=features)

# see the number of rows/columns
awid.shape

We can ignore the warning:

/Users/sinanozdemir/Desktop/cyber/env/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2714: DtypeWarning: Columns (37,38,39,40,41,42,43,44,45,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,74,88) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

The shape is a tuple giving the number of rows and columns in the training data, which has 155 features:

(1795575, 155)

The dataset marks missing values with ?, which we will eventually have to replace:

# they use ? as a null attribute.
awid.head()

The preceding code will produce a table of 5 rows × 155 columns as an output.

We look at the distribution of the response variable:

awid['class'].value_counts(normalize=True)

normal 0.909564
injection 0.036411
impersonation 0.027023
flooding 0.027002
Name: class, dtype: float64

We check for NAs:

# isna() claims there are no null values because of the ?'s
awid.isna().sum()

The output looks like this:

frame.interface_id 0
frame.dlt 1795575
frame.offset_shift 0
frame.time_epoch 0
frame.time_delta 0
frame.time_delta_displayed 0
frame.time_relative 0
frame.len 0
frame.cap_len 0
frame.marked 0
frame.ignored 0
radiotap.version 0
radiotap.pad 0
radiotap.length 0
radiotap.present.tsft 0
radiotap.present.flags 0
radiotap.present.rate 0
radiotap.present.channel 0
radiotap.present.fhss 0
radiotap.present.dbm_antsignal 0
radiotap.present.dbm_antnoise 0
radiotap.present.lock_quality 0
radiotap.present.tx_attenuation 0
radiotap.present.db_tx_attenuation 0
radiotap.present.dbm_tx_power 0
radiotap.present.antenna 0
radiotap.present.db_antsignal 0
radiotap.present.db_antnoise 0
radiotap.present.rxflags 0
radiotap.present.xchannel 0
...
wlan_mgt.rsn.version 1718631
wlan_mgt.rsn.gcs.type 1718631
wlan_mgt.rsn.pcs.count 1718631
wlan_mgt.rsn.akms.count 1718633
wlan_mgt.rsn.akms.type 1718651
wlan_mgt.rsn.capabilities.preauth 1718633
wlan_mgt.rsn.capabilities.no_pairwise 1718633
wlan_mgt.rsn.capabilities.ptksa_replay_counter 1718633
wlan_mgt.rsn.capabilities.gtksa_replay_counter 1718633
wlan_mgt.rsn.capabilities.mfpr 1718633
wlan_mgt.rsn.capabilities.mfpc 1718633
wlan_mgt.rsn.capabilities.peerkey 1718633
wlan_mgt.tcprep.trsmt_pow 1795536
wlan_mgt.tcprep.link_mrg 1795536
wlan.wep.iv 944820
wlan.wep.key 909831
wlan.wep.icv 944820
wlan.tkip.extiv 1763655
wlan.ccmp.extiv 1792506
wlan.qos.tid 1133234
wlan.qos.priority 1133234
wlan.qos.eosp 1279874
wlan.qos.ack 1133234
wlan.qos.amsdupresent 1134226
wlan.qos.buf_state_indicated 1795575
wlan.qos.bit4 1648935
wlan.qos.txop_dur_req 1648935
wlan.qos.buf_state_indicated.1 1279874
data.len 903021
class 0
Length: 155, dtype: int64

We replace all ? marks with None:

# replace the ? marks with None
awid.replace({"?": None}, inplace=True)
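As a small illustration of why the ? placeholders matter, the following sketch uses a made-up three-row CSV (the column names and values are hypothetical, not taken from AWID). It shows that isna() does not count ? as missing until it is replaced, and that passing na_values="?" to read_csv is an alternative way to get the same result at load time:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data; "?" marks a missing value, as in the AWID files.
raw = "a,b\n1,?\n2,5\n?,6\n"

toy = pd.read_csv(io.StringIO(raw))
missing_before = int(toy.isna().sum().sum())   # "?" is just a string here

# Option 1: replace the placeholder after loading.
replaced = toy.replace("?", np.nan)
missing_after = int(replaced.isna().sum().sum())

# Option 2: tell read_csv to treat "?" as missing while parsing.
parsed = pd.read_csv(io.StringIO(raw), na_values="?")
missing_parsed = int(parsed.isna().sum().sum())

print(missing_before, missing_after, missing_parsed)
```

With the second option the columns are also parsed as numbers directly, because the placeholder never reaches the column as a string.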

The sum shows a large amount of missing data:

# Many missing pieces of data!
awid.isna().sum()

Here is what the output looks like:


frame.interface_id 0
frame.dlt 1795575
frame.offset_shift 0
frame.time_epoch 0
frame.time_delta 0
frame.time_delta_displayed 0
frame.time_relative 0
frame.len 0
frame.cap_len 0
frame.marked 0
frame.ignored 0
radiotap.version 0
radiotap.pad 0
radiotap.length 0
radiotap.present.tsft 0
radiotap.present.flags 0
radiotap.present.rate 0
radiotap.present.channel 0
radiotap.present.fhss 0
radiotap.present.dbm_antsignal 0
radiotap.present.dbm_antnoise 0
radiotap.present.lock_quality 0
radiotap.present.tx_attenuation 0
radiotap.present.db_tx_attenuation 0
radiotap.present.dbm_tx_power 0
radiotap.present.antenna 0
radiotap.present.db_antsignal 0
radiotap.present.db_antnoise 0
radiotap.present.rxflags 0
radiotap.present.xchannel 0
...
wlan_mgt.rsn.version 1718631
wlan_mgt.rsn.gcs.type 1718631
wlan_mgt.rsn.pcs.count 1718631
wlan_mgt.rsn.akms.count 1718633
wlan_mgt.rsn.akms.type 1718651
wlan_mgt.rsn.capabilities.preauth 1718633
wlan_mgt.rsn.capabilities.no_pairwise 1718633
wlan_mgt.rsn.capabilities.ptksa_replay_counter 1718633
wlan_mgt.rsn.capabilities.gtksa_replay_counter 1718633
wlan_mgt.rsn.capabilities.mfpr 1718633
wlan_mgt.rsn.capabilities.mfpc 1718633
wlan_mgt.rsn.capabilities.peerkey 1718633
wlan_mgt.tcprep.trsmt_pow 1795536
wlan_mgt.tcprep.link_mrg 1795536
wlan.wep.iv 944820
wlan.wep.key 909831
wlan.wep.icv 944820
wlan.tkip.extiv 1763655
wlan.ccmp.extiv 1792506
wlan.qos.tid 1133234
wlan.qos.priority 1133234
wlan.qos.eosp 1279874
wlan.qos.ack 1133234
wlan.qos.amsdupresent 1134226
wlan.qos.buf_state_indicated 1795575
wlan.qos.bit4 1648935
wlan.qos.txop_dur_req 1648935
wlan.qos.buf_state_indicated.1 1279874
data.len 903021

Here, we identify the columns that have at least 50% of their data missing:

columns_with_mostly_null_data = awid.columns[awid.isnull().mean() >= 0.5]

# 72 columns are going to be affected!
columns_with_mostly_null_data.shape

Out[11]:
(72,)

We drop the columns with at least 50% of their data missing:

awid.drop(columns_with_mostly_null_data, axis=1, inplace=True)

We check the shape again and see that 83 columns remain:

awid.shape

(1795575, 83)
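The column filter works because isnull().mean() returns the fraction of missing values in each column. A minimal sketch on a hypothetical four-row frame (the column names a, b, and c are made up for illustration) shows the same steps in isolation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" is 75% missing, column "c" is 25% missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# isnull().mean() gives the fraction of missing values per column.
null_fraction = toy.isnull().mean()
mostly_null = toy.columns[null_fraction >= 0.5]

# Drop only the columns at or above the 50% threshold.
reduced = toy.drop(columns=mostly_null)
print(list(mostly_null), reduced.shape)
```

Only column b crosses the threshold here, so the reduced frame keeps a and c.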

Now, drop the rows that have missing values:

awid.dropna(inplace=True)  # drop rows with null data

We lost 456,169 rows:

awid.shape

(1339406, 83)

However, it doesn't affect our distribution too much:

# 0.878763 is our null accuracy. Our model must be better than this number to be a contender

awid['class'].value_counts(normalize=True)

normal 0.878763
injection 0.048812
impersonation 0.036227
flooding 0.036198
Name: class, dtype: float64
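The null accuracy mentioned in the comment is the accuracy obtained by always predicting the most frequent class. A minimal sketch, using a hypothetical four-element label series rather than the real awid['class'] column:

```python
import pandas as pd

# Hypothetical labels; the real ones come from awid['class'].
labels = pd.Series(["normal", "normal", "normal", "flooding"])

# Null accuracy: the accuracy of always predicting the most frequent class.
null_accuracy = labels.value_counts(normalize=True).max()
majority_class = labels.value_counts().idxmax()
print(majority_class, null_accuracy)
```

Any model whose test accuracy falls below this number is doing worse than a constant guess.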

We only select numerical columns for our ML algorithms. Only 45 columns are detected as numerical at this point, but there should be more:

awid.select_dtypes(['number']).shape

(1339406, 45)

We try to convert every column to a numerical dtype; columns that cannot be converted are left unchanged:

for col in awid.columns:
    awid[col] = pd.to_numeric(awid[col], errors='ignore')
# that makes more sense
awid.select_dtypes(['number']).shape

The output can be seen here:

Out[19]:

(1339406, 74)
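Recent pandas releases deprecate errors='ignore' in to_numeric, so on a newer installation the same idea can be sketched with an explicit try/except. The helper name to_numeric_where_possible and the toy columns below are hypothetical, introduced only for this illustration:

```python
import pandas as pd

# Hypothetical toy frame: one column of numbers stored as text, one of real text.
toy = pd.DataFrame({
    "length": ["10", "20", "30"],
    "state": ["up", "down", "up"],
})

def to_numeric_where_possible(frame):
    """Return a copy in which every fully convertible column is numeric."""
    converted = frame.copy()
    for col in converted.columns:
        try:
            converted[col] = pd.to_numeric(converted[col])
        except (ValueError, TypeError):
            pass  # leave columns that cannot be parsed untouched
    return converted

converted = to_numeric_where_possible(toy)
numeric_columns = list(converted.select_dtypes(["number"]).columns)
print(numeric_columns)
```

The text column survives unchanged, which matches the behavior the loop above relies on.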

We derive basic descriptive statistics:

awid.describe()

Executing the preceding code gives a table of 8 rows × 74 columns. We then split the data into the numerical features X and the response y:

X, y = awid.select_dtypes(['number']), awid['class']

We start with a basic Gaussian Naive Bayes model and fit it to the data:

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

nb.fit(X, y)

The fitted estimator is displayed as follows:

GaussianNB(priors=None, var_smoothing=1e-09)

We read in the test data and do the same transformations to it, to match the training data:

awid_test = pd.read_csv("../data/AWID-CLS-R-Tst.csv", header=None, names=features)

# drop the problematic columns
awid_test.drop(columns_with_mostly_null_data, axis=1, inplace=True)

# replace ? with None
awid_test.replace({"?": None}, inplace=True)

# drop the rows with null data
awid_test.dropna(inplace=True)

# convert columns to numerical values
for col in awid_test.columns:
    awid_test[col] = pd.to_numeric(awid_test[col], errors='ignore')

awid_test.shape

The output is as follows:

Out[23]:

(389185, 83)

We compute the basic metric, accuracy:

from sklearn.metrics import accuracy_score

We define a simple function to test the accuracy of a model fitted on training data by using our testing data:

X_test = awid_test.select_dtypes(['number'])
y_test = awid_test['class']

def get_test_accuracy_of(model):
    # predict on the held-out test features and compare with the true labels
    y_preds = model.predict(X_test)
    return accuracy_score(y_test, y_preds)

# naive bayes does very poorly on its own!
get_test_accuracy_of(nb)

The output can be seen here:

Out[25]:

0.26535452291326744
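Accuracy alone does not show which classes a model gets wrong, which matters when one class dominates the data. As a hedged sketch (the labels and predictions below are invented, not produced by the models in this section), a confusion matrix and per-class recall make the errors visible:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels and predictions for illustration only.
y_true = ["normal", "normal", "normal", "normal", "flooding", "injection"]
y_pred = ["normal", "normal", "normal", "normal", "normal", "normal"]

labels = ["flooding", "injection", "normal"]

accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, in `labels` order.
matrix = confusion_matrix(y_true, y_pred, labels=labels)
# Recall per class: the fraction of each true class that was found.
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

print(accuracy)
print(matrix)
print(per_class_recall)
```

In this toy case the classifier never predicts an attack class, so its recall on both attack classes is zero even though two thirds of its predictions are correct.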

We fit a logistic regression model, which performs even worse:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X, y)

# Logistic regression does even worse
get_test_accuracy_of(lr)

We can ignore these warnings:

/Users/sinanozdemir/Desktop/cyber/env/lib/python2.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
/Users/sinanozdemir/Desktop/cyber/env/lib/python2.7/site-packages/sklearn/linear_model/logistic.py:459: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
"this warning.", FutureWarning)

The following shows the output:

Out[26]:

0.015773989233911892

We test with DecisionTreeClassifier as shown here:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()

tree.fit(X, y)

# Tree does very well!
get_test_accuracy_of(tree)

The output can be seen as follows:

Out[27]:

0.9280830453383352

We inspect the Gini-based importance of each feature in the decision tree as follows:

pd.DataFrame({'feature':awid.select_dtypes(['number']).columns, 
'importance':tree.feature_importances_}).sort_values('importance', ascending=False).head(10)

The output of the preceding code gives the following table:

    feature                     importance
7   frame.cap_len               0.222489
4   frame.time_delta_displayed  0.221133
68  wlan.fc.protected           0.146001
70  wlan.duration               0.127674
5   frame.time_relative         0.077353
6   frame.len                   0.067667
62  wlan.fc.type                0.039926
72  wlan.seq                    0.027947
65  wlan.fc.retry               0.019839
58  radiotap.dbm_antsignal      0.014197
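The importances reported by a scikit-learn decision tree are normalized, so they sum to 1 across all features. The following sketch checks this on synthetic stand-in data generated with make_classification (an assumption for illustration; the feature names f0 to f4 are made up):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 5 features, 2 of them informative.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_toy, columns=["f0", "f1", "f2", "f3", "f4"])

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Pair each feature name with its importance and rank from high to low.
importances = pd.DataFrame({
    "feature": X_toy.columns,
    "importance": toy_tree.feature_importances_,
}).sort_values("importance", ascending=False)

total_importance = importances["importance"].sum()
print(importances)
print(total_importance)
```

Because the values are relative shares, an importance of 0.22 means that the feature accounts for roughly a fifth of the total impurity reduction in that particular tree.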

 

We fit a RandomForestClassifier as shown here:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()

forest.fit(X, y)

# Random Forest does slightly better than the single tree
get_test_accuracy_of(forest)

We can ignore this warning:

/Users/sinanozdemir/Desktop/cyber/env/lib/python2.7/site-packages/sklearn/ensemble/forest.py:248: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)

The following is the output:

Out[29]:

0.9357349332579622

We create a pipeline that will scale the numerical data and then feed the resulting data into a decision tree:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", DecisionTreeClassifier())
])

# try varying levels of depth
params = {
    "classifier__max_depth": [None, 3, 5, 10],
}

# instantiate a gridsearch module
grid = GridSearchCV(pipeline, params)
# fit the module
grid.fit(X, y)

# test the best model
get_test_accuracy_of(grid.best_estimator_)

We can ignore these warnings:

/Users/sinanozdemir/Desktop/cyber/env/lib/python2.7/site-packages/sklearn/model_selection/_split.py:1943: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
/Users/sinanozdemir/Desktop/cyber/env/lib/python2.7/site-packages/sklearn/preprocessing/data.py:617: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
/Users/sinanozdemir/Desktop/cyber/env/lib/python2.7/site-packages/sklearn/base.py:465: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, y, **fit_params).transform(X)
/Users/sinanozdemir/Desktop/cyber/env/lib/python2.7/site-packages/sklearn/pipeline.py:451: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  Xt = transform.transform(Xt)

The output is as follows:

Out[30]:

0.926258720145946
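The parameter name classifier__max_depth follows the scikit-learn convention of joining a pipeline step name and a parameter name with a double underscore. A minimal sketch of the same pattern on synthetic stand-in data (an assumption for illustration, with cv set explicitly so the result does not depend on the library default):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), much smaller than AWID.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

toy_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step.
toy_params = {"classifier__max_depth": [None, 3, 5, 10]}

# Three-fold cross-validation over the four candidate depths.
toy_grid = GridSearchCV(toy_pipeline, toy_params, cv=3)
toy_grid.fit(X_toy, y_toy)

print(toy_grid.best_params_, round(toy_grid.best_score_, 3))
```

After fitting, best_estimator_ is the whole pipeline refitted on all of the training data with the winning depth, which is why it can be passed straight to a scoring helper.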

We try the same thing with a random forest:

preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", RandomForestClassifier())
])

# try varying levels of depth
params = {
    "classifier__max_depth": [None, 3, 5, 10],
}

grid = GridSearchCV(pipeline, params)
grid.fit(X, y)
# lower accuracy than the random forest without grid search (0.9357)
get_test_accuracy_of(grid.best_estimator_)

The following shows the output:

Out[31]:

0.8893431144571348

We encode the class labels as integers with LabelEncoder:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoded_y = encoder.fit_transform(y)
encoded_y.shape

The output is as follows:

Out[119]:

(1339406,)

encoded_y
Out[121]:

array([3, 3, 3, ..., 3, 3, 3])

We then one-hot encode the integer labels with LabelBinarizer:

from sklearn.preprocessing import LabelBinarizer
binarizer = LabelBinarizer()
binarized_y = binarizer.fit_transform(encoded_y)
binarized_y.shape

We will get the following output:

(1339406, 4)

Now, execute the following code:

binarized_y[:5,]

And the output will be as follows:

array([[0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 1]])

Run the y.head() command:

y.head()

The output is as follows:

0    normal
1    normal
2    normal
3    normal
4    normal
Name: class, dtype: object

Now run the following code:

print(encoder.classes_)
print(binarizer.classes_)

The output can be seen as follows:

['flooding' 'impersonation' 'injection' 'normal']
[0 1 2 3]
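The two encoders can be reversed, which is useful when turning network outputs back into class names. The following sketch uses five hypothetical labels built from the four class names in the dataset; the variable names prefixed with toy_ are made up for this illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Hypothetical labels using the four class names from the dataset.
toy_y = np.array(["normal", "flooding", "normal", "injection", "impersonation"])

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_y)      # classes sorted alphabetically

toy_binarizer = LabelBinarizer()
toy_onehot = toy_binarizer.fit_transform(toy_encoded)

# argmax undoes the one-hot step; inverse_transform undoes the integer step.
recovered = toy_encoder.inverse_transform(toy_onehot.argmax(axis=1))

print(toy_encoded)
print(toy_onehot)
print(recovered)
```

Because LabelEncoder sorts the class names alphabetically, normal maps to 3, which is why the first rows of the training labels are all encoded as 3.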

Import the following packages:

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score  # used below to cross-validate the network

We define a baseline model for the neural network with a single hidden layer of n neurons; we use 10 below. A small number of neurons helps to eliminate redundancies in the data and select the most important features:

def create_baseline_model(n, input_dim):
    # create model
    model = Sequential()
    model.add(Dense(n, input_dim=input_dim, kernel_initializer='normal', activation='relu'))
    model.add(Dense(4, kernel_initializer='normal', activation='sigmoid'))
    # Compile model. We use the logarithmic loss function and the Adam gradient optimizer.
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
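To make the shape of this baseline concrete, the following NumPy sketch runs one forward pass through a network with the same layout: 74 inputs, a hidden ReLU layer of 10 units, and 4 sigmoid outputs. The weights are random placeholders (an assumption for illustration), not the trained Keras weights, and the function name forward is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden, n_classes = 74, 10, 4

# Random placeholder weights; a trained model would learn these values.
W1 = rng.normal(scale=0.05, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.05, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def forward(batch):
    """Dense(10, relu) followed by Dense(4, sigmoid)."""
    hidden = np.maximum(0.0, batch @ W1 + b1)          # ReLU
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))   # sigmoid per output

batch = rng.normal(size=(5, n_features))   # 5 hypothetical scaled rows
scores = forward(batch)
predicted_class = scores.argmax(axis=1)    # index of the highest score per row

print(scores.shape, predicted_class)
```

Each row yields four scores, one per class, and the predicted class is the index of the largest score. Those indices line up with the integer labels produced by LabelEncoder.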

KerasClassifier(build_fn=create_baseline_model, epochs=100, batch_size=5, verbose=0, n=20)

We can see the following output:

<keras.wrappers.scikit_learn.KerasClassifier at 0x149c1c210>

Run the following code:

# use the KerasClassifier

preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", KerasClassifier(build_fn=create_baseline_model, epochs=2, batch_size=128, 
                                   verbose=1, n=10, input_dim=74))
])

cross_val_score(pipeline, X, binarized_y)

The training log for each cross-validation fold can be seen as follows:

Epoch 1/2
892937/892937 [==============================] - 21s 24us/step - loss: 0.1027 - acc: 0.9683
Epoch 2/2
892937/892937 [==============================] - 18s 20us/step - loss: 0.0314 - acc: 0.9910
446469/446469 [==============================] - 4s 10us/step
Epoch 1/2
892937/892937 [==============================] - 24s 27us/step - loss: 0.1089 - acc: 0.9682
Epoch 2/2
892937/892937 [==============================] - 19s 22us/step - loss: 0.0305 - acc: 0.9919
446469/446469 [==============================] - 4s 9us/step
Epoch 1/2
892938/892938 [==============================] - 18s 20us/step - loss: 0.0619 - acc: 0.9815
Epoch 2/2
892938/892938 [==============================] - 17s 20us/step - loss: 0.0153 - acc: 0.9916
446468/446468 [==============================] - 4s 9us/step

The output for the preceding code is as follows:

array([0.97450887, 0.99176875, 0.74421683])
# notice the LARGE variance in scores of a neural network. This is due to the high-variance nature of how networks fit
# using stochastic gradient descent
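The spread between folds can be summarized with the mean and standard deviation of the scores. The sketch below does this for the three scores reported above and then shows how an explicit, shuffled StratifiedKFold can be passed as cv. The second half runs on synthetic stand-in data with a decision tree (both assumptions for illustration); whether row ordering contributes to the spread seen here is not something this section establishes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Fold scores reported above for the two-epoch network.
reported = np.array([0.97450887, 0.99176875, 0.74421683])
print(reported.mean(), reported.std())

# Synthetic stand-in data (assumption) to show an explicit, shuffled splitter.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
toy_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_toy, y_toy, cv=splitter
)
print(toy_scores.mean(), toy_scores.std())
```

Reporting the standard deviation next to the mean makes it clear when a single fold, such as the third one here, is pulling the average down.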

We then fit the pipeline on the full training set:

pipeline.fit(X, binarized_y)

Epoch 1/2
1339406/1339406 [==============================] - 29s 22us/step - loss: 0.0781 - acc: 0.9740
Epoch 2/2
1339406/1339406 [==============================] - 25s 19us/step - loss: 0.0298 - acc: 0.9856

The fitted pipeline is displayed as follows:

Pipeline(memory=None,
steps=[('preprocessing', Pipeline(memory=None,
steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True))])), ('classifier', <keras.wrappers.scikit_learn.KerasClassifier object at 0x149c1c350>)])

Now execute the following code:

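# redefine the helper so that the network's predicted class indices
# are compared with the integer-encoded test labels
encoded_y_test = encoder.transform(y_test)
def get_network_test_accuracy_of(model):
    y_preds = model.predict(X_test)
    return accuracy_score(encoded_y_test, y_preds)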

# not the best accuracy

get_network_test_accuracy_of(pipeline)

389185/389185 [==============================] - 3s 7us/step

The following is the output of the preceding input:

0.889327697624523

By fitting again, we get a different test accuracy. This also highlights the variance of the network:

pipeline.fit(X, binarized_y)
get_network_test_accuracy_of(pipeline)

Epoch 1/2
1339406/1339406 [==============================] - 29s 21us/step - loss: 0.0844 - acc: 0.9735
Epoch 2/2
1339406/1339406 [==============================] - 32s 24us/step - loss: 0.0323 - acc: 0.9853
389185/389185 [==============================] - 4s 11us/step

We get the following test accuracy:

0.8742526048023433

We increase the number of epochs from 2 to 10 so that the network can learn more:

preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", KerasClassifier(build_fn=create_baseline_model, epochs=10, batch_size=128,
                                   verbose=1, n=10, input_dim=74))
])

cross_val_score(pipeline, X, binarized_y)

We get output as follows:

Epoch 1/10
892937/892937 [==============================] - 20s 22us/step - loss: 0.0945 - acc: 0.9744
Epoch 2/10
892937/892937 [==============================] - 17s 19us/step - loss: 0.0349 - acc: 0.9906
Epoch 3/10
892937/892937 [==============================] - 16s 18us/step - loss: 0.0293 - acc: 0.9920
Epoch 4/10
892937/892937 [==============================] - 17s 20us/step - loss: 0.0261 - acc: 0.9932
Epoch 5/10
892937/892937 [==============================] - 18s 20us/step - loss: 0.0231 - acc: 0.9938
Epoch 6/10
892937/892937 [==============================] - 15s 17us/step - loss: 0.0216 - acc: 0.9941
Epoch 7/10
892937/892937 [==============================] - 21s 23us/step - loss: 0.0206 - acc: 0.9944
Epoch 8/10
892937/892937 [==============================] - 17s 20us/step - loss: 0.0199 - acc: 0.9947
Epoch 9/10
892937/892937 [==============================] - 17s 19us/step - loss: 0.0194 - acc: 0.9948
Epoch 10/10
892937/892937 [==============================] - 17s 19us/step - loss: 0.0189 - acc: 0.9950
446469/446469 [==============================] - 4s 10us/step
Epoch 1/10
892937/892937 [==============================] - 19s 21us/step - loss: 0.1160 - acc: 0.9618
...
Out[174]:

array([0.97399595, 0.9939951 , 0.74381591])

We fit the 10-epoch pipeline on the full training set and compute the test accuracy:

pipeline.fit(X, binarized_y)
get_network_test_accuracy_of(pipeline)

Epoch 1/10
1339406/1339406 [==============================] - 30s 22us/step - loss: 0.0812 - acc: 0.9754
Epoch 2/10
1339406/1339406 [==============================] - 27s 20us/step - loss: 0.0280 - acc: 0.9915
Epoch 3/10
1339406/1339406 [==============================] - 28s 21us/step - loss: 0.0226 - acc: 0.9921
Epoch 4/10
1339406/1339406 [==============================] - 27s 20us/step - loss: 0.0193 - acc: 0.9940
Epoch 5/10
1339406/1339406 [==============================] - 28s 21us/step - loss: 0.0169 - acc: 0.9951
Epoch 6/10
1339406/1339406 [==============================] - 34s 25us/step - loss: 0.0155 - acc: 0.9955
Epoch 7/10
1339406/1339406 [==============================] - 38s 28us/step - loss: 0.0148 - acc: 0.9957
Epoch 8/10
1339406/1339406 [==============================] - 34s 25us/step - loss: 0.0143 - acc: 0.9958
Epoch 9/10
1339406/1339406 [==============================] - 29s 21us/step - loss: 0.0139 - acc: 0.9960
Epoch 10/10
1339406/1339406 [==============================] - 28s 21us/step - loss: 0.0134 - acc: 0.9961
389185/389185 [==============================] - 3s 8us/step

The output of the preceding code is as follows:

0.8725027943009109

This took much longer and still didn't increase the accuracy. We change our function to have multiple hidden layers in our network:


def network_builder(hidden_dimensions, input_dim):
    # create model
    model = Sequential()
    model.add(Dense(hidden_dimensions[0], input_dim=input_dim, kernel_initializer='normal', activation='relu'))

    # add multiple hidden layers
    for dimension in hidden_dimensions[1:]:
        model.add(Dense(dimension, kernel_initializer='normal', activation='relu'))

    # output layer: one unit per class
    model.add(Dense(4, kernel_initializer='normal', activation='sigmoid'))

    # Compile model. We use the logarithmic loss function and the Adam gradient optimizer.
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
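Adding hidden layers increases the number of trainable parameters, which is one reason why deeper networks take longer to fit. As a rough sketch, the hypothetical helper dense_parameter_count below counts weights and biases for a stack of fully connected layers, comparing the single 10-neuron baseline with the (60, 30, 10) layout used in this section:

```python
def dense_parameter_count(input_dim, hidden_dimensions, n_outputs=4):
    """Count weights and biases in a stack of fully connected layers."""
    sizes = [input_dim, *hidden_dimensions, n_outputs]
    # Each layer has (inputs * outputs) weights plus one bias per output.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

baseline = dense_parameter_count(74, (10,))
deeper = dense_parameter_count(74, (60, 30, 10))
print(baseline, deeper)  # → 794 6684
```

The deeper layout has roughly eight times as many parameters as the baseline, although both remain small by deep learning standards.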

We add some more hidden layers to learn more:

preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", KerasClassifier(build_fn=network_builder, epochs=10, batch_size=128, 
                                   verbose=1, hidden_dimensions=(60,30,10), input_dim=74))
])

cross_val_score(pipeline, X, binarized_y)

We get the output as follows:

Epoch 1/10
892937/892937 [==============================] - 24s 26us/step - loss: 0.0457 - acc: 0.9860
Epoch 2/10
892937/892937 [==============================] - 21s 24us/step - loss: 0.0113 - acc: 0.9967
Epoch 3/10
892937/892937 [==============================] - 21s 23us/step - loss: 0.0079 - acc: 0.9977
Epoch 4/10
892937/892937 [==============================] - 26s 29us/step - loss: 0.0066 - acc: 0.9982
Epoch 5/10
892937/892937 [==============================] - 24s 27us/step - loss: 0.0061 - acc: 0.9983
Epoch 6/10
892937/892937 [==============================] - 25s 28us/step - loss: 0.0057 - acc: 0.9984
Epoch 7/10
892937/892937 [==============================] - 24s 27us/step - loss: 0.0051 - acc: 0.9985
Epoch 8/10
892937/892937 [==============================] - 24s 27us/step - loss: 0.0050 - acc: 0.9986
Epoch 9/10
892937/892937 [==============================] - 25s 28us/step - loss: 0.0046 - acc: 0.9986
Epoch 10/10
892937/892937 [==============================] - 23s 26us/step - loss: 0.0044 - acc: 0.9987
446469/446469 [==============================] - 6s 12us/step
Epoch 1/10
892937/892937 [==============================] - 27s 30us/step - loss: 0.0538 - acc: 0.9826

Next, we fit the pipeline on the full training set and compute the accuracy on the test set:

pipeline.fit(X, binarized_y)
get_network_test_accuracy_of(pipeline)

We get the epoch output as follows:

Epoch 1/10
1339406/1339406 [==============================] - 31s 23us/step - loss: 0.0422 - acc: 0.9865
Epoch 2/10
1339406/1339406 [==============================] - 28s 21us/step - loss: 0.0095 - acc: 0.9973
Epoch 3/10
1339406/1339406 [==============================] - 29s 22us/step - loss: 0.0068 - acc: 0.9981
Epoch 4/10
1339406/1339406 [==============================] - 28s 21us/step - loss: 0.0056 - acc: 0.9984
Epoch 5/10
1339406/1339406 [==============================] - 29s 21us/step - loss: 0.0051 - acc: 0.9986
Epoch 6/10
1339406/1339406 [==============================] - 28s 21us/step - loss: 0.0047 - acc: 0.9987
Epoch 7/10
1339406/1339406 [==============================] - 30s 22us/step - loss: 0.0041 - acc: 0.9988
Epoch 8/10
1339406/1339406 [==============================] - 29s 22us/step - loss: 0.0039 - acc: 0.9989
Epoch 9/10
1339406/1339406 [==============================] - 29s 22us/step - loss: 0.0039 - acc: 0.9989
Epoch 10/10
1339406/1339406 [==============================] - 28s 21us/step - loss: 0.0036 - acc: 0.9990
389185/389185 [==============================] - 3s 9us/step
...

The output of the preceding code is as follows:

0.8897876331307732

Adding hidden layers gave a small bump: the test accuracy rose from about 0.873 to about 0.890. We now try a deeper network with four hidden layers of 30, 30, 30, and 10 units:


preprocessing = Pipeline([
    ("scale", StandardScaler()),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", KerasClassifier(build_fn=network_builder, epochs=10, batch_size=128,
                                   verbose=1, hidden_dimensions=(30,30,30,10), input_dim=74))
])

cross_val_score(pipeline, X, binarized_y)

The epoch output is as follows:

Epoch 1/10
892937/892937 [==============================] - 25s 28us/step - loss: 0.0671 - acc: 0.9709
Epoch 2/10
892937/892937 [==============================] - 21s 23us/step - loss: 0.0139 - acc: 0.9963
Epoch 3/10
892937/892937 [==============================] - 20s 22us/step - loss: 0.0100 - acc: 0.9973
Epoch 4/10
892937/892937 [==============================] - 25s 28us/step - loss: 0.0087 - acc: 0.9977
Epoch 5/10
892937/892937 [==============================] - 21s 24us/step - loss: 0.0078 - acc: 0.9979
Epoch 6/10
892937/892937 [==============================] - 21s 24us/step - loss: 0.0072 - acc: 0.9981
Epoch 7/10
892937/892937 [==============================] - 24s 27us/step - loss: 0.0069 - acc: 0.9982
Epoch 8/10
892937/892937 [==============================] - 24s 27us/step - loss: 0.0064 - acc: 0.9984
...

The cross-validation scores are as follows:

array([0.97447527, 0.99417877, 0.74292446])
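
The three fold scores differ noticeably, so it can help to summarize them before comparing models. The following is a minimal sketch using only the Python standard library; the scores are copied from the output above, and the variable names are illustrative:

```python
from statistics import mean, pstdev

# Fold scores copied from the cross_val_score output above
fold_scores = [0.97447527, 0.99417877, 0.74292446]

mean_score = mean(fold_scores)                    # average accuracy across folds
spread = max(fold_scores) - min(fold_scores)      # gap between best and worst fold
deviation = pstdev(fold_scores)                   # population standard deviation

print(f"mean={mean_score:.4f} stdev={deviation:.4f} range={spread:.4f}")
```

The mean is roughly 0.90, but the third fold is about 0.25 below the best one, which suggests that the model's performance depends heavily on which part of the data it is evaluated on.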

Next, we fit the pipeline on the full training set and evaluate it on the test set:

pipeline.fit(X, binarized_y)
get_network_test_accuracy_of(pipeline)

We get the epoch output as follows:

Epoch 1/10
1339406/1339406 [==============================] - 48s 36us/step - loss: 0.0666 - acc: 0.9548
Epoch 2/10
1339406/1339406 [==============================] - 108s 81us/step - loss: 0.0346 - acc: 0.9663
Epoch 3/10
1339406/1339406 [==============================] - 78s 59us/step - loss: 0.0261 - acc: 0.9732
Epoch 4/10
1339406/1339406 [==============================] - 102s 76us/step - loss: 0.0075 - acc: 0.9980
Epoch 5/10
1339406/1339406 [==============================] - 71s 53us/step - loss: 0.0066 - acc: 0.9983
Epoch 6/10
1339406/1339406 [==============================] - 111s 83us/step - loss: 0.0059 - acc: 0.9985
Epoch 7/10
1339406/1339406 [==============================] - 98s 73us/step - loss: 0.0055 - acc: 0.9986
Epoch 8/10
1339406/1339406 [==============================] - 93s 70us/step - loss: 0.0052 - acc: 0.9987
Epoch 9/10
1339406/1339406 [==============================] - 88s 66us/step - loss: 0.0051 - acc: 0.9988
Epoch 10/10
1339406/1339406 [==============================] - 87s 65us/step - loss: 0.0049 - acc: 0.9988
389185/389185 [==============================] - 16s 41us/step

Executing the preceding code gives the following output:

0.8899315235684828
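
Note that the accuracy reported during training is much higher than the accuracy on the test set. The following minimal sketch makes that gap explicit for the two multi-layer runs in this section. The numbers are copied from the outputs above, and the helper name accuracy_gap is hypothetical:

```python
def accuracy_gap(train_accuracy, test_accuracy):
    """Difference between training and test accuracy (hypothetical helper)."""
    return train_accuracy - test_accuracy


# (final-epoch training accuracy, test accuracy), copied from the outputs above
runs = {
    (60, 30, 10): (0.9990, 0.8897876331307732),
    (30, 30, 30, 10): (0.9988, 0.8899315235684828),
}

gaps = {dims: accuracy_gap(train, test) for dims, (train, test) in runs.items()}
for dims, gap in gaps.items():
    print(f"hidden_dimensions={dims}: gap={gap:.4f}")
```

Both runs show a gap of roughly 0.11. A gap of this size is commonly read as a sign that the network fits the training data better than it generalizes, although the cause cannot be determined from these numbers alone.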

The best result so far comes from using deep learning. However, deep learning isn't the best choice for all datasets.