- We are not going to perform feature engineering in the first instance. The dataset has been reduced to 30 features (28 anonymized components + time + amount).
- We compare what happens when we use resampling and when we do not, testing both approaches with a simple logistic regression classifier (a minimal sketch of the evaluation step follows this list).
- We evaluate the models using some of the performance metrics mentioned previously.
- We repeat the better of the resampling/no-resampling approaches, this time tuning the parameters of the logistic regression classifier.
- We build classification models using other classification algorithms.
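To make the plan concrete, here is a minimal sketch of the evaluation step (the split variables here are hypothetical; the actual splits are built later in this section). We score with recall rather than plain accuracy, since recall measures how many fraud cases we actually catch:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical X_train/X_test/y_train/y_test: any train/test split of the data
lr = LogisticRegression()
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test)

# Recall = TP / (TP + FN): the share of fraud cases actually caught
print("Recall: ", recall_score(y_test, y_pred))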
Setting our input and target variables + resampling:
- Normalize the Amount column.
- The Amount column is not on the same scale as the anonymized features, so we standardize it:
from sklearn.preprocessing import StandardScaler

data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()
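As a quick sanity check (a small sketch, assuming data is the DataFrame loaded earlier), the new normAmount column should now have roughly zero mean and unit standard deviation:

# Standardized column: mean ~ 0, std ~ 1
print(data['normAmount'].mean(), data['normAmount'].std())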
As we mentioned earlier, there are several ways to resample skewed data. Apart from under-sampling and oversampling, there is a very popular approach called SMOTE (Synthetic Minority Over-sampling Technique), which combines oversampling and under-sampling; the oversampling is done not by replicating minority-class records but by algorithmically constructing new minority-class instances.
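Although we do not use it here, SMOTE is straightforward to apply with the imbalanced-learn package (a minimal sketch, assuming imbalanced-learn is installed and the X and y defined just below; fit_resample is the current API, older versions named it fit_sample):

from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new fraud records by interpolating between a minority
# sample and its nearest minority-class neighbours, rather than duplicating rows
sm = SMOTE(random_state=0)
X_resampled, y_resampled = sm.fit_resample(X, y.values.ravel())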
In this notebook, we will use traditional under-sampling.
The way we will under-sample the dataset is by creating a 50:50 ratio: we randomly select x samples from the majority class, where x is the total number of records in the minority class:
X = data.iloc[:, data.columns != 'Class']
y = data.iloc[:, data.columns == 'Class']
We count the number of data points in the minority class:
import numpy as np

number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
We pick the indices of the normal class:
normal_indices = data[data.Class == 0].index
Out of the indices we picked, we randomly select number_records_fraud of them:
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)
We concatenate the two sets of indices:
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
Using these indices, we build the under-sampled dataset:
under_sample_data = data.iloc[under_sample_indices,:]
X_undersample = under_sample_data.iloc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.iloc[:, under_sample_data.columns == 'Class']
We display the resulting class ratio:
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/float(len(under_sample_data)))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/float(len(under_sample_data)))
print("Total number of transactions in resampled data: ", len(under_sample_data))
The output of the preceding code is as follows:
Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resampled data:  984
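As an aside, the same 50:50 under-sample can be built more compactly with pandas' group-wise sampling (a sketch, assuming pandas 1.1 or newer; the minority class simply contributes all of its rows):

# Sample number_records_fraud rows from each class
under_sample_alt = data.groupby('Class').sample(n=number_records_fraud, random_state=0)
print(under_sample_alt.Class.value_counts())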
We now split the data into train and test sets; cross-validation will be used when calculating accuracies (a cross-validation sketch follows the output below):
from sklearn.model_selection import train_test_split
# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))
# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))
The following output shows the resulting splits:
Number transactions train dataset:  199364
Number transactions test dataset:  85443
Total number of transactions:  284807

Number transactions train dataset:  688
Number transactions test dataset:  296
Total number of transactions:  984
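The cross-validation mentioned above can be sketched with scikit-learn's cross_val_score (a minimal sketch; plain LogisticRegression stands in for the tuned classifier we build later):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated recall on the under-sampled training set
lr = LogisticRegression()
scores = cross_val_score(lr, X_train_undersample, y_train_undersample.values.ravel(), cv=5, scoring='recall')
print("Mean recall across folds: ", scores.mean())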