- We are not going to perform feature engineering in the first instance. The dataset has been reduced to 30 features (28 anonymized components + time + amount).
- We compare what happens when we use resampling and when we do not, testing both approaches with a simple logistic regression classifier (a minimal sketch of the evaluation step follows this list).
- We evaluate the models using some of the performance metrics mentioned previously.
- We repeat the better of the resampling/no-resampling approaches, this time tuning the parameters of the logistic regression classifier.
- We build classification models using other classification algorithms.
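To make the plan concrete, here is a minimal sketch of the evaluation step (the split variables here are hypothetical; the actual splits are built later in this section). We score with recall rather than plain accuracy, since recall measures how many fraud cases we actually catch:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical X_train/X_test/y_train/y_test: any train/test split of the data
lr = LogisticRegression()
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test)

# Recall = TP / (TP + FN): the share of fraud cases actually caught
print("Recall: ", recall_score(y_test, y_pred))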
Setting our input and target variables + resampling:
- Normalize the Amount column.
- The Amount column is not on the same scale as the anonymized features, so we standardize it:
from sklearn.preprocessing import StandardScaler

data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()
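As a quick sanity check (a small sketch, assuming data is the DataFrame loaded earlier), the new normAmount column should now have roughly zero mean and unit standard deviation:

# Standardized column: mean ~ 0, std ~ 1
print(data['normAmount'].mean(), data['normAmount'].std())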
As we mentioned earlier, there are several ways to resample skewed data. Apart from under-sampling and oversampling, there is a very popular approach called SMOTE (Synthetic Minority Over-sampling Technique), which combines oversampling and under-sampling; the oversampling is done not by replicating minority-class records but by algorithmically constructing new minority-class instances.
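Although we do not use it here, SMOTE is straightforward to apply with the imbalanced-learn package (a minimal sketch, assuming imbalanced-learn is installed and the X and y defined just below; fit_resample is the current API, older versions named it fit_sample):

from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new fraud records by interpolating between a minority
# sample and its nearest minority-class neighbours, rather than duplicating rows
sm = SMOTE(random_state=0)
X_resampled, y_resampled = sm.fit_resample(X, y.values.ravel())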
In this notebook, we will use traditional under-sampling.
The way we will under-sample the dataset is by creating a 50:50 ratio: we randomly select x samples from the majority class, where x is the total number of records in the minority class:
X = data.iloc[:, data.columns != 'Class']
y = data.iloc[:, data.columns == 'Class']
We count the number of data points in the minority class:
import numpy as np

number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
We pick the indices of the normal class:
normal_indices = data[data.Class == 0].index
Out of the indices we picked, we randomly select number_records_fraud of them:
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)
We concatenate the two sets of indices:
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
Using these indices, we build the under-sampled dataset:
under_sample_data = data.iloc[under_sample_indices,:]
X_undersample = under_sample_data.iloc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.iloc[:, under_sample_data.columns == 'Class']
We display the resulting class ratio:
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/float(len(under_sample_data)))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/float(len(under_sample_data)))
print("Total number of transactions in resampled data: ", len(under_sample_data))
The output of the preceding code is as follows:
Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resampled data:  984
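As an aside, the same 50:50 under-sample can be built more compactly with pandas' group-wise sampling (a sketch, assuming pandas 1.1 or newer; the minority class simply contributes all of its rows):

# Sample number_records_fraud rows from each class
under_sample_alt = data.groupby('Class').sample(n=number_records_fraud, random_state=0)
print(under_sample_alt.Class.value_counts())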
We now split the data into train and test sets; cross-validation will be used when calculating accuracies (a cross-validation sketch follows the output below):
from sklearn.model_selection import train_test_split
# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))
# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))
The following output shows the resulting splits:
Number transactions train dataset:  199364
Number transactions test dataset:  85443
Total number of transactions:  284807

Number transactions train dataset:  688
Number transactions test dataset:  296
Total number of transactions:  984
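The cross-validation mentioned above can be sketched with scikit-learn's cross_val_score (a minimal sketch; plain LogisticRegression stands in for the tuned classifier we build later):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated recall on the under-sampled training set
lr = LogisticRegression()
scores = cross_val_score(lr, X_train_undersample, y_train_undersample.values.ravel(), cv=5, scoring='recall')
print("Mean recall across folds: ", scores.mean())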