Credit Card Fraud Detection: How to handle an imbalanced dataset

Leonie M Windari
Analytics Vidhya
Published in
7 min read · Feb 5, 2021


This post focuses on the step-by-step project and its results; you can view my code on my GitHub.

tags: machine learning (logistic regression), python, jupyter notebook, imbalanced dataset (random undersampling, SMOTE)

Introduction

Credit card fraud is an inclusive term for fraud committed using a payment card, such as a credit card or debit card. The purpose may be to obtain goods or services or to make payment to another account that is controlled by a criminal. (Wikipedia)

This is a major problem for both the victim and the credit card company, as it inflicts a financial loss on both parties. It is therefore important for a credit card company to detect which transactions are fraudulent and which are not.

With machine learning, we can detect fraudulent transactions from historical data and decline them, so that neither the company nor the individual suffers a loss.


Purpose

There are many ways to handle an imbalanced dataset. In this project we will compare two techniques, Random Under-Sampling and SMOTE, to see which one fits this imbalanced dataset best.

To detect fraud we can use machine learning, and there are many machine learning algorithms to choose from. In this project we will also see which model fits best out of Logistic Regression, K-Nearest Neighbor, Support Vector Machine, and Random Forest.

Data

The dataset contains transactions made with credit cards by European cardholders in September 2013; it presents transactions that occurred over two days. It contains only numerical input variables, which are the result of a PCA transformation. The only features that have not been transformed with PCA are Time and Amount.

The Time feature contains the seconds elapsed between each transaction and the first transaction in the dataset, while Amount is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. The Class feature is the response variable and takes value 1 in case of fraud and 0 otherwise.

Data Source: Kaggle
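As a quick sketch of what loading and inspecting the data looks like, here is a pandas snippet. The tiny inline CSV below only stands in for the real Kaggle file (which has the same column layout: Time, the PCA components V1..V28, Amount, and Class) so that the snippet runs anywhere; with the real data you would read `creditcard.csv` instead.

```python
import io
import pandas as pd

# Stand-in for the real Kaggle file: same column layout, just two toy rows.
csv = io.StringIO(
    "Time,V1,V2,Amount,Class\n"
    "0,-1.36,0.07,149.62,0\n"
    "406,-2.31,1.95,0.0,1\n"
)
df = pd.read_csv(csv)  # with the real data: pd.read_csv("creditcard.csv")

print(df.shape)                    # (rows, columns)
print(df["Class"].value_counts())  # 1 = fraud, 0 = legitimate
```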


Methods

The machine learning method used is classification: the purpose of this project is to classify each transaction as Fraud or Non-Fraud, and to compare which classification algorithm fits this dataset best.

The classification algorithms to compare are Logistic Regression, K-Nearest Neighbor, Support Vector Machine, and Decision Tree.

Analysis

The credit card transaction dataset is imbalanced, as we can see from Figure 1.

Figure 1. Class distribution shows an imbalanced dataset

When we’re dealing with an imbalanced dataset, we can’t simply feed the raw data into a machine learning model. It could cause a bias toward the majority class, which leads to a poor model.

So, we have to handle the imbalanced dataset first.
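To see why the imbalance is a problem, here is a minimal sketch (with toy labels, assuming roughly the same fraud rate as the real data) of how a model that always predicts "non-fraud" still scores near-perfect accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy labels with roughly the real data's imbalance: 10,000 transactions,
# of which only 17 are fraud (Class 1).
y = np.zeros(10_000, dtype=int)
y[rng.choice(10_000, size=17, replace=False)] = 1

# A "model" that always predicts the majority class (non-fraud)...
y_pred = np.zeros_like(y)

accuracy = (y_pred == y).mean()
recall_fraud = (y_pred[y == 1] == 1).mean()
print(accuracy)      # ~0.998 -- looks great
print(recall_fraud)  # 0.0 -- but it catches no fraud at all
```

This is exactly the bias toward the majority class described above: accuracy rewards the model for ignoring the minority class entirely.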

There are many techniques for handling an imbalanced dataset, such as Random Under-Sampling and Random Over-Sampling. In this project, I will use Random Under-Sampling and SMOTE, and compare which one fits this dataset best.

SMOTE is a technique that over-samples the minority class by generating synthetic samples in the gaps between existing minority values; it can be combined with under-sampling of the majority class so the two classes meet in the middle.

Figure 2. Class distribution after performing SMOTE

We can see from Figure 2 that the class distribution becomes equal after performing SMOTE.
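In the project itself SMOTE comes from the imblearn library, but the core idea can be sketched in plain NumPy (this is an illustration of the interpolation step, not imblearn's implementation): for each synthetic sample, pick a minority point, pick one of its k nearest minority neighbours, and place a new point somewhere on the segment between them.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Minimal sketch of the SMOTE idea: interpolate synthetic minority
    samples between existing minority points and their nearest neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from point i to every other minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                   # random position on the segment
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
X_syn = smote_sketch(X_min, n_new=4)
print(X_syn.shape)  # (4, 2) -- synthetic points inside the minority region
```

Because each synthetic point is a convex combination of two real minority points, the new samples stay inside the region the minority class already occupies.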

This dataset doesn’t have missing values, so we don’t need to handle them.

By performing this technique, we can also see how different the correlations between features look before and after SMOTE.

Figure 3. Feature Correlation

We can see that before performing SMOTE, the correlations between the features are hard to make out, but once the data is balanced the correlations become clear.

Figure 4. Boxplot of Each feature categorize by Class

Here we can see that for some of the features there is a clear separation between the classes. We can also see that there are a lot of outliers, so we will remove the extreme outliers from the features that have a high correlation with the class.

From the boxplot, we can see which features have a negative or positive correlation with the class:

Negative Correlation: Time, V2, V6, V8, V9, V11, V13, V15, V16, V17

Positive Correlation: V1, V3, V10, V18

The features that have a high correlation with the class are V2, V3, V8, V10, V11, V13, V15, V16, V17, and V18, so we will remove outliers from these features.

Here is a comparison of one of the features before and after removing the outliers.

Figure 5. Before and after removing outliers

The threshold used for removing the extreme outliers is 1.5 times the IQR. This cut-off value determines the range, which runs from Q25 minus the cut-off value (the lower bound) up to Q75 plus the cut-off value (the upper bound).

Values outside that range are removed.
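The IQR rule just described can be written in a few lines of NumPy (a sketch of the rule, with made-up sample values):

```python
import numpy as np

def remove_extreme_outliers(values, factor=1.5):
    """IQR rule: keep only values inside
    [Q25 - factor*IQR, Q75 + factor*IQR]."""
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = factor * (q75 - q25)
    lower, upper = q25 - cut_off, q75 + cut_off
    return values[(values >= lower) & (values <= upper)]

v = np.array([-50.0, 1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 60.0])
kept = remove_extreme_outliers(v)
print(kept)  # the extreme -50 and 60 are gone
```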

After removing outliers, we can see a slight decrease in the number of fraud cases (Class 1). Although the data now looks slightly imbalanced, the difference is not big enough to consider the dataset imbalanced.

Usually you would consider data imbalanced if the ratio is around 8:2 or worse.

Figure 6. Class distribution after removing outliers

Lastly, we will do Random Under-Sampling, which we can do with the imblearn library. We will then compare all 4 of the resulting datasets, which are:

  1. The raw/original dataset
  2. The dataset after SMOTE
  3. The dataset after SMOTE and removing extreme outliers
  4. The dataset after Random Under Sampling
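In the project itself this step uses imblearn's `RandomUnderSampler`; as a sketch of what that sampler does (this is an illustration in plain NumPy, not imblearn's code), random under-sampling just drops randomly chosen majority-class rows until both classes have the same count:

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Sketch of random under-sampling: randomly keep only as many rows
    of each class as the smallest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 8 majority vs 2 minority
X_rus, y_rus = random_undersample(X, y)
print(np.bincount(y_rus))  # [2 2] -- balanced after under-sampling
```

The downside, which shows up in the results later, is that the dropped majority rows are gone for good, so information is lost.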

If you are dealing with an imbalanced dataset, accuracy is not a good metric: it will report a high value simply because the model succeeds at predicting the majority class (this is the bias mentioned earlier). That is why I am using precision and recall to decide which model performs best.

Precision and recall each focus on the positive class without being inflated by the majority class, so we don’t have to worry about misleading results even though our data is imbalanced.

I will fit Logistic Regression to the 4 datasets above and then compare them.
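A minimal sketch of this evaluation step with scikit-learn (using a synthetic imbalanced problem from `make_classification` as a stand-in for the transaction data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Toy imbalanced problem standing in for the transactions (~5% "fraud").
X, y = make_classification(n_samples=4_000, n_features=10, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Precision: of the transactions flagged as fraud, how many really were.
# Recall: of the real frauds, how many we caught.
print(precision_score(y_te, y_pred))
print(recall_score(y_te, y_pred))
```

The same fit-and-score loop would be run once per dataset (raw, SMOTE, SMOTE without extreme outliers, under-sampled) to produce the comparison below.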

Results

Figure 7. Precision-Recall Curve

Here we can see that the dataset that went through SMOTE gives the best precision-recall curve. We can also see that removing outliers can discard useful information; the same goes for Random Under-Sampling, where we can lose a lot of information that would help us classify the classes. Still, both of those datasets have higher precision-recall than the raw dataset that hasn’t been handled at all.

Figure 8. ROC Curve

We can also see from the ROC curve that using SMOTE gives the best results. The ROC curve helps you decide which method is better than the others.
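The whole ROC curve can also be summarised in a single number, the ROC AUC, which makes the comparison between datasets easy to read off. A sketch with scikit-learn (again on a synthetic stand-in problem):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=10, weights=[0.9],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # fraud probability per transaction

# ROC AUC: 0.5 means random guessing, 1.0 means frauds are perfectly
# ranked above non-frauds across all thresholds.
print(roc_auc_score(y_te, proba))
```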

Conclusion

  1. The best technique to handle imbalanced dataset for this Credit Card Fraud Detection is SMOTE (Synthetic Minority Oversampling Technique).

P.S

I am actually working on comparing which machine learning algorithm fits best, but the result is kind of weird, so I decided not to put it here and to try it again. There is some theory that I haven’t completely grasped yet, and I am working on it.

I will probably add it in the next few days and use GridSearchCV (which I haven’t fully understood yet). I could also analyze which features have a high correlation with the class automatically (here I did it manually, though technically I could use a scatter plot), or which features best help us differentiate the classes.

Please send me your comments and advice if you notice something wrong. It would help me a lot. I am a newbie at this, so please go easy on me!

Comment:

This is not a big project, but I learned a lot through it: it helped me understand the metrics we can use to evaluate our models and the different methods we can use to handle an imbalanced dataset.
