In this article, I will take you through a case study focused on credit card fraud detection. It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase. Our main task is to identify fraudulent credit card transactions using machine learning. We are going to use a Python library called PyOD, which is specifically developed for anomaly detection.
PyOD is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data. It includes around 20 outlier detection algorithms, both supervised and unsupervised. PyOD exposes a consistent API across all of these techniques; you can take a look at the official documentation of PyOD for the full list.
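To give a flavor of that consistent interface, here is a minimal, self-contained sketch (on random toy data, not the credit card dataset) of the pattern every PyOD detector follows:
import numpy as np
from pyod.models.knn import KNN
# toy data for illustration only: 200 train rows, 50 test rows, 5 features
rng = np.random.RandomState(42)
X_train_demo = rng.randn(200, 5)
X_test_demo = rng.randn(50, 5)
detector = KNN()                            # any PyOD detector works the same way
detector.fit(X_train_demo)                  # unsupervised: fit on the features only
train_labels = detector.labels_             # binary labels (0: inlier, 1: outlier)
train_scores = detector.decision_scores_    # outlier scores of the training data
test_scores = detector.decision_function(X_test_demo)  # scores for unseen data
test_labels = detector.predict(X_test_demo)             # binary predictions for unseen data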
If you are an anomaly detection professional, or you just want to learn more about anomaly detection, I recommend trying the PyOD toolkit.

Installing PyOD in Python
Let’s first install PyOD on our machines.

pip install pyod # normal install
pip install --pre pyod # pre-release version for new features
git clone https://github.com/yzhao062/pyod.git
cd pyod
pip install .
If you plan to use the neural network-based models in PyOD, you have to install Keras and the other required libraries manually on your machine.
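For instance, a deep-learning detector such as the Keras-based AutoEncoder lives in the same models package and follows the same fit/score interface. The sketch below is illustrative only (the layer sizes and epochs are placeholder values, not tuned), and it will only run once Keras/TensorFlow are installed:
# requires Keras/TensorFlow, which pyod does not install for you
from pyod.models.auto_encoder import AutoEncoder
# illustrative settings only; hidden_neurons and epochs are placeholder values
clf_ae = AutoEncoder(hidden_neurons=[16, 8, 8, 16], epochs=20, contamination=0.05)
# clf_ae.fit(X_train)  # used exactly like any other PyOD detector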
# Import important packages
import pandas as pd
import numpy as np
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score,confusion_matrix
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
# Import outlier detection models from PyOD
from pyod.models.knn import KNN
from pyod.models.ocsvm import OCSVM
# Import utility functions for model evaluation and visualization
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize
from sklearn.preprocessing import StandardScaler
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# set seed
np.random.seed(123)
# Load the dataset from csv file by using pandas
data = pd.read_csv("creditcard.csv")
# show columns
data.columns
# print the shape of the data
data.shape
(284807, 31)
# show the first five rows
data.head()
#check missing data
data.isnull().sum()
Our target column, Class, contains two classes: fraud, labeled as 1, and not fraud, labeled as 0.
# determine number of fraud cases in our file
data.Class.value_counts(normalize=True)
# find the correlation between the variables
corr = data.corr()
fig = plt.figure(figsize=(30,20))
sns.heatmap(corr, vmax=.8, square=True,annot=True)
The correlation heatmap above shows that the V11 variable has a strong positive correlation with the Class variable, while the V17 variable has a strong negative correlation with it.
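With 30 variables the heatmap gets crowded; a quicker way to read off the same information is to sort the correlations against Class directly, using the corr frame computed above:
# sort all features by their correlation with the Class column
class_corr = corr['Class'].drop('Class').sort_values()
print(class_corr.head())   # strongest negative correlations (e.g. V17)
print(class_corr.tail())   # strongest positive correlations (e.g. V11)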
Because we have many more valid transactions than fraudulent ones, we will use a sample of 10,000 valid cases together with all 492 fraud cases to build our models.

# use a sample of the dataset
positive = data[data["Class"]== 1]
negative = data[data["Class"]== 0]
print("positive:{}".format(len(positive)))
print("negative:{}".format(len(negative)))
new_data = pd.concat([positive,negative[:10000]])
#shuffling our dataset
new_data = new_data.sample(frac=1,random_state=42)
new_data.shape
positive: 492
negative: 284315
(10492, 31)

Now we have a total of 10,492 rows. We will standardize the Amount variable using StandardScaler from sklearn, which transforms a feature so that it has a mean of 0 and a standard deviation of 1.

# Standardizing the Amount column
new_data['Amount'] = StandardScaler().fit_transform(new_data['Amount'].values.reshape(-1,1))
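As an optional sanity check, the rescaled column should now have a mean of roughly 0 and a standard deviation of roughly 1:
# verify the standardized Amount column
print(new_data['Amount'].mean(), new_data['Amount'].std())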
NB: we are not going to use the Time variable in this article.
# split into independent variables and target variable
X = new_data.drop(['Time','Class'], axis=1)
y = new_data['Class']
# show the shape of x and y
print("X shape: {}".format(X.shape))
print("y shape: {}".format(y.shape))
X shape: (10492, 29)
y shape: (10492,)
#split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# create the KNN model
clf_knn = KNN(contamination=0.047, n_neighbors=5, n_jobs=-1)
clf_knn.fit(X_train)
contamination: the proportion of outliers in the data, which for our sample is 492 / 10,492 ≈ 0.047.
n_neighbors: the number of neighbors to consider when measuring proximity.
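Rather than hard-coding the contamination rate, you can compute it straight from the labels we built above:
# fraction of fraud cases in our sample; usable as the contamination parameter
print(round(y.mean(), 3))   # ~0.047 for 492 fraud cases in 10,492 rows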
After training our KNN detector, we can get the prediction labels on the training data and then the outlier scores of the training data. The higher the score, the more abnormal the observation. Having both binary labels and raw scores available on a fitted model makes PyOD a convenient utility for anomaly detection tasks.
# Get the prediction labels of the training data
y_train_pred = clf_knn.labels_ # binary labels (0: inliers, 1: outliers)
# Outlier scores
y_train_scores = clf_knn.decision_scores_
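For clarity, the binary labels are simply the outlier scores thresholded at the fitted model's threshold_ attribute, which you can verify yourself:
# labels_ is equivalent to thresholding decision_scores_ at threshold_
manual_labels = (y_train_scores > clf_knn.threshold_).astype(int)
print((manual_labels == y_train_pred).all())   # expected: True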
We can evaluate KNN() with respect to the training data. PyOD provides a handy function for this task called evaluate_print(). The default metrics include ROC and Precision @ n. We pass the detector name, the y_train values, and y_train_scores (the outlier scores returned by the fitted model).
# Evaluate on the training data
evaluate_print('KNN', y_train, y_train_scores)
We can see that the KNN() model performs well on the training data. Let’s plot the confusion matrix for the train set.
import scikitplot as skplt
# plot the confusion matrix for the train set
skplt.metrics.plot_confusion_matrix(y_train, y_train_pred, normalize=False, title="Confusion Matrix on Train Set")
plt.show()
y_test_scores = clf_knn.decision_function(X_test) # outlier scores
# Evaluate on the test data
evaluate_print('KNN', y_test,y_test_scores)
Our KNN() model continues to perform well on the test set. Let’s plot the confusion matrix for the test set.
# plot the confusion matrix for the test set
y_preds = clf_knn.predict(X_test)
skplt.metrics.plot_confusion_matrix(y_test, y_preds, normalize=False, title="Confusion Matrix on Test Set")
plt.show()
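Since classification_report was already imported from sklearn, we can also print per-class precision and recall for these test predictions:
# per-class precision, recall and F1 on the test set
print(classification_report(y_test, y_preds))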
# create the OCSVM model
clf_ocsvm = OCSVM(contamination=0.047)
clf_ocsvm.fit(X_train)
# Get the prediction labels of the training data
y_train_pred = clf_ocsvm.labels_ # binary labels (0: inliers, 1: outliers)
clf_name ='OCSVM'
# Outlier scores
y_train_scores = clf_ocsvm.decision_scores_
# Evaluate on the training data
evaluate_print(clf_name, y_train, y_train_scores)
# plot the confusion matrix for the train set
skplt.metrics.plot_confusion_matrix(y_train, y_train_pred, normalize=False, title="Confusion Matrix on Train Set")
plt.show()
y_test_scores = clf_ocsvm.decision_function(X_test) # outlier scores
# Evaluate on the test data
evaluate_print(clf_name, y_test,y_test_scores)
# plot the confusion matrix for the test set
y_preds = clf_ocsvm.predict(X_test)
skplt.metrics.plot_confusion_matrix(y_test, y_preds, normalize=False, title="Confusion Matrix on Test Set")
plt.show()
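As a final side-by-side check, you can score both detectors on the same held-out test set with sklearn's roc_auc_score (a sketch using the objects defined above):
from sklearn.metrics import roc_auc_score
# compare both detectors on the held-out test set
print("KNN ROC-AUC:  ", roc_auc_score(y_test, clf_knn.decision_function(X_test)))
print("OCSVM ROC-AUC:", roc_auc_score(y_test, clf_ocsvm.decision_function(X_test)))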