Credit Card Fraud Detection
Intro:
Almost everyone in developed countries uses credit cards. This has made it harder for people with bad intentions to steal cash. However, credit card fraud has become another popular form of theft. Machine learning is used as a guard to protect people against it. The updated presentation slides can be reached here or the PDF version can be downloaded here.
Brief description:
Credit card fraud is a rising issue. According to the www.ussc.gov website, there were 472 credit card fraud offenders in the USA in 2015. "6.8% of credit card fraud offenses involved loss amounts greater than $1 million" and "8.0% of these offenses involved 250 or more victims." According to www.idexbiometrics.com, credit card fraud losses are expected to exceed $35 billion worldwide by 2020.
So credit card fraud is a real issue, and machine learning is a great defender against it. Unfortunately, the data available for public use is very limited. One reason is to protect user information; the other is to keep the information used to train these models away from people with bad intentions.
The Data:
The data collected for this work is from Kaggle. The dataset covers two days of transactions made by European cardholders in September 2013. Some quick facts about the data:
- The dataset is highly skewed: it contains 492 frauds in a total of 284,807 observations, so only 0.172% of the cases are fraud. The skew reflects the low number of fraudulent transactions.
- The dataset consists of numerical values from 28 ‘Principal Component Analysis (PCA)’ transformed features, namely V1 to V28. No metadata about the original features is provided, so no pre-analysis or feature study could be done.
- The ‘Time’ and ‘Amount’ features are not transformed.
- There are no missing values in the dataset.
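To make the skew concrete, the short calculation below uses only the two counts quoted in the list. It also computes the accuracy of a trivial model that labels every transaction as normal. That baseline is an illustration added here, not a model from the original work, but it suggests why plain accuracy is a weak yardstick on data this imbalanced.

```python
# Counts quoted in the dataset facts above.
n_total = 284_807
n_fraud = 492

# Share of transactions that are fraudulent.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial "always normal" model (illustrative baseline,
# not one of the predictors used in this project).
baseline_accuracy = (n_total - n_fraud) / n_total

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```

A model that never flags fraud would still be right more than 99.8% of the time, so false negatives and false positives are more informative than overall accuracy.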
Method:
The diagram shows the workflow of the fraud detector. The data is split into test data and train data. The train data is used to train different predictors, and the predictors are then asked to predict the response for the held-out test data. The best model is the one whose guesses come closest to the actual test response. At this point an important question comes to the surface: how reliable is the collected data?
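The split step can be sketched as below, using only the Python standard library. The function `stratified_split`, the 15% test fraction, and the toy labels are assumptions made for illustration. The original code may have split the data differently, and in practice scikit-learn's `train_test_split` with its `stratify` argument does the same job. Stratifying keeps the rare fraud class present in both parts.

```python
import random


def stratified_split(labels, test_fraction=0.15, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels: 1 = fraud, 0 = normal.
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print("frauds in test set:", sum(labels[i] for i in test_idx))
```

Each predictor would then be fitted on the rows in `train_idx` and scored on the rows in `test_idx`.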
Why would you need ML (Machine Learning)?
Looking at the different features with the naked eye, the centers and standard deviations of the fraud data and the normal data are very close; there might be a slight shift between them. Looking at the heat map (the relation between different features in the fraud data), no direct relationship could be detected. While predicting future fraud transactions
How do you implement the ML?
The program used for this work is written in Python; the entire Python code is attached. It is worth mentioning that scikit-learn is an open-source machine learning library with different tools for model fitting and data preprocessing. Each of its predictors uses a different logic to train a model that predicts the next value. In general there are three types of predictor logic: classification, regression, and clustering. As of now, each of these categories contains tens of different predictors. For this project only logistic regression, random forest, K-neighbors, and AdaBoost from the scikit-learn library were used. Each of these models yields a different prediction. This is due to two important facts: first, each estimator has a different statistical calculation method, and second, each of them relies on different features of the data to make a prediction.
Looking at the two pictures to the right, the top one shows feature importance for the random forest model, while the lower one shows feature importance for AdaBoost. None of the features in the AdaBoost model had an importance above 8%, whereas the random forest predictions depended mostly on V12, V14, and V17.
The prediction weight of feature V17 was 20% for random forest, while its importance for AdaBoost was less than 4%.
This example makes it clear why relying on a single model might not be the best option for prediction. Simply put, your model might fail drastically because it based its guess on the wrong feature. Over-fitting is a very common problem when training a model on your data.
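One way to avoid leaning on a single model is to combine the votes of several predictors. The sketch below is a minimal illustration and is not taken from the original code: the helper `combine_votes`, the rule names, and the vote values are all hypothetical. It contrasts three rules: flag a transaction if any predictor says fraud, if a majority does, or only if all do.

```python
def combine_votes(votes, rule="majority"):
    """Combine 0/1 fraud votes from several predictors into one decision.

    Hypothetical helper for illustration only.
    """
    n_fraud_votes = sum(votes)
    if rule == "any":
        return int(n_fraud_votes >= 1)
    if rule == "all":
        return int(n_fraud_votes == len(votes))
    if rule == "majority":
        return int(n_fraud_votes > len(votes) / 2)
    raise ValueError(f"unknown rule: {rule!r}")


# Made-up votes for three transactions, in the order
# (logistic regression, random forest, K-neighbors, AdaBoost).
votes_per_case = {
    "case_a": [1, 1, 1, 1],
    "case_b": [0, 1, 0, 0],
    "case_c": [0, 0, 0, 0],
}

decisions = {
    case: {rule: combine_votes(votes, rule) for rule in ("any", "majority", "all")}
    for case, votes in votes_per_case.items()
}

for case, result in decisions.items():
    print(case, result)
```

The "any" rule catches more fraud at the cost of more false alarms, while the "all" rule does the opposite; which one fits depends on how costly each kind of error is.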
The table to the left (found here) is for a 15% test split. Another interesting table (here) lists the cases where at least one predictor predicts the transaction to be fraud. While looking into the data, we picked two extreme cases.
While all predictors predict case 28151 to be fraud, it was not reported as such (probably the victim was not even aware of it). On the other hand, case 29424 was reported as fraud, but all predictors predict it to be a normal transaction (it is also possible that it was mistakenly reported). This brings us back to the question asked earlier: how reliable is my data? Will a model that predicts all the cases correctly perform best on future data? Below is a list of predictors. You are welcome to change the parameters and create your own predictor.
By changing the threshold value, one can trade off false negatives against false positives.
By changing the depth of the random forest, one can trade off false negatives against false positives.
By changing the number of neighbors (K), one can trade off false negatives against false positives.
By changing the number of estimators, one can trade off false negatives against false positives.
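The threshold trade-off mentioned above can be illustrated with a small, self-contained sketch. The scores and labels are made up for the example, and `confusion_counts` is a hypothetical helper; a real predictor would supply the scores, for instance fraud probabilities from a fitted classifier. Raising the threshold flags fewer transactions, which lowers false positives but raises false negatives.

```python
def confusion_counts(scores, labels, threshold):
    """Count (false positives, false negatives) at a given threshold.

    A transaction is flagged as fraud when its score is >= threshold.
    Hypothetical helper for illustration only.
    """
    false_pos = false_neg = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            false_pos += 1
        elif predicted == 0 and label == 1:
            false_neg += 1
    return false_pos, false_neg


# Made-up fraud scores and true labels (1 = fraud, 0 = normal).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

results = {t: confusion_counts(scores, labels, t) for t in (0.3, 0.5, 0.7)}

for threshold, (fp, fn) in results.items():
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

The other parameters listed above shift the same balance less directly, by changing how the model itself is fitted.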