An important issue confronting retailers and other businesses today is the prevalence of credit card fraud. This issue recently hit home: my son was a victim a week before I wrote this.
We can apply machine learning to help detect credit card fraud, but there is a complication: the vast majority of transactions are perfectly legitimate, which reduces a typical model’s sensitivity to fraud.
As an example, consider a logistic regression model run against the Credit Card Fraud dataset posted on Kaggle. You can download it here:
https://www.kaggle.com/mlg-ulb/creditcardfraud
To follow along, you will need an installation of Python with the following packages: NumPy, Pandas, and scikit-learn.
You can get all of those packages, and many more, with the Anaconda distribution, which you can find at:
https://www.anaconda.com/download/
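If you already have a Python environment set up and just want to confirm the packages are available, a quick check like the following (the exact version numbers you see will vary) should run without errors:

import numpy
import pandas
import sklearn

print('NumPy:', numpy.__version__)
print('Pandas:', pandas.__version__)
print('scikit-learn:', sklearn.__version__)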
Start with the necessary imports:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, cohen_kappa_score
from sklearn.metrics import f1_score, recall_score
We need NumPy for some basic mathematical functions and Pandas to read in the CSV file and create the data frame. We will use a number of sklearn.metrics to evaluate the results from our models.
Next, we need to create a couple of helper functions. PrintStats will compile and display the results from a model. Here is the code:
def PrintStats(cmat, y_test, pred):
    # separate out the confusion matrix components
    tneg = cmat[0][0]   # valid transactions classified as valid
    fpos = cmat[0][1]   # valid transactions classified as fraud
    fneg = cmat[1][0]   # fraud transactions classified as valid
    tpos = cmat[1][1]   # fraud transactions classified as fraud
    # calculate F1 and Recall scores
    f1Score = round(f1_score(y_test, pred), 2)
    recallScore = round(recall_score(y_test, pred), 2)
    # calculate and display metrics
    print(cmat)
    print('Accuracy: ' + str(np.round(100 * float(tpos + tneg) / float(tpos + tneg + fpos + fneg), 2)) + '%')
    print('Cohen Kappa: ' + str(np.round(cohen_kappa_score(y_test, pred), 3)))
    print("Sensitivity/Recall for Model : {recall_score}".format(recall_score=recallScore))
    print("F1 Score for Model : {f1_score}".format(f1_score=f1Score))
PrintStats takes as parameters a confusion matrix, test labels, and predicted labels. It prints the confusion matrix itself, then calculates and displays the accuracy, Cohen Kappa, recall (sensitivity), and F1 score.
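To see what PrintStats produces, here is a small, purely illustrative example using made-up labels (not the credit card data):

# toy labels, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 1, 1, 1])
toy_cmat = confusion_matrix(y_true, y_pred)
PrintStats(toy_cmat, y_true, y_pred)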
We also need a function, called RunModel, to actually train the model and generate predictions against the test data. Here is the code:
def RunModel(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train.values.ravel())
    pred = model.predict(X_test)
    matrix = confusion_matrix(y_test, pred)
    return matrix, pred
The RunModel function takes as input the untrained model along with all the test and training data, including labels. It trains the model, runs the prediction using the test data, and returns the confusion matrix along with the predicted labels.
With these two functions created, it’s time to see if we can create a model to do fraud detection. Fraud detection is generally considered a two-class problem. In other words, a transaction is either:
Class #1: Not fraud
Or
Class #2: Fraud
Our goal is to try to determine to which class a particular transaction belongs. Step #1 is to load the CSV data and create the classes. This code will do the trick:
df = pd.read_csv('../Datasets/creditcard.csv')
class_names = {0: 'Not Fraud', 1: 'Fraud'}
print(df.Class.value_counts().rename(index=class_names))
It generates the following result:
Not Fraud    284315
Fraud           492
Name: Class, dtype: int64
This is a fairly typical dataset. Out of nearly 300,000 transactions, only 492 were labelled as fraudulent. That may not seem like much, but each fraudulent transaction can represent a significant expense; together, such transactions may represent billions of dollars of lost revenue each year. The imbalance also poses a problem for detection: such a small percentage of fraudulent transactions makes it harder to weed out the offenders from the overwhelming number of good transactions.
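If you want to see the imbalance as a percentage, a quick sanity check (not part of the main workflow) confirms that fraud makes up well under 1% of the data:

# fraction of transactions labelled as fraud
fraud_ratio = df.Class.value_counts(normalize=True)[1]
print('Fraudulent transactions: {:.3%}'.format(fraud_ratio))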
Step #2 is to define the features we want to use. Normally, we would apply some dimensionality reduction and feature engineering to our data, but that is another article (or two). Instead, we’ll just use the whole dataset here with the following code:
feature_names = df.iloc[:, 1:30].columns
target = df.iloc[:, 30:].columns
data_features = df[feature_names]
data_target = df[target]
With the dataset defined, step #3 is to split the data into training and test sets. To do this, we need to import another function and run the following code:
from sklearn.model_selection import train_test_split

np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(data_features, data_target,
                                                    train_size=0.70, test_size=0.30,
                                                    random_state=1)
The train_test_split function uses a randomizer to separate the data into training and test sets: 70% of the data is for training and 30% is for testing. The NumPy seed and the random_state parameter are set so that the same split is produced on every run.
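As an optional sanity check, you can confirm the split sizes and see how many of the 492 fraud cases landed in each set. Because the split here is not stratified, the fraud cases are not guaranteed to divide exactly 70/30; if you want the class proportions preserved, train_test_split also accepts a stratify parameter.

# optional sanity check on the split
print(len(X_train), len(X_test))
print('Fraud cases in training set:', int(y_train.Class.sum()))
print('Fraud cases in test set:', int(y_test.Class.sum()))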
For step #4, we pick a machine learning technique, or model. Perhaps the most common two-class machine learning technique is logistic regression. We will use that for this first test:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
cmat, pred = RunModel(lr, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)
The output from this run should look like this:
[[85293    15]
 [   57    78]]
Accuracy: 99.92%
Cohen Kappa: 0.684
Sensitivity/Recall for Model : 0.58
F1 Score for Model : 0.68
You might initially think the model did a good job. After all, it got 99.92% of its predictions correct. That is true, except if you look closely at the confusion matrix you will see the following results:
85293 transactions were classified as valid that were actually valid
15 transactions were classified as fraud that were actually valid (type 1 error)
57 transactions were classified as valid that were fraud (type 2 error)
78 transactions were classified as fraud that were fraud
So, while the accuracy was great, the algorithm misclassified more than 4 in 10 fraudulent transactions. In fact, if our algorithm simply classified everything as valid, it would have an accuracy above 99.9% but be entirely useless! So accuracy alone is not a reliable measure of a model’s effectiveness. Instead, we look at other measures such as Cohen Kappa, recall, and the F1 score. In each case, we want to achieve a score as close to 1 as we can.
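To make that last point concrete, scikit-learn’s DummyClassifier can stand in for the “classify everything as valid” baseline. This is an illustrative aside rather than part of the main walk-through; expect a warning about the F1 score, since this baseline never predicts fraud at all:

from sklearn.dummy import DummyClassifier

# baseline that always predicts the majority class ("not fraud")
baseline = DummyClassifier(strategy='most_frequent')
cmat, pred = RunModel(baseline, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)

The accuracy comes out very high, but the recall, F1, and Cohen Kappa scores collapse to zero, which is exactly why we track them.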
Maybe another model will work better. How about a random forest classifier? The code is similar to that for logistic regression:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=4)
cmat, pred = RunModel(rf, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)
Trying this classifier will get you results similar to the following:
[[85297    11]
 [   31   104]]
Accuracy: 99.95%
Cohen Kappa: 0.832
Sensitivity/Recall for Model : 0.77
F1 Score for Model : 0.83
That’s quite a bit better. The accuracy went up only slightly, but the other scores showed significant improvements. So one way to improve our detection is to try different models and see how they perform; clearly changing models helped. But there are other options too. One is over-sampling the fraud records or, conversely, under-sampling the good records. Over-sampling means adding fraud records to our training sample, thereby increasing their overall proportion; under-sampling means removing valid records from the sample, which has the same effect. Either way, changing the sampling makes the algorithm more “sensitive” to fraud transactions.
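As a rough illustration of the over-sampling idea (a naive sketch using plain Pandas, not the approach taken next), you could randomly duplicate fraud rows until the classes balance out. In practice you would over-sample only the training portion, so that duplicated rows do not leak into the test set:

# naive random over-sampling sketch (illustration only)
fraud_df = df[df.Class == 1]
valid_df = df[df.Class == 0]

# draw fraud rows with replacement until they match the number of valid rows
fraud_oversampled = fraud_df.sample(len(valid_df), replace=True, random_state=1)
df_oversampled = pd.concat([valid_df, fraud_oversampled])
print(df_oversampled.Class.value_counts())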
Going back to the logistic regression classifier, let’s see how some under-sampling might improve the overall performance of the model. There are specific techniques, such as SMOTE and ADASYN, designed to strategically sample unbalanced datasets. In our case, let’s under-sample until we have an even split between fraud and valid transactions. It will make the training set pretty small, but the algorithm doesn’t need a lot of data to come up with a good classifier:
fraud_records = len(df[df.Class == 1])

# pull the indices for fraud and valid rows
fraud_indices = df[df.Class == 1].index
normal_indices = df[df.Class == 0].index

# randomly collect an equal-sized sample of valid rows
under_sample_indices = np.random.choice(normal_indices, fraud_records, False)
df_undersampled = df.iloc[np.concatenate([fraud_indices, under_sample_indices]), :]
X_undersampled = df_undersampled.iloc[:, 1:30]
Y_undersampled = df_undersampled.Class
X_undersampled_train, X_undersampled_test, Y_undersampled_train, Y_undersampled_test = train_test_split(X_undersampled, Y_undersampled, test_size=0.3)

lr_undersampled = LogisticRegression(C=1)

# run the new model
cmat, pred = RunModel(lr_undersampled, X_undersampled_train, Y_undersampled_train, X_undersampled_test, Y_undersampled_test)
PrintStats(cmat, Y_undersampled_test, pred)
Now look at the new results:
[[138   1]
 [ 22 135]]
Accuracy: 92.23%
Cohen Kappa: 0.845
Sensitivity/Recall for Model : 0.86
F1 Score for Model : 0.92
The accuracy went down, but all of the other scores went up. Looking at the confusion matrix, you can see a much higher percentage of correct classifications of fraudulent data.
Unfortunately, there is no free lunch. Catching more fraudulent transactions almost always means a correspondingly higher number of valid transactions classified as fraudulent. Now try the “new” logistic regression classifier against the original test data:
cmat, pred = RunModel(lr_undersampled, X_undersampled_train, Y_undersampled_train, X_test, y_test)
PrintStats(cmat, y_test, pred)
This time, the results are:
[[83757  1551]
 [   16   119]]
Accuracy: 98.17%
Cohen Kappa: 0.129
Sensitivity/Recall for Model : 0.88
F1 Score for Model : 0.13
The algorithm was far better at catching fraudulent transactions (16 misclassifications versus 57) but far worse at mislabeling valid ones (1,551 versus 15).
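One way to quantify that trade-off is precision, which measures how many of the transactions flagged as fraud really were fraud. This extra check is not part of the original listing, but scikit-learn provides it directly, and for this run it comes out quite low:

from sklearn.metrics import precision_score

# of the transactions flagged as fraud, what fraction were actually fraud?
print('Precision for Model : ' + str(round(precision_score(y_test, pred), 2)))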
As a data scientist, you have to determine at what point the trade-off is worth it. Generally, the cost of missing a fraudulent transaction is many times greater than the cost of misclassifying a good transaction as fraud. Your job is to find that balance point in your model training and proceed accordingly.
Written by Kevin McCarty