Creating Credit Card Transactions Datasets
Creating Test Data: 2017-2019 Transactions
For this evaluation of data protection methods, we will focus on a subset of the transaction data: approximately 5 million transactions from the years 2017, 2018, and 2019. The original dataset spans from 1991 to 2020. This time slice allows us to:
- Test Data Protection Against Old Data: Assess how well privacy techniques protect historical data.
- Test Against Future Values: Evaluate how well models trained on anonymized data from this period generalize to future, unseen data.
Splitting Logic (Python)
Here's the Python code using Pandas to create the test and training datasets from the selected time slice:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Load the transactions data
transactions_df = pd.read_csv("transactions.csv")

# 2. Filter transactions for 2017, 2018, and 2019
filtered_transactions = transactions_df[
    transactions_df["Year"].isin([2017, 2018, 2019])
]

# 3. Sample approximately 5 million transactions
#    (assumes the filtered slice holds at least this many rows;
#    a fixed random_state makes the sample reproducible, and
#    .copy() avoids a SettingWithCopyWarning when adding a column below)
sampled_transactions = filtered_transactions.sample(
    n=5_000_000, random_state=42
).copy()

# 4. Create "Is Fraud?" column (example logic - adjust as needed)
sampled_transactions["Is Fraud?"] = (
    sampled_transactions["Errors?"]
    .str.contains("Insufficient Balance", na=False)
    .astype(int)
)

# 5. Split into training and testing sets (80% train, 20% test),
#    stratified by "Is Fraud?" to keep the fraud/non-fraud ratio
#    the same in both sets
train_df, test_df = train_test_split(
    sampled_transactions,
    test_size=0.2,
    random_state=42,
    stratify=sampled_transactions["Is Fraud?"],
)

# 6. Write the splits to CSV files
train_df.to_csv("train_transactions.csv", index=False)
test_df.to_csv("test_transactions.csv", index=False)
```
Explanation
- Load Data: The code starts by loading the entire `transactions.csv` into a Pandas DataFrame.
- Filter by Year: We filter the DataFrame to include only transactions from the years 2017, 2018, and 2019.
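As a tiny illustration (the four-row frame below is made up), `isin` keeps only the rows whose Year appears in the given list:

```python
import pandas as pd

# Made-up frame standing in for transactions.csv
df = pd.DataFrame({"Year": [2015, 2017, 2019, 2020], "Amount": [10, 20, 30, 40]})

filtered = df[df["Year"].isin([2017, 2018, 2019])]
print(filtered["Year"].tolist())  # [2017, 2019]
```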
- Sample Transactions: We then randomly sample approximately 5 million transactions from this filtered set.
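Note that `sample(n=...)` raises a `ValueError` if the filtered slice holds fewer rows than requested. A hedged sketch of one defensive variant (the `n_target` value and toy frame are invented for illustration; the main snippet does not include this guard):

```python
import pandas as pd

df = pd.DataFrame({"Year": [2017] * 10, "Amount": range(10)})

# Hypothetical guard: never request more rows than exist
n_target = 5  # stands in for 5_000_000
sampled = df.sample(n=min(n_target, len(df)), random_state=42)
print(len(sampled))  # 5
```

Because `random_state` is fixed, repeating the call returns the same rows.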
- Create "Is Fraud?" Column: An example of how you might create a target variable for a potential downstream task (like fraud detection) is provided. You might need to adjust the logic based on your specific needs.
- Train/Test Split: The `train_test_split` function from `sklearn.model_selection` is used to split the data into training and testing sets. `test_size=0.2` means 20% of the data will be used for testing, and `random_state=42` ensures that the split is reproducible. `stratify=sampled_transactions['Is Fraud?']` ensures that the proportion of fraudulent transactions is similar in both the training and testing sets. This matters if you are considering fraud detection as a downstream task, because fraud is often a rare event.
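On a toy, made-up frame with a 10% positive rate, stratification keeps the label ratio identical in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 10 positives out of 100 rows
df = pd.DataFrame({"x": range(100), "label": [1] * 10 + [0] * 90})

train, test = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)
print(train["label"].mean(), test["label"].mean())  # 0.1 0.1
```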
- Output to CSV: Finally, the training and testing DataFrames are saved as separate CSV files (`train_transactions.csv` and `test_transactions.csv`). This makes it easy to share, load, and visualize the data later.
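A quick sanity check (using an in-memory buffer rather than the real files, just for illustration) shows that writing with `index=False` lets the CSV round-trip back to an identical frame:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

buf = StringIO()
df.to_csv(buf, index=False)  # index=False avoids writing an extra index column
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip.equals(df))  # True
```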
Data Splitting Logic Diagram
By following these steps, we will have created focused training and testing datasets from the years 2017-2019. These datasets can then be used to evaluate the effectiveness of various data protection methods in preserving data utility while protecting sensitive information.