Learning from Transactions
Transaction data is widely shared because it is usually considered "anonymized". This walkthrough shows how private information can still be pulled from anonymized transactions, and how individuals and other sensitive details can be re-identified.
The data comes from a popular Kaggle dataset generated by IBM: https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions. Here is a short summary:
High-Level Statistics:
- Total number of users: 2000
- Total number of transactions: 24386900
- Total number of cards: 6146
- Average transaction amount: 43.63
- Median transaction amount: 30.14
- Total transaction amount: 1064098117.16
- Average yearly income: 45715.88
- Median yearly income: 40744.50
- Average total debt: 63709.69
- Median total debt: 58251.00
- Average FICO score: 709.73
- Median FICO score: 711.50
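As a rough illustration of how statistics like these can be derived, here is a minimal pandas sketch. The toy rows are made up, and it assumes that "Card" is a per-user card index, so a card is identified by the ("User", "Card") pair.

```python
import pandas as pd

# Toy stand-in for the transactions table; values are invented for illustration.
transactions = pd.DataFrame({
    "User": [0, 0, 1, 1, 2],
    "Card": [0, 1, 0, 0, 0],
    "Amount": [12.50, 80.00, 30.00, 5.25, 47.25],
})

summary = {
    "users": transactions["User"].nunique(),
    "transactions": len(transactions),
    # Assumption: Card is an index within each user, so count distinct pairs.
    "cards": len(transactions[["User", "Card"]].drop_duplicates()),
    "mean_amount": round(transactions["Amount"].mean(), 2),
    "median_amount": round(transactions["Amount"].median(), 2),
    "total_amount": round(transactions["Amount"].sum(), 2),
}
print(summary)
```

The same aggregations apply unchanged to the full table once it is loaded.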
Credit Card Transactions
Fraud Detection and Other Analyses
Context
Limited credit card transaction data is available for training fraud detection models and other uses, such as analyzing similar purchase patterns. Credit card data that is available often has significant obfuscation, relatively few transactions, and short time duration. For example, one widely used Kaggle dataset has 284,000 transactions over two days, of which fewer than 500 are fraudulent. In addition, all but two columns have had a principal components transformation, which obfuscates true values and makes the column values uncorrelated.
Content
The data here has almost no obfuscation and is provided in a CSV file whose schema is described in the first row. This data has more than 20 million transactions generated from a multi-agent virtual world simulation performed by IBM. The data covers 2000 (synthetic) consumers resident in the United States, but who travel the world. The data also covers decades of purchases and includes multiple cards from many of the consumers. Further details about the generation are available from IBM. Analyses of the data suggest that it is a reasonable match for real data in many dimensions, e.g., fraud rates, purchase amounts, Merchant Category Codes (MCCs), and other metrics. In addition, all columns except merchant name have their "natural" value. Such natural values can be helpful in feature engineering of models.
Understanding the Data
The dataset contains three main tables (users, transactions, and cards). I have also added a table of merchant category codes (MCCs), which are publicly documented codes that classify merchants by business type.
Here are the database schemas:
```sql
CREATE TABLE "users" (
"Person" TEXT,
"Current Age" INTEGER,
"Retirement Age" INTEGER,
"Birth Year" INTEGER,
"Birth Month" INTEGER,
"Gender" TEXT,
"Address" TEXT,
"Apartment" REAL,
"City" TEXT,
"State" TEXT,
"Zipcode" TEXT, -- changed from INTEGER to TEXT
"Latitude" REAL,
"Longitude" REAL,
"Per Capita Income - Zipcode" INTEGER, -- changed to INTEGER
"Yearly Income - Person" INTEGER, -- changed to INTEGER
"Total Debt" INTEGER, -- changed to INTEGER
"FICO Score" INTEGER,
"Num Credit Cards" INTEGER,
UserID INTEGER
);
CREATE TABLE "transactions" (
"User" INTEGER,
"Card" INTEGER,
"Year" INTEGER,
"Month" INTEGER,
"Day" INTEGER,
"Time" TEXT,
"Amount" FLOAT,
"Use Chip" TEXT,
"Merchant Name" INTEGER,
"Merchant City" TEXT,
"Merchant State" TEXT,
"Zip" TEXT,
"MCC" INTEGER,
"Errors?" TEXT,
"Is Fraud?" TEXT
);
CREATE TABLE "cards" (
User INT,
"CARD INDEX" INT,
"Card Brand" TEXT,
"Card Type" TEXT,
"Card Number" INT,
Expires TEXT,
CVV INT,
"Has Chip" TEXT,
"Cards Issued" INT,
"Credit Limit" INT,
"Acct Open Date" TEXT,
"Year PIN last Changed" INT,
"Card on Dark Web" TEXT,
"CreditLimitInt" INTEGER,
CardID INTEGER
);
CREATE TABLE "mcc_codes" (
"mcc" INTEGER,
"edited_description" TEXT,
"combined_description" TEXT,
"usda_description" TEXT,
"irs_description" TEXT,
"irs_reportable" TEXT
);
```
Entity-Relationship Diagrams
To better visualize the relationships between these tables, here are the ER diagrams represented using Mermaid syntax:
Users, Cards, Transactions and MCC Codes
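The relationships between the four tables can also be sketched as joins. The key mapping below is an assumption inferred from the column names in the schemas (transactions "User" to users "UserID", transactions "Card" to cards "CARD INDEX" within a user, and transactions "MCC" to mcc_codes "mcc"); the rows are toy data.

```python
import pandas as pd

users = pd.DataFrame({"UserID": [0, 1], "State": ["CA", "NY"]})
cards = pd.DataFrame({
    "User": [0, 0, 1],
    "CARD INDEX": [0, 1, 0],
    "Card Brand": ["Visa", "Amex", "Mastercard"],
})
mcc_codes = pd.DataFrame({
    "mcc": [5411, 5812],
    "edited_description": ["Grocery Stores", "Restaurants"],
})
transactions = pd.DataFrame({
    "User": [0, 0, 1],
    "Card": [1, 0, 0],
    "Amount": [20.0, 35.5, 12.0],
    "MCC": [5411, 5812, 5411],
})

joined = (
    transactions
    # Assumed key: transactions.User -> users.UserID
    .merge(users, left_on="User", right_on="UserID", how="left")
    # Assumed key: (User, Card) -> (User, CARD INDEX)
    .merge(cards.rename(columns={"CARD INDEX": "Card"}), on=["User", "Card"], how="left")
    # Assumed key: transactions.MCC -> mcc_codes.mcc
    .merge(mcc_codes, left_on="MCC", right_on="mcc", how="left")
)
print(joined[["User", "Card Brand", "State", "Amount", "edited_description"]])
```

Left joins keep every transaction even when a lookup row is missing, which makes gaps in the reference tables easy to spot.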
Understanding the Data Through Exploratory Analysis
In-Depth Analysis: Credit Card Transactions
As part of understanding our sample datasets, we conducted an exploratory data analysis (EDA) on the synthetic credit card transactions dataset. This page provides an in-depth look at the data’s structure, distributions, and notable patterns—laying the groundwork for subsequent privacy and utility demonstrations.
Overview & Dataset Recap
The dataset simulates 20+ million transactions from approximately 2,000 synthetic consumers over multiple decades. Each transaction includes details such as:
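- Timestamp (year, month, day, and time of purchase)
- Amount (monetary value)
- Merchant Category Code (MCC)
- Card (the index of the consumer's card, which maps to a synthetic card number in the cards table)
- Fraud flag (indicating fraudulent activity)
- Merchant location (city, state, and ZIP code; latitude/longitude is stored per consumer in the users table)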
Note: The dataset was generated by a multi-agent simulation (IBM) and matches real-world patterns (e.g., spending habits, fraud rates, MCC usage) in many dimensions.
1. Dataset Shape & Basic Statistics
Below is a short Python snippet that illustrates how one might load and inspect the dataset:

```python
import pandas as pd

# Assume 'transactions.csv' is our data file
df = pd.read_csv('transactions.csv')

print("Number of rows:", len(df))
print("Number of columns:", len(df.columns))

# Preview the first five rows (print() so it also shows outside a notebook)
print(df.head())
```
Summary
- Rows (Transactions): ~24.4 million
- Columns (Features): 15 in the transactions schema shown earlier
- Time Range: Spans multiple decades of transactions
- Memory Footprint: ~2+ GB (uncompressed)
| Stat | Value |
|---|---|
| Approx. Transactions | 24.4 million |
| Avg. Transaction ($) | $43.63 |
| Median Transaction ($) | $30.14 |
| Max Transaction ($) | $18,000 (approx.) |
| Min Transaction ($) | $0 (e.g. test or waived fee) |
| % Fraudulent | 0.7% – 1.0% |
Note: The transaction count, average, and median match the high-level statistics listed at the top of this page. The remaining numbers are approximate; your exact EDA results may vary slightly depending on data processing steps.
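Because the uncompressed file is large, the same basic statistics can be accumulated in chunks instead of loading everything at once. This is a minimal sketch using the `chunksize` option of `pd.read_csv`; the in-memory CSV is a tiny hypothetical stand-in, and the "Yes"/"No" encoding of the "Is Fraud?" column is an assumption.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file (hypothetical contents).
csv_text = (
    "User,Card,Amount,Is Fraud?\n"
    "0,0,10.0,No\n"
    "0,1,20.0,No\n"
    "1,0,30.0,Yes\n"
    "1,0,40.0,No\n"
    "2,0,50.0,No\n"
)

row_count = 0
amount_sum = 0.0
fraud_count = 0
# chunksize makes read_csv yield DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    row_count += len(chunk)
    amount_sum += chunk["Amount"].sum()
    fraud_count += int((chunk["Is Fraud?"] == "Yes").sum())  # assumed encoding

print("rows:", row_count)
print("mean amount:", amount_sum / row_count)
print("fraud rate:", fraud_count / row_count)
```

For the real file, replace the `io.StringIO` object with the CSV path and pick a chunk size that fits comfortably in memory. Means and counts can be accumulated this way, whereas an exact median cannot.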
2. Transaction Amount Distributions
Histogram of Transaction Amounts
One of the most telling aspects of credit card data is the transaction amount distribution. Below is an approximate breakdown of transactions by amount range:
| Amount Range ($) | % of Transactions |
|---|---|
| 0 – 10 | 12% |
| 10 – 50 | 38% |
| 50 – 100 | 22% |
| 100 – 200 | 15% |
| 200 – 500 | 8% |
| 500+ | 5% |
Roughly half of all purchases fall under $50, consistent with everyday expenses like coffee, groceries, or small retail items. High-value purchases ($500+) make up a smaller but non-trivial share of overall activity (about 5%), which is useful for risk detection and fraud analysis, as fraudsters often attempt large-value transactions.
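A breakdown like the one above can be computed with `pd.cut`. The sketch below uses made-up amounts and the same bin edges as the table; treating each bin as closed on the left is an assumption, since the table does not say which side includes the boundary.

```python
import pandas as pd

# Made-up transaction amounts for illustration.
amounts = pd.Series([3.5, 8.0, 12.0, 25.0, 49.99, 75.0, 150.0, 320.0, 900.0, 45.0])

edges = [0, 10, 50, 100, 200, 500, float("inf")]
labels = ["0-10", "10-50", "50-100", "100-200", "200-500", "500+"]

# right=False makes each bin [low, high), so a $10.00 purchase lands in "10-50".
# Negative amounts (for example refunds) would fall outside every bin and become NaN.
bins = pd.cut(amounts, bins=edges, labels=labels, right=False)

# Share of transactions per bin, in percent, in bin order.
share = bins.value_counts(normalize=True).sort_index() * 100
print(share.round(1))
```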
3. Fraud Distribution & Patterns
Fraud detection is a key focus for many credit card datasets. This synthetic dataset includes a fraud flag (the "Is Fraud?" column in the transactions table) indicating fraudulent transactions.
Fraud Frequency
- Fraudulent Transactions: ~0.7–1.0%
- Legitimate Transactions: ~99–99.3%
Although fraud accounts for a small percentage of total transactions, it remains a critical area for machine learning models.
Fraud by Amount
| Transaction Amount | % Fraud (Approx.) |
|---|---|
| 0 – 10 | 0.3% |
| 10 – 50 | 0.5% |
| 50 – 100 | 0.8% |
| 100 – 200 | 1.2% |
| 200 – 500 | 1.8% |
| 500+ | 2.5% |
Observation: The fraud rate rises steadily with transaction amount, from about 0.3% for purchases under $10 to about 2.5% for purchases of $500 or more. This is unsurprising, as bad actors often attempt high-value purchases.
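The fraud rate per amount range can be sketched by combining `pd.cut` with a group-by. The rows are toy data, and the "Yes"/"No" encoding of "Is Fraud?" is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Amount": [5.0, 8.0, 20.0, 40.0, 60.0, 150.0, 300.0, 700.0, 900.0, 30.0],
    "Is Fraud?": ["No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
})

# Convert the text flag to a boolean (assumed encoding: "Yes" means fraud).
df["is_fraud"] = df["Is Fraud?"].eq("Yes")

edges = [0, 10, 50, 100, 200, 500, float("inf")]
labels = ["0-10", "10-50", "50-100", "100-200", "200-500", "500+"]
df["amount_bin"] = pd.cut(df["Amount"], bins=edges, labels=labels, right=False)

# The mean of a boolean column is the share of True values, i.e. the fraud rate.
fraud_rate = df.groupby("amount_bin", observed=True)["is_fraud"].mean() * 100
print(fraud_rate.round(1))
```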
Seasonal/Monthly Trends
A time-series analysis might reveal spikes in fraud around particular holidays or travel seasons—mirroring real-world patterns.
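One way to check for such spikes is to compute the fraud rate per calendar month. The sketch below uses toy rows and assumes the "Yes"/"No" encoding of "Is Fraud?".

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2019, 2019, 2019],
    "Month": [11, 11, 12, 12, 12, 12],
    "Is Fraud?": ["No", "No", "Yes", "No", "Yes", "No"],
})

monthly = (
    df.assign(is_fraud=df["Is Fraud?"].eq("Yes"))  # assumed encoding
      .groupby(["Year", "Month"])["is_fraud"]
      # Named aggregation: one column for volume, one for the fraud share.
      .agg(transactions="count", fraud_rate="mean")
)
print(monthly)
```

Comparing `fraud_rate` across months, alongside `transactions`, helps separate a genuine rise in fraud from a simple rise in volume.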
4. Merchant Category Codes (MCCs)
The MCC column (merchant category code) identifies the type of merchant (e.g., grocery stores, fuel, airlines). MCC data is vital for:
- Spending pattern analysis
- Fraud detection rules (e.g., suspicious MCC combinations)
- Consumer segmentation (e.g., frequent traveler vs. local shopper)
Top MCC Categories
| MCC | Merchant Type | % of Transactions |
|---|---|---|
| 5411 | Grocery Stores, Supermarkets | 18% |
| 5812 | Eating Places, Restaurants | 15% |
| 5541 | Fuel Stations | 10% |
| 5732 | Electronics Stores | 8% |
| 3500–3999 | Hotels & Lodging (Various) | 7% |
| Others | Variety of categories | 42% |
Exact codes and categories vary; above is a simplified snapshot.
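A table of this kind can be derived by counting transactions per MCC and joining the descriptions from the mcc_codes table. This is a sketch on toy data; the description strings are illustrative.

```python
import pandas as pd

transactions = pd.DataFrame({"MCC": [5411, 5411, 5812, 5541, 5411, 5812, 5732, 5411]})
mcc_codes = pd.DataFrame({
    "mcc": [5411, 5812, 5541, 5732],
    "edited_description": [
        "Grocery Stores, Supermarkets",
        "Eating Places, Restaurants",
        "Service Stations",
        "Electronics Stores",
    ],
})

share = (
    transactions["MCC"].value_counts(normalize=True)  # fraction of rows per code
    .rename_axis("mcc")
    .reset_index(name="share")
    .merge(mcc_codes, on="mcc", how="left")           # attach readable names
)
share["share_pct"] = (share["share"] * 100).round(1)
print(share[["mcc", "edited_description", "share_pct"]])
```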
5. Geospatial Insights
The users table includes latitude/longitude for each consumer's home address, and each transaction records the merchant's city, state, and ZIP code, which together enable geospatial analysis. Consumers are U.S.-based but may travel abroad. Common findings:
- Coastal vs. Inland: Spending clusters along major coasts and urban centers (e.g., NYC, LA).
- Travel Patterns: Periods of foreign transactions, e.g., Europe or Asia.
- Fraud Hotspots: Some fraudulent rings cluster in certain tourist-heavy locations or e-commerce shipping hubs.
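Travel patterns can be approximated by comparing each merchant's state with the consumer's home state. The sketch below uses toy rows, assumes transactions "User" maps to users "UserID", and assumes that "Merchant State" holds a country name for foreign purchases.

```python
import pandas as pd

users = pd.DataFrame({"UserID": [0, 1], "State": ["CA", "NY"]})
transactions = pd.DataFrame({
    "User": [0, 0, 0, 1, 1],
    # Assumption: foreign purchases carry a country name in this column.
    "Merchant State": ["CA", "CA", "Italy", "NY", "NJ"],
})

merged = transactions.merge(users, left_on="User", right_on="UserID", how="left")

# True when the purchase happened outside the consumer's home state.
# Rows with a missing merchant state (for example online purchases) also
# compare as "away", so they may need separate handling.
merged["away_from_home"] = merged["Merchant State"] != merged["State"]

away_share = merged.groupby("User")["away_from_home"].mean()
print(away_share)
```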
6. Temporal Patterns & Seasonality
Because this dataset covers multiple decades, we can look for long-term trends. Common EDA findings:
- Annual Rises: Transactions may consistently spike during December holidays.
- Monthly Patterns: Periodic billing cycles or paydays might show peaks at month’s end.
- Weekday vs. Weekend: Some categories (restaurants, leisure) see heavier weekend use; other categories (fuel, groceries) remain steady throughout the week.
Deep-diving into these patterns can inform forecasting models, capacity planning, or advanced fraud detection triggers.
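To study these patterns, the separate date and time columns first need to be combined into one timestamp. This is a minimal sketch on toy rows; the "HH:MM" format of the "Time" column is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2019],
    "Month": [12, 12, 12, 11],
    "Day": [20, 21, 22, 30],
    "Time": ["09:15", "19:40", "12:05", "23:59"],  # assumed HH:MM format
    "Amount": [12.0, 80.0, 45.0, 20.0],
})

# to_datetime accepts a frame with year/month/day columns.
date_part = pd.to_datetime(df[["Year", "Month", "Day"]].rename(columns=str.lower))
df["timestamp"] = date_part + pd.to_timedelta(df["Time"] + ":00")

# dayofweek: Monday is 0, so 5 and 6 are Saturday and Sunday.
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5

by_weekend = df.groupby("is_weekend")["Amount"].agg(["count", "sum"])
print(by_weekend)
```

The same `timestamp` column supports grouping by month, day of month, or hour for the other patterns listed above.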
7. Potential Data Utility & Use Cases
- Fraud Detection Modeling
  - Train a classification model using features like transaction amount, MCC, location, time-of-day, etc.
  - Evaluate precision/recall on the ~1% fraudulent samples.
- Consumer Behavior Segmentation
  - Cluster consumers by spending patterns (e.g., frequent travelers vs. local spenders).
  - Use merchant categories and transaction frequency as segmentation features.
- Anonymization & Privacy Demonstrations
  - Show how k-anonymity or differential privacy can mask or aggregate sensitive details (e.g., geolocations).
  - Illustrate re-identification risk if data is shared in a naive manner.
- Time-Series Forecasting
  - Predict monthly or weekly transaction volume for resource planning.
  - Identify seasonal spikes (e.g., holidays, back-to-school, major travel periods).
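As a small illustration of the k-anonymity idea mentioned above, the sketch below measures the smallest group size over a set of quasi-identifiers, before and after generalizing them. The choice of quasi-identifiers, the helper name `k_anonymity`, and the toy rows are all hypothetical.

```python
import pandas as pd

users = pd.DataFrame({
    "Birth Year": [1980, 1984, 1975, 1971, 1992, 1998],
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Female"],
    "Zipcode": ["94110", "94117", "10001", "10003", "60614", "60657"],
})
quasi_identifiers = ["Birth Year", "Gender", "Zipcode"]

def k_anonymity(df, cols):
    """Size of the smallest group sharing the same values in cols."""
    return int(df.groupby(cols).size().min())

k_raw = k_anonymity(users, quasi_identifiers)

# Generalize: birth year to its decade, ZIP code to its first three digits.
generalized = users.assign(**{
    "Birth Year": users["Birth Year"] // 10 * 10,
    "Zipcode": users["Zipcode"].str[:3],
})
k_generalized = k_anonymity(generalized, quasi_identifiers)

print("k before generalization:", k_raw)
print("k after generalization:", k_generalized)
```

A value of k = 1 means at least one person is unique on these attributes. Generalization raises k at the cost of less precise data, which is the utility versus risk trade-off in miniature.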
8. Summary & Next Steps
Our exploratory data analysis confirms that this credit card transactions dataset offers:
- Rich, realistic behavior across various merchant categories, transaction amounts, and geolocations.
- Valuable potential for fraud detection, segmentation, and time-series modeling.
- Numerous privacy challenges if shared or published without proper anonymization (as it contains geolocations, potential outliers, and unique spending patterns).
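The risk from unique spending patterns can be made concrete with a small sketch: given a few (day, merchant) observations about a person, count how many users in the data are consistent with all of them. The toy rows, the simplified "Day" column, and the helper `matching_users` are hypothetical.

```python
import pandas as pd

transactions = pd.DataFrame({
    "User": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "Day": [1, 2, 3, 1, 2, 4, 1, 5, 6],
    "Merchant Name": [111, 222, 333, 111, 222, 444, 111, 555, 666],
})

def matching_users(df, known_points):
    """Return users whose history contains every (Day, Merchant Name) pair."""
    candidates = set(df["User"])
    for day, merchant in known_points:
        hit = df[(df["Day"] == day) & (df["Merchant Name"] == merchant)]
        candidates &= set(hit["User"])
    return candidates

one_point = matching_users(transactions, [(1, 111)])
two_points = matching_users(transactions, [(1, 111), (2, 222)])
three_points = matching_users(transactions, [(1, 111), (2, 222), (3, 333)])

print(len(one_point), len(two_points), len(three_points))
```

Each extra known purchase shrinks the candidate set, and once it contains a single user, that user's whole transaction history is linked to the person even though no name appears in the data.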
Moving forward, we will use these EDA insights to design privacy protection examples, show how to measure data utility vs. risk, and highlight best practices for sharing sensitive financial data safely.