Learning from Transactions

Transaction data is widely shared because it is usually considered "anonymized". This walkthrough shows how private information can still be pulled from anonymized transactions, and how individuals and other sensitive data can be re-identified.

The data comes from a popular Kaggle dataset generated by IBM: https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions. Here is a brief overview:

High-Level Statistics:

Total number of users: 2000
Total number of transactions: 24386900
Total number of cards: 6146
Average transaction amount: 43.63
Median transaction amount: 30.14
Total transaction amount: 1064098117.16
Average yearly income: 45715.88
Median yearly income: 40744.50
Average total debt: 63709.69
Median total debt: 58251.00
Average FICO score: 709.73
Median FICO score: 711.50
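As a quick sanity check, the figures above are internally consistent: dividing the total amount by the number of transactions reproduces the reported average. The short sketch below does that arithmetic and also derives two per-user ratios. The variable names are illustrative, and the per-user ratios are derived here rather than reported in the original statistics.

```python
# Figures copied from the high-level statistics above.
total_users = 2000
total_transactions = 24_386_900
total_cards = 6146
total_amount = 1_064_098_117.16

# Derived values (computed here, not part of the original list).
avg_amount = total_amount / total_transactions
transactions_per_user = total_transactions / total_users
cards_per_user = total_cards / total_users

print(f"Average transaction amount: {avg_amount:.2f}")
print(f"Transactions per user: {transactions_per_user:.0f}")
print(f"Cards per user: {cards_per_user:.2f}")
```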

Credit Card Transactions

Fraud Detection and Other Analyses

Context

Limited credit card transaction data is available for training fraud detection models and other uses, such as analyzing similar purchase patterns. Credit card data that is available often has significant obfuscation, relatively few transactions, and a short time span. For example, another widely used Kaggle dataset has 284,000 transactions over two days, of which fewer than 500 are fraudulent. In addition, all but two of its columns have had a principal components transformation, which obfuscates true values and makes the column values uncorrelated.

Content

The data here has almost no obfuscation and is provided in a CSV file whose schema is described in the first row. This data has more than 20 million transactions generated from a multi-agent virtual world simulation performed by IBM. The data covers 2000 (synthetic) consumers resident in the United States, but who travel the world. The data also covers decades of purchases and includes multiple cards from many of the consumers. Further details about the generation are [here](https://github.com/IBM/ আনুಮಾನ-synthesis-of-transaction-data-for-fraud-detection). Analyses of the data suggest that it is a reasonable match for real data in many dimensions, e.g., fraud rates, purchase amounts, Merchant Category Codes (MCCs), and other metrics. In addition, all columns except merchant name have their "natural" value. Such natural values can be helpful in feature engineering of models.

Understanding the Data

The dataset contains three main tables (users, transactions, and cards). I've also added a table of merchant category codes (MCCs), since MCC descriptions are publicly available.

Here are the database schemas:

CREATE TABLE "users" (
  "Person" TEXT,
  "Current Age" INTEGER,
  "Retirement Age" INTEGER,
  "Birth Year" INTEGER,
  "Birth Month" INTEGER,
  "Gender" TEXT,
  "Address" TEXT,
  "Apartment" REAL,
  "City" TEXT,
  "State" TEXT,
  "Zipcode" TEXT, -- changed from INTEGER to TEXT
  "Latitude" REAL,
  "Longitude" REAL,
  "Per Capita Income - Zipcode" INTEGER, -- changed to INTEGER
  "Yearly Income - Person" INTEGER,      -- changed to INTEGER
  "Total Debt" INTEGER,                  -- changed to INTEGER
  "FICO Score" INTEGER,
  "Num Credit Cards" INTEGER,
  UserID INTEGER
);

CREATE TABLE "transactions" (
  "User" INTEGER,
  "Card" INTEGER,
  "Year" INTEGER,
  "Month" INTEGER,
  "Day" INTEGER,
  "Time" TEXT,
  "Amount" FLOAT,
  "Use Chip" TEXT,
  "Merchant Name" INTEGER,
  "Merchant City" TEXT,
  "Merchant State" TEXT,
  "Zip" TEXT,
  "MCC" INTEGER,
  "Errors?" TEXT,
  "Is Fraud?" TEXT
);

CREATE TABLE "cards" (
  User INT,
  "CARD INDEX" INT,
  "Card Brand" TEXT,
  "Card Type" TEXT,
  "Card Number" INT,
  Expires TEXT,
  CVV INT,
  "Has Chip" TEXT,
  "Cards Issued" INT,
  "Credit Limit" INT,
  "Acct Open Date" TEXT,
  "Year PIN last Changed" INT,
  "Card on Dark Web" TEXT,
  "CreditLimitInt" INTEGER,
  CardID INTEGER
);

CREATE TABLE "mcc_codes" (
  "mcc" INTEGER,
  "edited_description" TEXT,
  "combined_description" TEXT,
  "usda_description" TEXT,
  "irs_description" TEXT,
  "irs_reportable" TEXT
);

Entity-Relationship Diagrams

To better visualize the relationships between these tables, here are the ER diagrams represented using Mermaid syntax:

Users, Cards, Transactions and MCC Codes
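The relationships among the four tables can also be expressed as joins. The sketch below uses Python's built-in sqlite3 module with trimmed-down copies of the schemas. It assumes that transactions."User" refers to users.UserID, that transactions."Card" refers to cards."CARD INDEX" for the same user, and that transactions."MCC" refers to mcc_codes."mcc". These keys are inferred from the column names rather than confirmed by the source, and the rows are toy values invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down versions of the schemas above: only the join columns
# and one descriptive column per table are kept.
conn.executescript("""
CREATE TABLE users (UserID INTEGER, "Person" TEXT);
CREATE TABLE cards ("User" INTEGER, "CARD INDEX" INTEGER, "Card Brand" TEXT);
CREATE TABLE mcc_codes ("mcc" INTEGER, "edited_description" TEXT);
CREATE TABLE transactions ("User" INTEGER, "Card" INTEGER, "Amount" FLOAT, "MCC" INTEGER);

-- Toy rows invented for illustration only.
INSERT INTO users VALUES (0, 'Example Person');
INSERT INTO cards VALUES (0, 0, 'Visa'), (0, 1, 'Mastercard');
INSERT INTO mcc_codes VALUES (5411, 'Grocery Stores, Supermarkets');
INSERT INTO transactions VALUES (0, 1, 25.50, 5411);
""")

# Assumed join keys (inferred from column names, not confirmed by the source).
query = """
SELECT u."Person", c."Card Brand", t."Amount", m."edited_description"
FROM transactions AS t
JOIN users AS u ON u.UserID = t."User"
JOIN cards AS c ON c."User" = t."User" AND c."CARD INDEX" = t."Card"
JOIN mcc_codes AS m ON m."mcc" = t."MCC"
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Joining cards on both the user and the card index matters under this assumption, because a card index on its own would repeat across users.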

Understanding the Data Through Exploratory Analysis

In-Depth Analysis: Credit Card Transactions

As part of understanding our sample datasets, we conducted an exploratory data analysis (EDA) on the synthetic credit card transactions dataset. This page provides an in-depth look at the data’s structure, distributions, and notable patterns—laying the groundwork for subsequent privacy and utility demonstrations.

Overview & Dataset Recap

The dataset simulates 20+ million transactions from approximately 2,000 synthetic consumers over multiple decades. Each transaction includes details such as:

Date and time (Year, Month, Day, and Time columns)
Amount (monetary value)
Merchant Category Code (MCC)
Card (index of the synthetic card used; card details are in the cards table)
Fraud flag (the "Is Fraud?" column)
Merchant location (city, state, and ZIP code)

Note: The dataset was generated by a multi-agent simulation (IBM) and matches real-world patterns (e.g., spending habits, fraud rates, MCC usage) in many dimensions.

1. Dataset Shape & Basic Statistics

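Below is a short Python snippet that illustrates how one might load and inspect the dataset:

import pandas as pd

# Assume 'transactions.csv' is our data file. The full file has tens of
# millions of rows, so loading it all at once takes roughly 2 GB or more.
df = pd.read_csv('transactions.csv')

# Basic shape of the table
print("Number of rows:", len(df))
print("Number of columns:", len(df.columns))

# Preview the first five rows (print() so the output also appears in a script)
print(df.head())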

Summary

Rows (Transactions): ~24.4 million (24,386,900)
Columns (Features): 15 in the transactions table
Time Range: Spans multiple decades of transactions
Memory Footprint: ~2+ GB (uncompressed)

Stat	Value
Approx. Transactions	24.4 million
Avg. Transaction ($)	$43.63
Median Transaction ($)	$30.14
Max Transaction ($)	$18,000 (approx.)
Min Transaction ($)	$0 (e.g. test or waived fee)
% Fraudulent	0.7% – 1.0%

Note: The above numbers are approximate; your exact EDA results may vary slightly depending on data processing steps.
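Given the memory footprint noted above, it can be convenient to compute basic statistics in a single streaming pass instead of loading the whole file. The following is a minimal sketch using only the standard library. The summarize helper is hypothetical, the sample rows are invented, and the code assumes the Amount column holds plain numbers (the real CSV may need extra cleaning, such as stripping currency symbols).

```python
import csv
import io

def summarize(csv_file):
    """Stream rows once, tracking the row count and the mean of the Amount column."""
    count = 0
    total = 0.0
    for row in csv.DictReader(csv_file):
        count += 1
        total += float(row["Amount"])
    mean = total / count if count else 0.0
    return count, mean

# Toy stand-in for the real file. With real data, pass
# open("transactions.csv", newline="") instead of the StringIO object.
sample = io.StringIO("User,Card,Amount\n0,0,10.00\n0,1,30.00\n1,0,50.00\n")
count, mean = summarize(sample)
print(count, mean)
```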

2. Transaction Amount Distributions

Histogram of Transaction Amounts

One of the most telling aspects of credit card data is the transaction amount distribution. Below is a rough histogram distribution (in bin ranges):

Amount Range ($)	% of Transactions
0 – 10	12%
10 – 50	38%
50 – 100	22%
100 – 200	15%
200 – 500	8%
500+	5%

About half of all purchases (50% in the table above) fall under $50, consistent with everyday expenses like coffee, groceries, or small retail items. High-value purchases ($500+) make up a smaller but non-trivial 5% of overall activity, which is useful for risk detection and fraud analysis because fraudsters often attempt large-value transactions.
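The binning behind a table like the one above can be sketched as follows. This is a minimal illustration using the standard library. The amount_histogram helper is hypothetical, bins are treated as half-open intervals [low, high), negative amounts are skipped as a defensive choice, and the sample amounts are invented.

```python
from bisect import bisect_right
from collections import Counter

# Bin edges taken from the table above; each bin is [low, high).
EDGES = [0, 10, 50, 100, 200, 500]
LABELS = ["0-10", "10-50", "50-100", "100-200", "200-500", "500+"]

def amount_histogram(amounts):
    """Return the percentage of amounts falling in each bin."""
    counts = Counter()
    for amount in amounts:
        if amount < 0:
            continue  # defensive: skip negative values such as refunds
        counts[LABELS[bisect_right(EDGES, amount) - 1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100 * counts[label] / total for label in LABELS}

# Toy amounts invented for illustration.
shares = amount_histogram([4.0, 12.5, 30.0, 75.0, 640.0])
print(shares)
```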

Visualization (Sample)

3. Fraud Distribution & Patterns

Fraud detection is a key focus for many credit card datasets. This synthetic dataset includes a fraud flag (the "Is Fraud?" column) that marks fraudulent transactions.

Fraud Frequency

Fraudulent Transactions: ~0.7–1.0%
Legitimate Transactions: ~99–99.3%

Although fraud accounts for a small percentage of total transactions, it remains a critical area for machine learning models.

Fraud by Amount

Transaction Amount	% Fraud (Approx.)
0 – 10	0.3%
10 – 50	0.5%
50 – 100	0.8%
100 – 200	1.2%
200 – 500	1.8%
500+	2.5%

Observation: The fraud rate rises with transaction amount, from about 0.3% for purchases under $10 to about 2.5% for purchases over $500. This is unsurprising, as bad actors often attempt high-value purchases.
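A breakdown like the one above can be computed with a single grouped query. The sketch below uses sqlite3 with a two-column stand-in for the transactions table. It assumes the "Is Fraud?" column stores the strings 'Yes' and 'No', which the schema does not confirm, and the rows are toy values invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE transactions ("Amount" FLOAT, "Is Fraud?" TEXT)')
# Toy rows invented for illustration; 'Yes'/'No' flag values are an assumption.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(5.0, "No"), (8.0, "No"), (120.0, "No"),
     (150.0, "Yes"), (700.0, "Yes"), (900.0, "No")],
)

query = """
SELECT CASE
         WHEN "Amount" < 10  THEN '0-10'
         WHEN "Amount" < 50  THEN '10-50'
         WHEN "Amount" < 100 THEN '50-100'
         WHEN "Amount" < 200 THEN '100-200'
         WHEN "Amount" < 500 THEN '200-500'
         ELSE '500+'
       END AS amount_bin,
       COUNT(*) AS n,
       100.0 * SUM("Is Fraud?" = 'Yes') / COUNT(*) AS fraud_pct
FROM transactions
GROUP BY amount_bin
ORDER BY MIN("Amount")
"""
fraud_by_bin = conn.execute(query).fetchall()
for row in fraud_by_bin:
    print(row)
```

Bins with no transactions do not appear in the result, so a report over real data may need to fill in missing bins explicitly.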

Seasonal/Monthly Trends

A time-series analysis might reveal spikes in fraud around particular holidays or travel seasons—mirroring real-world patterns.

4. Merchant Category Codes (MCCs)

The MCC column (Merchant Category Code) identifies the type of merchant (e.g., grocery stores, fuel, airlines). MCC data is vital for:

Spending pattern analysis
Fraud detection rules (e.g., suspicious MCC combinations)
Consumer segmentation (e.g., frequent traveler vs. local shopper)

Top MCC Categories

MCC	Merchant Type	% of Transactions
5411	Grocery Stores, Supermarkets	18%
5812	Eating Places, Restaurants	15%
5541	Fuel Stations	10%
5732	Electronics Stores	8%
3501–3999	Hotels & Lodging (Various)	7%
Others	Variety of categories	42%

Exact codes and categories vary; above is a simplified snapshot.
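A ranking like the table above comes down to counting codes and looking up their descriptions. The following is a minimal sketch, where the top_mcc_shares helper is hypothetical and the input values are invented for illustration. In practice the descriptions would come from the mcc_codes table.

```python
from collections import Counter

def top_mcc_shares(mcc_values, descriptions, n=3):
    """Return (mcc, description, percent) for the n most frequent codes."""
    counts = Counter(mcc_values)
    total = sum(counts.values())
    return [
        (mcc, descriptions.get(mcc, "Unknown"), round(100 * count / total, 1))
        for mcc, count in counts.most_common(n)
    ]

# Toy inputs invented for illustration; the descriptions mirror the table above.
descriptions = {
    5411: "Grocery Stores, Supermarkets",
    5812: "Eating Places, Restaurants",
}
mcc_column = [5411, 5411, 5411, 5812, 5812, 5541]
top = top_mcc_shares(mcc_column, descriptions)
print(top)
```

Codes missing from the lookup are reported as "Unknown" rather than dropped, which makes gaps in the MCC table visible.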

5. Geospatial Insights

The users table includes latitude/longitude for each consumer's home address, and each transaction records the merchant's city, state, and ZIP code, which together enable geospatial analysis. Consumers are U.S.-based but may travel abroad. Common findings:

Coastal vs. Inland: Spending clusters along major coasts and urban centers (e.g., NYC, LA).
Travel Patterns: Periods of foreign transactions, e.g., Europe or Asia.
Fraud Hotspots: Some fraud rings cluster in certain tourist-heavy locations or e-commerce shipping hubs.
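The transactions schema records a merchant state rather than coordinates, so one simple way to approximate the travel patterns above is to compare each transaction's "Merchant State" with the user's home "State". The sketch below assumes the two columns use the same state codes. The out_of_state_share helper is hypothetical and the inputs are invented for illustration.

```python
def out_of_state_share(home_state_by_user, transactions):
    """Fraction of each user's transactions made outside their home state.

    transactions is an iterable of (user_id, merchant_state) pairs.
    """
    totals = {}
    away = {}
    for user, merchant_state in transactions:
        totals[user] = totals.get(user, 0) + 1
        if merchant_state != home_state_by_user[user]:
            away[user] = away.get(user, 0) + 1
    return {user: away.get(user, 0) / n for user, n in totals.items()}

# Toy inputs invented for illustration.
home_states = {0: "CA", 1: "NY"}
toy_transactions = [(0, "CA"), (0, "CA"), (0, "NV"), (0, "Italy"), (1, "NY")]
travel_share = out_of_state_share(home_states, toy_transactions)
print(travel_share)
```

A high share for a user over a short window would suggest a trip, and the same signal is a strong clue for re-identification because travel histories tend to be distinctive.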

6. Temporal Patterns & Seasonality

Because this dataset covers multiple decades, we can look for long-term trends. Common EDA findings:

Annual Rises: Transactions may consistently spike during December holidays.
Monthly Patterns: Periodic billing cycles or paydays might show peaks at month’s end.
Weekday vs. Weekend: Some categories (restaurants, leisure) see heavier weekend use; other categories (fuel, groceries) remain steady throughout the week.

Deep-diving into these patterns can inform forecasting models, capacity planning, or advanced fraud detection triggers.
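The weekday versus weekend comparison can be derived from the Year, Month, and Day columns with the standard datetime module. The following is a minimal sketch, where the weekend_share helper is hypothetical and the rows are invented for illustration.

```python
from collections import Counter
from datetime import date

def weekend_share(rows):
    """rows: iterable of (year, month, day, mcc). Returns {mcc: weekend fraction}."""
    totals = Counter()
    weekend = Counter()
    for year, month, day, mcc in rows:
        totals[mcc] += 1
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        if date(year, month, day).weekday() >= 5:
            weekend[mcc] += 1
    return {mcc: weekend[mcc] / totals[mcc] for mcc in totals}

# Toy rows invented for illustration (1 February 2020 was a Saturday).
toy_rows = [
    (2020, 2, 1, 5812),  # Saturday, restaurant
    (2020, 2, 2, 5812),  # Sunday, restaurant
    (2020, 2, 3, 5812),  # Monday, restaurant
    (2020, 2, 5, 5411),  # Wednesday, grocery
    (2020, 2, 1, 5411),  # Saturday, grocery
]
weekend_by_mcc = weekend_share(toy_rows)
print(weekend_by_mcc)
```

If spending were spread evenly across the week, each category would sit near 2/7 (about 0.29), so values well above that indicate weekend-heavy categories.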

7. Potential Data Utility & Use Cases

Fraud Detection Modeling
- Train a classification model using features like transaction amount, MCC, location, time-of-day, etc.
- Evaluate precision/recall on the ~1% fraudulent samples.
Consumer Behavior Segmentation
- Cluster consumers by spending patterns (e.g., frequent travelers vs. local spenders).
- Use merchant categories and transaction frequency as segmentation features.
Anonymization & Privacy Demonstrations
- Show how k-anonymity or differential privacy can mask or aggregate sensitive details (e.g., geolocations).
- Illustrate re-identification risk if data is shared in a naive manner.
Time-Series Forecasting
- Predict monthly or weekly transaction volume for resource planning.
- Identify seasonal spikes (e.g., holidays, back-to-school, major travel periods).
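The evaluation step in the fraud detection use case above deserves care because the classes are so imbalanced. The sketch below computes precision and recall by hand on invented labels (1 = fraud, 0 = legitimate) and shows why accuracy alone is misleading. The precision_recall helper and the two toy "models" are hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels invented for illustration: 1 fraud among 100 transactions.
y_true = [1] + [0] * 99

# A "model" that never flags fraud is 99% accurate but catches nothing.
never_flags = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, never_flags)) / len(y_true)
baseline = precision_recall(y_true, never_flags)

# A model that flags the fraud plus one false alarm.
flags_two = [1, 1] + [0] * 98
better = precision_recall(y_true, flags_two)

print(accuracy, baseline, better)
```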

8. Summary & Next Steps

Our exploratory data analysis confirms that this credit card transactions dataset offers:

Rich, realistic behavior across various merchant categories, transaction amounts, and geolocations.
Valuable potential for fraud detection, segmentation, and time-series modeling.
Numerous privacy challenges if shared or published without proper anonymization (as it contains geolocations, potential outliers, and unique spending patterns).
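One way to quantify the privacy challenges listed above is a k-anonymity check: group records by a set of quasi-identifiers and look at the smallest group. The sketch below is a minimal illustration using column names from the users schema. The k_anonymity and generalize helpers, the choice of quasi-identifiers, and the records are all assumptions made for this example.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return (k, unique_count): the smallest group size and the number of
    records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    k = min(sizes.values())
    unique_count = sum(1 for key in keys if sizes[key] == 1)
    return k, unique_count

def generalize(record):
    """Coarsen ZIP code to its first 3 digits and birth year to its decade."""
    return {
        "Zipcode": record["Zipcode"][:3],
        "Birth Year": record["Birth Year"] // 10 * 10,
        "Gender": record["Gender"],
    }

# Toy records invented for illustration.
records = [
    {"Zipcode": "91750", "Birth Year": 1980, "Gender": "Female"},
    {"Zipcode": "91750", "Birth Year": 1980, "Gender": "Female"},
    {"Zipcode": "91752", "Birth Year": 1984, "Gender": "Female"},
    {"Zipcode": "91754", "Birth Year": 1987, "Gender": "Female"},
]
qi = ["Zipcode", "Birth Year", "Gender"]

raw = k_anonymity(records, qi)
coarse = k_anonymity([generalize(r) for r in records], qi)
print(raw, coarse)
```

A result of k = 1 means at least one person is uniquely identifiable from those columns alone. Generalizing the columns raises k at the cost of analytical detail, which is the utility versus risk trade-off in miniature.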

Moving forward, we will use these EDA insights to design privacy protection examples, show how to measure data utility vs. risk, and highlight best practices for sharing sensitive financial data safely.
