Understanding Sample Datasets
This page gives an overview of the sample datasets we’ll use to demonstrate data utility and protection concepts. We start with credit card transaction data, a rich and detailed example, and plan to add further datasets (e.g., demographics, logistics, healthcare) over time.
Why Use Sample Datasets?
Sample (or synthetic) datasets allow us to:
- Experiment Safely: Work with realistic data without risking exposure of real personal or financial records.
- Demonstrate Techniques: Showcase privacy-enhancing technologies, anonymization methods, and data utility strategies.
- Facilitate Collaboration: Share datasets publicly for educational or research purposes, avoiding privacy violations.
Credit Card Transactions
Currently, our primary dataset focuses on credit card transactions, inspired by publicly available resources such as this Kaggle dataset.
Dataset Highlights
- Size: More than 20 million transactions.
- Timespan: Spans decades of purchases, capturing long-term consumer behavior.
- Population: ~2,000 synthetic consumers, primarily U.S.-based but traveling globally.
- Fraud Labeling: Includes flagged fraudulent transactions that approximate real-world fraud rates.
- Multiple Cards: Many consumers have multiple active cards.
- Natural Values: Most columns (except merchant name) retain unobfuscated values, which is especially useful for feature engineering.
In the diagram above:
- A represents the multi-agent simulation performed by IBM to emulate real-world consumer spending.
- B shows the synthetic data generation step, which produces realistic yet fabricated transaction records.
- C represents the final dataset with over 20 million transactions.
- D indicates the structured CSV format with a detailed schema.
- E highlights how we plan to use this data to test and demonstrate data utility and privacy methods.
Why This Dataset?
- Realistic Patterns: Covers typical purchase behaviors, Merchant Category Codes (MCCs), international transactions, and historical contexts.
- Fraud Analysis: The dataset includes fraud labels, which support use cases such as fraud detection modeling.
- Rich Feature Space: Natural values (e.g., geolocations, timestamps) allow for feature engineering in analytics or ML pipelines.
- Open and Synthetic: Freely available and synthetic—no real individuals are at risk.
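As a rough illustration of the fraud analysis and feature engineering points above, the sketch below derives two time-based features from a timestamp and computes a fraud rate per MCC. The records are made up for this example, and the field names (`timestamp`, `amount`, `mcc`, `is_fraud`) are assumptions; the real CSV may name or encode them differently.

```python
from collections import defaultdict
from datetime import datetime

# Made-up records for illustration only; not taken from the real dataset.
records = [
    {"timestamp": "2019-03-01T09:15:00", "amount": 12.50, "mcc": "5411", "is_fraud": 0},
    {"timestamp": "2019-03-01T23:40:00", "amount": 250.00, "mcc": "5812", "is_fraud": 1},
    {"timestamp": "2019-03-02T13:05:00", "amount": 7.25, "mcc": "5411", "is_fraud": 0},
    {"timestamp": "2019-03-03T02:10:00", "amount": 99.00, "mcc": "5812", "is_fraud": 0},
]


def add_time_features(record):
    """Return a copy of the record with hour-of-day and weekend features."""
    ts = datetime.fromisoformat(record["timestamp"])
    # weekday() is 0 for Monday, so 5 and 6 are Saturday and Sunday.
    return {**record, "hour": ts.hour, "is_weekend": ts.weekday() >= 5}


def fraud_rate_by_mcc(rows):
    """Return the fraction of transactions flagged as fraud for each MCC."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for row in rows:
        totals[row["mcc"]] += 1
        frauds[row["mcc"]] += int(row["is_fraud"])
    return {mcc: frauds[mcc] / totals[mcc] for mcc in totals}


features = [add_time_features(r) for r in records]
rates = fraud_rate_by_mcc(records)
```

Features like `hour` and `is_weekend` are only possible because the timestamps are stored as natural values; an obfuscated timestamp would rule them out.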
Draft Schema (High-Level)
While the actual CSV includes many columns, here’s a simplified look at some key fields:
| Column | Description |
|---|---|
| transaction_id | Unique identifier for each purchase |
| timestamp | Date & time of the transaction |
| amount | Monetary value of the transaction |
| card_number | Synthetic credit card identifier |
| merchant_category_code (mcc) | Category code indicating the type of merchant |
| is_fraud | Boolean or numeric flag indicating fraudulent activity |
| latitude | Latitudinal coordinate of the purchase location (if any) |
| longitude | Longitudinal coordinate of the purchase location (if any) |
| customer_id | Synthetic identifier for the consumer |
Note: Some columns may have partial or full obfuscation (e.g., merchant names might be replaced with placeholders).
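To make the draft schema more concrete, here is a minimal sketch of reading rows with these fields into typed records using only the Python standard library. The column names follow the simplified table (with `mcc` as the header for the merchant category code), and the sample rows, the ISO timestamp format, and the `Transaction` class are assumptions for illustration; the actual CSV has more columns and may use different names and formats.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Made-up rows for illustration only; not taken from the real dataset.
SAMPLE_CSV = """transaction_id,timestamp,amount,card_number,mcc,is_fraud,latitude,longitude,customer_id
t1,2019-03-01T09:15:00,12.50,card_001,5411,0,40.71,-74.01,c1
t2,2019-03-01T21:40:00,250.00,card_002,5812,1,,,c1
t3,2019-03-02T13:05:00,7.25,card_003,5411,0,34.05,-118.24,c2
"""


@dataclass
class Transaction:
    """Hypothetical typed view of one row of the simplified schema."""
    transaction_id: str
    timestamp: datetime
    amount: float
    card_number: str
    mcc: str
    is_fraud: bool
    latitude: Optional[float]
    longitude: Optional[float]
    customer_id: str


def _optional_float(value: str) -> Optional[float]:
    """Parse a float, treating an empty cell as a missing value."""
    value = value.strip()
    return float(value) if value else None


def parse_transactions(text: str) -> list[Transaction]:
    """Parse CSV text with the assumed header into Transaction records."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append(
            Transaction(
                transaction_id=row["transaction_id"],
                timestamp=datetime.fromisoformat(row["timestamp"]),
                amount=float(row["amount"]),
                card_number=row["card_number"],
                mcc=row["mcc"],  # kept as a string so leading zeros survive
                # The flag may be boolean or numeric, so accept both spellings.
                is_fraud=row["is_fraud"].strip().lower() in {"1", "true", "yes"},
                latitude=_optional_float(row["latitude"]),
                longitude=_optional_float(row["longitude"]),
                customer_id=row["customer_id"],
            )
        )
    return rows


transactions = parse_transactions(SAMPLE_CSV)
```

Latitude and longitude are optional here because the schema marks the purchase location as "if any"; a transaction without coordinates parses to `None` rather than failing.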
Future Expansions
We plan to add more datasets over time, such as:
- Demographic datasets (e.g., synthetic census-like data)
- Healthcare (e.g., anonymized patient visits, medical procedures)
- Retail (e.g., ecommerce transactions, inventory management)
- Financial markets (e.g., trading logs with obfuscated tickers)
Each dataset will serve different use cases and demonstrate distinct privacy and utility challenges.
Conclusion
Understanding our sample datasets is crucial for grasping how we’ll apply privacy and utility techniques in practical scenarios. Starting with credit card transactions, we’ll show how synthetic yet realistic data can help us explore:
- De-identification and re-identification risks
- Fraud detection workflows and machine learning pipelines
- Advanced privacy mechanisms (e.g., differential privacy, homomorphic encryption)
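As one small example of the de-identification topic, the sketch below replaces direct identifiers with keyed pseudonyms using HMAC-SHA256 from the Python standard library. This is an illustrative technique chosen for this page, not a method the dataset or its authors prescribe; the key, the helper names, and the sample record are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration; in practice it would be generated
# randomly and stored separately from the data.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of value as a 64-character hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_record(record: dict, fields=("card_number", "customer_id")) -> dict:
    """Copy the record, replacing the listed identifier fields with pseudonyms."""
    return {
        name: pseudonymize(str(value)) if name in fields else value
        for name, value in record.items()
    }


# Made-up record for illustration only.
record = {"customer_id": "c1", "card_number": "card_001", "amount": 12.50, "mcc": "5411"}
masked = pseudonymize_record(record)
```

The same input always maps to the same pseudonym under a given key, so records can still be grouped per card or per customer. That also means the remaining fields (timestamps, locations, amounts) can act as quasi-identifiers, which is why pseudonymization alone does not remove re-identification risk.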
Stay tuned for updates on additional sample datasets and new data-driven examples!