Understanding Sample Datasets

This page gives an overview of the sample datasets we’ll use to demonstrate data utility and protection concepts. We plan to start with credit card transaction data, a rich and detailed example, and gradually expand with additional datasets (e.g., demographics, logistics, healthcare) over time.


Why Use Sample Datasets?

Sample (or synthetic) datasets allow us to:

  1. Experiment Safely: Work with realistic data without risking exposure of real personal or financial records.
  2. Demonstrate Techniques: Showcase privacy-enhancing technologies, anonymization methods, and data utility strategies.
  3. Facilitate Collaboration: Share datasets publicly for educational or research purposes, avoiding privacy violations.

Credit Card Transactions

Currently, our primary dataset focuses on credit card transactions, inspired by publicly available resources such as this Kaggle dataset.

Dataset Highlights

  • Size: More than 20 million transactions.
  • Timespan: Covers decades of purchases, capturing long-term consumer behavior.
  • Population: ~2,000 synthetic consumers, primarily U.S.-based but traveling globally.
  • Fraud Labeling: Includes flagged fraudulent transactions that approximate real-world fraud rates.
  • Multiple Cards: Many consumers have multiple active cards.
  • Natural Values: Most columns (except merchant name) retain unobfuscated values, which is especially useful for feature engineering.

In the diagram above:

  • A represents the multi-agent simulation performed by IBM to emulate real-world consumer spending.
  • B shows the synthetic data generation step, which produces realistic yet fabricated transaction records.
  • C represents the final dataset with over 20 million transactions.
  • D indicates the structured CSV format with a detailed schema.
  • E highlights how we plan to use this data to test and demonstrate data utility and privacy methods.

Why This Dataset?

  1. Realistic Patterns: Covers typical purchase behaviors, Merchant Category Codes (MCCs), international transactions, and historical contexts.
  2. Fraud Analysis: The dataset includes fraud labels, enabling use cases such as fraud detection modeling.
  3. Rich Feature Space: Natural values (e.g., geolocations, timestamps) allow for feature engineering in analytics or ML pipelines.
  4. Open and Synthetic: Freely available and synthetic, so no real individuals are at risk.
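
As an illustration of the feature engineering mentioned in point 3, the following is a minimal sketch of two derived features: calendar features from a timestamp and the great-circle distance between two purchase locations. The function names (`time_features`, `haversine_km`) are hypothetical helpers introduced here for illustration; they are not part of the dataset or of any existing pipeline.

```python
import math
from datetime import datetime


def time_features(ts: datetime) -> dict:
    """Derive simple calendar features from a transaction timestamp."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),      # Monday = 0, Sunday = 6
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0  # mean Earth radius; an approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Example: a Saturday-evening purchase, and the distance between two cities.
features = time_features(datetime(2019, 3, 2, 21, 30))
nyc_to_la_km = haversine_km(40.7128, -74.0060, 34.0522, -118.2437)
```

A large distance between two consecutive purchases on the same card within a short time window is one kind of signal a fraud model might use, though which features actually help would have to be established on the data itself.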

Draft Schema (High-Level)

While the actual CSV includes many columns, here’s a simplified look at some key fields:

Column                          Description
transaction_id                  Unique identifier for each purchase
timestamp                       Date & time of the transaction
amount                          Monetary value of the transaction
card_number                     Synthetic credit card identifier
merchant_category_code (mcc)    Category code indicating the type of merchant
is_fraud                        Boolean or numeric flag indicating fraudulent activity
latitude                        Latitudinal coordinate of the purchase location (if any)
longitude                       Longitudinal coordinate of the purchase location (if any)
customer_id                     Synthetic identifier for the consumer

Note: Some columns may have partial or full obfuscation (e.g., merchant names might be replaced with placeholders).
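
To make the draft schema concrete, here is a minimal sketch of reading rows with these fields into typed records using only the Python standard library. Several things are assumptions rather than facts about the actual CSV: the header names (in particular the short name `mcc`), the ISO-style timestamp format, the encoding of `is_fraud` as `0`/`1`, and the use of empty strings for missing coordinates. The inline sample rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, Optional, TextIO


@dataclass
class Transaction:
    transaction_id: str
    timestamp: datetime
    amount: float
    card_number: str
    mcc: str                    # kept as text: a category code, not a quantity
    is_fraud: bool
    latitude: Optional[float]   # None when the purchase has no location
    longitude: Optional[float]
    customer_id: str


def _optional_float(value: str) -> Optional[float]:
    """Parse a float, treating an empty field as missing."""
    value = value.strip()
    return float(value) if value else None


def _parse_flag(value: str) -> bool:
    """Accept either a numeric or a textual fraud flag."""
    return value.strip().lower() in {"1", "true", "yes"}


def read_transactions(handle: TextIO) -> Iterator[Transaction]:
    """Yield one Transaction per CSV row (header names are assumed)."""
    for row in csv.DictReader(handle):
        yield Transaction(
            transaction_id=row["transaction_id"],
            timestamp=datetime.fromisoformat(row["timestamp"]),
            amount=float(row["amount"]),
            card_number=row["card_number"],
            mcc=row["mcc"],
            is_fraud=_parse_flag(row["is_fraud"]),
            latitude=_optional_float(row["latitude"]),
            longitude=_optional_float(row["longitude"]),
            customer_id=row["customer_id"],
        )


# Made-up sample rows in the assumed layout.
SAMPLE_CSV = """transaction_id,timestamp,amount,card_number,mcc,is_fraud,latitude,longitude,customer_id
t1,2019-03-01 08:15:00,12.50,card_001,5411,0,40.71,-74.01,c1
t2,2019-03-01 23:40:00,980.00,card_001,4829,1,,,c1
"""

transactions = list(read_transactions(io.StringIO(SAMPLE_CSV)))
```

Reading row by row through a generator keeps memory use flat, which matters for a file with millions of rows; for the real file, pass an open file handle instead of the in-memory sample.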


Future Expansions

We plan to add more datasets over time, such as:

  • Demographic datasets (e.g., synthetic census-like data)
  • Healthcare (e.g., anonymized patient visits, medical procedures)
  • Retail (e.g., ecommerce transactions, inventory management)
  • Financial markets (e.g., trading logs with obfuscated tickers)

Each dataset will serve different use cases and demonstrate distinct privacy and utility challenges.


Conclusion

Understanding our sample datasets is crucial for grasping how we’ll apply privacy and utility techniques in practical scenarios. Starting with credit card transactions, we’ll show how synthetic yet realistic data can help us explore:

  • De-identification and re-identification risks
  • Fraud detection workflows and machine learning pipelines
  • Advanced privacy mechanisms (e.g., differential privacy, homomorphic encryption)
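
As a small preview of one of these mechanisms, the sketch below releases a differentially private count of fraudulent transactions using the Laplace mechanism. It assumes the privacy unit is a single transaction, so the counting query has sensitivity 1; protecting a whole customer, who may have many transactions, would need a larger noise scale. The function names `laplace_noise` and `dp_count` are hypothetical, and this floating-point sampler is for illustration only, not a hardened implementation.

```python
import random
from typing import Iterable


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(flags: Iterable[bool], epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of True flags under epsilon-differential privacy.

    Assumes adding or removing one record changes the count by at most 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for flag in flags if flag)
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Example: 10 fraudulent transactions out of 100, released with epsilon = 1.
rng = random.Random(0)  # seeded only to make the illustration repeatable
is_fraud_flags = [True] * 10 + [False] * 90
noisy_fraud_count = dp_count(is_fraud_flags, epsilon=1.0, rng=rng)
```

Smaller values of `epsilon` add more noise and give stronger privacy, which is the utility trade-off these examples are meant to explore.
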

Stay tuned for updates on additional sample datasets and new data-driven examples!