Understanding Linkability Risk

Linkability risk is a privacy measure that assesses how easily an attacker could connect records in the synthetic dataset back to the original dataset, effectively identifying individuals even after their explicit identifiers have been removed. A high linkability risk indicates that the synthetic data preserves relationships or patterns from the original data closely enough for an attacker to relink records and compromise privacy.

Pseudo-code Implementation

  1. Split Original Data: Divide the original dataset into two subsets that cover the same individuals but contain different attributes (columns).
  2. Find Nearest Neighbors: For each record in the subsets, locate its nearest neighbors in the synthetic dataset based on shared attributes.
  3. Identify Links: If two records from the split original dataset share a common nearest neighbor in the synthetic dataset, consider them linked.
  4. Calculate Risk: Determine the proportion of successfully linked records. This proportion represents the linkability risk.
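
The four steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the anonymeter implementation: it assumes purely numeric columns, uses Euclidean distance, and attacks every original row. The names nearest_indices and linkability_risk and the parameter k are hypothetical.

```python
import numpy as np


def nearest_indices(queries, targets, k=1):
    """Return, for each query row, the indices of its k closest target rows."""
    # Pairwise Euclidean distances, shape (n_queries, n_targets).
    dists = np.linalg.norm(queries[:, None, :] - targets[None, :, :], axis=2)
    # A stable sort breaks distance ties in favor of the lower row index.
    return np.argsort(dists, axis=1, kind="stable")[:, :k]


def linkability_risk(ori, syn, cols_a, cols_b, k=1):
    """Fraction of original rows whose two attribute halves share a synthetic neighbor."""
    ori = np.asarray(ori, dtype=float)
    syn = np.asarray(syn, dtype=float)
    # Steps 1 and 2: split the columns, then search the synthetic data
    # separately with each half.
    nn_a = nearest_indices(ori[:, cols_a], syn[:, cols_a], k)
    nn_b = nearest_indices(ori[:, cols_b], syn[:, cols_b], k)
    # Step 3: a row counts as linked when its two neighbor sets intersect.
    linked = [bool(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    # Step 4: the risk is the proportion of linked rows.
    return sum(linked) / len(linked)
```

With k greater than 1, a record counts as linked when any of its k nearest neighbors is shared between the two halves, so the measured risk can only stay the same or grow as k increases.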

Overview of the Process

Example with Mocked Results

Original Dataset

ID  age  gender  income
1   25   Male    50K
2   30   Female  60K
3   28   Male    55K

Synthetic Dataset

age  gender  income
24   Male    52K
31   Female  62K
29   Male    53K

Split Original Data

  • Subset 1: (ID, age) → (1, 25), (2, 30), (3, 28)
  • Subset 2: (ID, income) → (1, 50K), (2, 60K), (3, 55K)

Nearest Neighbors in Synthetic Data

  • (1, 25) → (24, Male, 52K)
  • (2, 30) → (31, Female, 62K)
  • (3, 28) → (29, Male, 53K)
  • (1, 50K) → (24, Male, 52K)
  • (2, 60K) → (31, Female, 62K)
  • (3, 55K) → (29, Male, 53K)

Result Interpretation

All three individuals from the original dataset can be linked using the synthetic dataset. For instance, ID 1 has the same nearest neighbor (24, Male, 52K) in both subsets. Thus, the linkability risk is 3/3 = 1.0 (100%), indicating a high privacy risk.
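
The arithmetic of this mocked example can be checked with a few lines of plain Python. This is only a sketch under stated assumptions: incomes are written in thousands, closeness is the absolute difference on a single column, and the helper nearest is a hypothetical name. Age 30 is equally close to the synthetic ages 31 and 29; min keeps the first candidate it sees, which matches the neighbor listed above.

```python
# Mock data from the example; incomes are in thousands.
original = {
    1: {"age": 25, "income": 50},
    2: {"age": 30, "income": 60},
    3: {"age": 28, "income": 55},
}
synthetic = [
    {"age": 24, "income": 52},
    {"age": 31, "income": 62},
    {"age": 29, "income": 53},
]


def nearest(value, column):
    """Index of the synthetic row whose `column` value is closest to `value`."""
    return min(range(len(synthetic)), key=lambda i: abs(synthetic[i][column] - value))


# An individual is linked when the age half and the income half
# point to the same synthetic row.
links = {
    person_id: nearest(row["age"], "age") == nearest(row["income"], "income")
    for person_id, row in original.items()
}
risk = sum(links.values()) / len(links)
print(links, risk)  # {1: True, 2: True, 3: True} 1.0
```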

Deeper Walkthrough with Code Reference

The LinkabilityEvaluator class in the anonymeter library is used to assess the linkability risk. It is initialized with the original dataset, the synthetic dataset, and aux_cols, which specifies the attributes used for the attack. The evaluate() method carries out the risk evaluation.

class LinkabilityEvaluator:
    # ... (rest of the code) ...

    def evaluate(self, n_jobs: int = -2) -> "LinkabilityEvaluator":
        # ... (rest of the code) ...

        self._attack_links = _linkability_attack(
            ori=self._ori,  # Original dataset
            syn=self._syn,  # Synthetic dataset
            n_attacks=self._n_attacks,  # Number of attack attempts
            aux_cols=self._aux_cols,  # Attributes used for linking
            n_neighbors=self._n_neighbors,  # Number of nearest neighbors to consider
            n_jobs=n_jobs,  # Number of parallel jobs
        )

        # ... (rest of the code) ...

Within the evaluate() method, the _linkability_attack function performs the core linkability analysis. It splits the original data, identifies nearest neighbors in the synthetic data, and determines links between records from the split subsets based on shared neighbors. This analysis provides the basis for calculating the linkability risk.