Understanding Linkability Risk
Linkability risk is a privacy measure that assesses whether an attacker can use the synthetic dataset to determine that two pieces of information, each holding different attributes, belong to the same individual in the original dataset, even after explicit identifiers have been removed. A high linkability risk indicates that the synthetic data preserves record-level relationships from the original data closely enough for an attacker to relink records and compromise privacy.
Pseudo-code Implementation
- Split Original Data: Divide the original dataset into two subsets that cover the same individuals but hold different attributes.
- Find Nearest Neighbors: For each record in each subset, locate its nearest neighbors in the synthetic dataset, comparing only the attributes that subset contains.
- Identify Links: If the two partial records of an individual share a nearest neighbor in the synthetic dataset, consider them linked.
- Calculate Risk: Compute the proportion of individuals whose records were successfully linked. This proportion is the linkability risk.
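The four steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the names `nearest_neighbors` and `linkability_risk` are hypothetical, records are assumed to be dictionaries of numeric values, and distance is assumed to be a simple sum of absolute differences. It returns the raw proportion from the last step, with no statistical correction.

```python
from typing import Dict, Sequence, Set

Record = Dict[str, float]


def nearest_neighbors(target: Record, synthetic: Sequence[Record],
                      cols: Sequence[str], k: int) -> Set[int]:
    """Return the indices of the k synthetic rows closest to target on cols."""
    def distance(row: Record) -> float:
        # Assumption: L1 distance over numeric columns only.
        return sum(abs(row[c] - target[c]) for c in cols)

    # sorted() is stable, so ties keep the original row order.
    ranked = sorted(range(len(synthetic)), key=lambda i: distance(synthetic[i]))
    return set(ranked[:k])


def linkability_risk(original: Sequence[Record], synthetic: Sequence[Record],
                     cols_a: Sequence[str], cols_b: Sequence[str],
                     k: int = 1) -> float:
    """Fraction of original records whose two attribute subsets can be linked."""
    linked = 0
    for record in original:
        # Split + search: look the record up once per attribute subset.
        neighbors_a = nearest_neighbors(record, synthetic, cols_a, k)
        neighbors_b = nearest_neighbors(record, synthetic, cols_b, k)
        # Link: both lookups land on at least one common synthetic row.
        if neighbors_a & neighbors_b:
            linked += 1
    # Risk: proportion of successfully linked records.
    return linked / len(original)
```

Raising `k` makes a link easier to establish, because each lookup returns more candidate rows that can overlap.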
Overview of the Process
Example with Mocked Results
Original Dataset
| ID | age | gender | income |
|---|---|---|---|
| 1 | 25 | Male | 50K |
| 2 | 30 | Female | 60K |
| 3 | 28 | Male | 55K |
Synthetic Dataset
| age | gender | income |
|---|---|---|
| 24 | Male | 52K |
| 31 | Female | 62K |
| 29 | Male | 53K |
Split Original Data
- Subset 1: (ID, age) → (1, 25), (2, 30), (3, 28)
- Subset 2: (ID, income) → (1, 50K), (2, 60K), (3, 55K)
Nearest Neighbors in Synthetic Data
- (1, 25) → (24, Male, 52K)
- (2, 30) → (31, Female, 62K)
- (3, 28) → (29, Male, 53K)
- (1, 50K) → (24, Male, 52K)
- (2, 60K) → (31, Female, 62K)
- (3, 55K) → (29, Male, 53K)
Result Interpretation
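Each of the three individuals maps to the same synthetic record in both subsets, so all three can be linked. For instance, ID 1 has the nearest neighbor (24, Male, 52K) whether it is matched on age or on income. Note that age 30 for ID 2 is equidistant from the synthetic ages 29 and 31; the example resolves this tie to (31, Female, 62K). The linkability risk is therefore 3/3 = 1.0 (100%), the highest possible value, indicating a high privacy risk.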
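The mocked result can be reproduced with a few lines of Python. The sketch below hard-codes the two tables, drops gender because neither subset uses it, expresses income in thousands, and assumes a single nearest neighbor chosen by absolute difference. The helper name `nearest` is illustrative only.

```python
# Mocked tables from the example (income in thousands, gender omitted).
original = {1: (25, 50), 2: (30, 60), 3: (28, 55)}  # ID -> (age, income)
synthetic = [(24, 52), (31, 62), (29, 53)]           # (age, income)

AGE, INCOME = 0, 1


def nearest(value, column):
    """Index of the synthetic row closest to value in one column.

    min() keeps the first row on a tie, so age 30 (equidistant from
    31 and 29) resolves to the row listed first, (31, 62).
    """
    return min(range(len(synthetic)),
               key=lambda i: abs(synthetic[i][column] - value))


# A record is linked when its age lookup and income lookup hit the same row.
links = {
    rec_id: nearest(age, AGE) == nearest(income, INCOME)
    for rec_id, (age, income) in original.items()
}
risk = sum(links.values()) / len(links)

print(links)  # {1: True, 2: True, 3: True}
print(risk)   # 1.0
```

If the tie for ID 2 were resolved toward age 29 instead, that record would no longer be linked and the risk would drop to 2/3, which shows how sensitive such a small example is to tie handling.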
Deeper Walkthrough with Code Reference
The LinkabilityEvaluator class in the anonymeter library assesses linkability risk. It is initialized with the original and synthetic datasets and with aux_cols, which specifies the two groups of attributes used for the attack. The evaluate() method carries out the risk evaluation.
```python
class LinkabilityEvaluator:
    # ... (rest of the code) ...
    def evaluate(self, n_jobs: int = -2) -> "LinkabilityEvaluator":
        # ... (rest of the code) ...
        self._attack_links = _linkability_attack(
            ori=self._ori,                  # Original dataset
            syn=self._syn,                  # Synthetic dataset
            n_attacks=self._n_attacks,      # Number of attack attempts
            aux_cols=self._aux_cols,        # Attributes used for linking
            n_neighbors=self._n_neighbors,  # Number of nearest neighbors to consider
            n_jobs=n_jobs,                  # Number of parallel jobs
        )
        # ... (rest of the code) ...
```
Within the evaluate() method, the \_linkability_attack function performs the core linkability analysis. It splits the original data, identifies nearest neighbors in the synthetic data, and determines links between records from the split subsets based on shared neighbors. This analysis provides the basis for calculating the linkability risk.
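A call to the evaluator might look like the following sketch. It assumes the constructor accepts keyword arguments that mirror the private fields shown in the excerpt (ori, syn, aux_cols, n_attacks, n_neighbors) and that the evaluator exposes a risk() method for reading the result; verify both against the installed anonymeter version. The wrapper `run_linkability` is a hypothetical helper, not part of the library.

```python
def run_linkability(ori, syn, aux_cols, n_attacks, n_neighbors=1, n_jobs=-2):
    """Run the linkability attack and return the estimated risk.

    ori, syn: pandas DataFrames with the same columns.
    aux_cols: tuple of two column-name lists, one per attribute subset.
    """
    # Imported lazily so this sketch loads even where anonymeter is missing.
    from anonymeter.evaluators import LinkabilityEvaluator

    evaluator = LinkabilityEvaluator(
        ori=ori,
        syn=syn,
        aux_cols=aux_cols,
        n_attacks=n_attacks,
        n_neighbors=n_neighbors,
    )
    # evaluate() runs _linkability_attack as shown in the excerpt.
    evaluator.evaluate(n_jobs=n_jobs)
    # Assumption: risk() reports the risk estimate derived from the attack.
    return evaluator.risk()
```

For the mocked tables, with income stored as a number, the call would be `run_linkability(ori, syn, aux_cols=(["age"], ["income"]), n_attacks=len(ori))`.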