Understanding Linkability Risk

Linkability risk is a privacy measure that assesses how easily an attacker can use the synthetic dataset to connect records that belong to the same individual in the original data, re-identifying people even after their explicit identifiers have been removed. A high linkability risk indicates that the synthetic data preserves record-level relationships from the original data closely enough for an attacker to relink records and compromise privacy.

Pseudo-code Implementation

Split Original Data: Divide the original dataset into two subsets that cover the same individuals but contain different attributes.
Find Nearest Neighbors: For each record in each subset, locate its nearest neighbors in the synthetic dataset, comparing only the attributes that the subset shares with the synthetic data.
Identify Links: If the two partial records of an individual share a nearest neighbor in the synthetic dataset, consider them linked.
Calculate Risk: Divide the number of successfully linked records by the total number of records attacked. This proportion is the linkability risk.
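The four steps above can be sketched in plain Python. This is a minimal illustration, not anonymeter's implementation: it assumes purely numeric attributes, Euclidean distance, and a single nearest neighbor, and the helper names (nearest_index, project, linkability_risk) are hypothetical.

```python
from math import dist
from typing import List, Sequence

Record = Sequence[float]


def nearest_index(query: Record, candidates: Sequence[Record]) -> int:
    """Index of the candidate closest to `query` (ties go to the first candidate)."""
    return min(range(len(candidates)), key=lambda i: dist(query, candidates[i]))


def project(rows: Sequence[Record], cols: Sequence[int]) -> List[List[float]]:
    """Keep only the given column positions of every row."""
    return [[row[c] for c in cols] for row in rows]


def linkability_risk(
    ori: Sequence[Record],
    syn: Sequence[Record],
    cols_a: Sequence[int],
    cols_b: Sequence[int],
) -> float:
    # Step 1: split the original data into two attribute subsets.
    ori_a, ori_b = project(ori, cols_a), project(ori, cols_b)
    syn_a, syn_b = project(syn, cols_a), project(syn, cols_b)

    links = 0
    for rec_a, rec_b in zip(ori_a, ori_b):
        # Step 2: nearest synthetic neighbor of each half of the record.
        # Step 3: the halves are linked if they point to the same synthetic row.
        if nearest_index(rec_a, syn_a) == nearest_index(rec_b, syn_b):
            links += 1

    # Step 4: proportion of successfully linked records.
    return links / len(ori)
```

Calling linkability_risk(ori, syn, cols_a=[0], cols_b=[1]) on two-column data treats the first column as one subset and the second column as the other.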

Overview of the Process

Example with Mocked Results

Original Dataset

ID	age	gender	income
1	25	Male	50K
2	30	Female	60K
3	28	Male	55K

Synthetic Dataset

age	gender	income
24	Male	52K
31	Female	62K
29	Male	53K

Split Original Data

Subset 1: (ID, age) → (1, 25), (2, 30), (3, 28)
Subset 2: (ID, income) → (1, 50K), (2, 60K), (3, 55K)

Nearest Neighbors in Synthetic Data

(1, 25) → (24, Male, 52K)
(2, 30) → (31, Female, 62K)
(3, 28) → (29, Male, 53K)
(1, 50K) → (24, Male, 52K)
(2, 60K) → (31, Female, 62K)
(3, 55K) → (29, Male, 53K)

Result Interpretation

All three individuals from the original dataset can be linked using the synthetic dataset. For instance, ID 1 has the same nearest neighbor (24, Male, 52K) in both subsets. Thus, the linkability risk is 3/3 = 1.0 (100%), indicating a high privacy risk.
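The mocked example can be reproduced with a few lines of Python. This sketch assumes absolute difference as the distance on each single attribute and encodes income in thousands. Gender is left out because neither subset in the example uses it.

```python
# Original records: ID -> (age, income in thousands)
original = {1: (25, 50), 2: (30, 60), 3: (28, 55)}

# Synthetic records: (age, income in thousands), in table order
synthetic = [(24, 52), (31, 62), (29, 53)]


def nearest(value: float, column: int) -> int:
    """Index of the synthetic record whose `column` value is closest to `value`.

    Ties go to the first record in table order. Age 30 is equally close to
    31 and 29, so it resolves to 31, matching the mocked results.
    """
    return min(range(len(synthetic)), key=lambda i: abs(synthetic[i][column] - value))


# An individual is linked when the age-only and income-only lookups
# land on the same synthetic record.
linked_ids = [
    pid
    for pid, (age, income) in original.items()
    if nearest(age, 0) == nearest(income, 1)
]

risk = len(linked_ids) / len(original)
print(linked_ids, risk)  # [1, 2, 3] 1.0
```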

Deeper Walkthrough with Code Reference

The LinkabilityEvaluator class in the anonymeter library assesses the linkability risk. It is initialized with the original dataset, the synthetic dataset, and aux_cols, which specifies the attributes used for the attack. The evaluate() method carries out the risk evaluation.

class LinkabilityEvaluator:
    # ... (rest of the code) ...

    def evaluate(self, n_jobs: int = -2) -> "LinkabilityEvaluator":
        # ... (rest of the code) ...

        self._attack_links = _linkability_attack(
            ori=self._ori,  # Original dataset
            syn=self._syn,  # Synthetic dataset
            n_attacks=self._n_attacks,  # Number of attack attempts
            aux_cols=self._aux_cols,  # Attributes used for linking
            n_neighbors=self._n_neighbors,  # Number of nearest neighbors to consider
            n_jobs=n_jobs,  # Number of parallel jobs
        )

        # ... (rest of the code) ...

Within the evaluate() method, the _linkability_attack function performs the core linkability analysis. It splits the original data, identifies nearest neighbors in the synthetic data, and determines links between records from the split subsets based on shared neighbors. This analysis provides the basis for calculating the linkability risk.
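When n_neighbors is greater than 1, one plausible reading of "shared neighbors" is that the two halves of a record are linked if their sets of nearest synthetic neighbors overlap. The sketch below illustrates that reading only. It is an assumption, not the library's code, and the helper names (k_nearest, count_links) are hypothetical.

```python
from math import dist
from typing import Sequence, Set

Record = Sequence[float]


def k_nearest(query: Record, candidates: Sequence[Record], k: int) -> Set[int]:
    """Indices of the k candidates closest to `query` (Euclidean distance)."""
    order = sorted(range(len(candidates)), key=lambda i: dist(query, candidates[i]))
    return set(order[:k])


def count_links(
    ori_a: Sequence[Record],
    ori_b: Sequence[Record],
    syn_a: Sequence[Record],
    syn_b: Sequence[Record],
    n_neighbors: int = 1,
) -> int:
    """Count records whose two halves share at least one synthetic neighbor."""
    links = 0
    for rec_a, rec_b in zip(ori_a, ori_b):
        shared = k_nearest(rec_a, syn_a, n_neighbors) & k_nearest(rec_b, syn_b, n_neighbors)
        if shared:
            links += 1
    return links
```

Under this reading, a larger n_neighbors makes the attack more permissive: a record that is not linked with one neighbor may become linked once more neighbors are considered.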
