Inference-risk

Understanding Inference Risk

Inference risk is a privacy metric that measures how well an attacker can deduce sensitive information about individuals in the original dataset by using the synthetic dataset. A high inference risk means that the synthetic data, even though it does not contain the original records themselves, preserves patterns or correlations that let an attacker make accurate inferences about the protected information.

Pseudo-code Implementation

  1. Identify Target Records: Select records from the original dataset for the attacker to target.
  2. Find Nearest Neighbors: For each target record, locate its nearest neighbor in the synthetic dataset based on non-sensitive attributes.
  3. Infer Sensitive Information: Use the value of the sensitive attribute recorded for the nearest neighbor in the synthetic dataset as the guess for the target record's sensitive attribute.
  4. Calculate Risk: Determine the proportion of correct inferences made by the attacker. This proportion represents the inference risk.
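The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the anonymeter implementation: the function names (distance, inference_risk), the toy records, and the simple distance (absolute difference for numbers, 0/1 mismatch for categories, with no scaling) are all hypothetical.

```python
from typing import Dict, List, Sequence

Record = Dict[str, object]


def distance(a: Record, b: Record, aux_cols: Sequence[str]) -> float:
    """Hypothetical mixed-type distance over the non-sensitive columns.

    Numbers contribute their absolute difference, everything else a 0/1
    mismatch. A real attack would normally rescale numeric columns first.
    """
    total = 0.0
    for col in aux_cols:
        x, y = a[col], b[col]
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += abs(x - y)
        else:
            total += 0.0 if x == y else 1.0
    return total


def inference_risk(
    targets: List[Record],
    synthetic: List[Record],
    aux_cols: Sequence[str],
    secret: str,
) -> float:
    """Share of targets whose secret is guessed correctly (steps 2 to 4)."""
    if not targets or not synthetic:
        raise ValueError("targets and synthetic must both be non-empty")
    correct = 0
    for target in targets:
        # Step 2: nearest synthetic record on the non-sensitive columns.
        neighbor = min(synthetic, key=lambda s: distance(target, s, aux_cols))
        # Step 3: the neighbor's secret value is the attacker's guess.
        guess = neighbor[secret]
        # Step 4: count the guess if it matches the target's true value.
        if guess == target[secret]:
            correct += 1
    return correct / len(targets)


# Step 1: hypothetical target records taken from an "original" dataset.
targets = [
    {"age": 40, "city": "Oslo", "plan": "gold"},
    {"age": 22, "city": "Rome", "plan": "basic"},
]
# Hypothetical synthetic records, including the sensitive column.
synthetic = [
    {"age": 41, "city": "Oslo", "plan": "gold"},
    {"age": 23, "city": "Rome", "plan": "gold"},
]

risk = inference_risk(targets, synthetic, aux_cols=["age", "city"], secret="plan")
print(f"inference risk: {risk:.2f}")
```

In this toy data the first guess is right and the second is wrong, so one of the two inferences succeeds.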

Overview of the Process

Example with Mocked Results

Original Dataset

age  gender  income
25   Male    50K
30   Female  60K
28   Male    55K

Synthetic Dataset

age  gender
24   Male
31   Female
29   Male

Target Records

  • (25, Male)
  • (30, Female)

Nearest Neighbors in Synthetic Data

  • (25, Male) → (24, Male)
  • (30, Female) → (31, Female)
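The neighbor assignments above can be checked with a few lines of Python using only the ages and genders from the example. The distance used here (age gap plus 1 for a gender mismatch) is an assumption chosen for illustration; the source does not say which distance the attack uses.

```python
# Non-sensitive columns from the example tables: (age, gender).
synthetic = [(24, "Male"), (31, "Female"), (29, "Male")]
targets = [(25, "Male"), (30, "Female")]


def distance(a, b):
    """Hypothetical distance: age gap plus a penalty of 1 if genders differ."""
    age_gap = abs(a[0] - b[0])
    gender_gap = 0 if a[1] == b[1] else 1
    return age_gap + gender_gap


# For each target, pick the synthetic record with the smallest distance.
nearest = {t: min(synthetic, key=lambda s: distance(t, s)) for t in targets}

for target, neighbor in nearest.items():
    print(target, "->", neighbor)
```

With this distance, (25, Male) is 1 away from (24, Male) but 4 away from (29, Male), which is why the former is chosen.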

Inferred and Actual Income

age  gender  inferred_income  actual_income
25   Male    50K              50K
30   Female  60K              60K

Result Interpretation

The attacker correctly inferred the income of both target individuals. For instance, the nearest neighbor of (25, Male) in the synthetic dataset is (24, Male). Assuming that similar individuals have similar incomes, the attacker takes the income recorded for that synthetic neighbor as the guess (the synthetic table above lists only the non-sensitive columns). In this mocked example the guess is 50K, which matches the actual value. The inference risk is therefore 2/2 = 1.0 (100%), indicating a high privacy risk.

Deeper Walkthrough with Code Reference

The InferenceEvaluator class in the anonymeter library assesses inference risk. It is initialized with the original and synthetic datasets, aux_cols (the non-sensitive attributes used for inference), and secret (the sensitive attribute to be inferred). Its evaluate() method runs the risk evaluation.

class InferenceEvaluator:
    # ... (rest of the code) ...

    def evaluate(self, n_jobs: int = -2) -> "InferenceEvaluator":
        # ... (rest of the code) ...

        self._n_success = _run_attack(
            target=self._ori,             # Original dataset
            syn=self._syn,                # Synthetic dataset
            n_attacks=self._n_attacks,    # Number of attack attempts
            aux_cols=self._aux_cols,      # Non-sensitive attributes used for inference
            secret=self._secret,          # Sensitive attribute to be inferred
            n_jobs=n_jobs,                # Number of parallel jobs
            naive=False,                  # Flag for naive attack (random guessing)
            regression=self._regression,  # Flag for regression-type inference
        )

        # ... (rest of the code) ...

Inside the evaluate() method, the _run_attack function performs the core inference analysis. It finds each target record's nearest neighbor in the synthetic dataset based on aux_cols, then uses that neighbor's secret value as the guess for the target's sensitive attribute. The number of successful guesses forms the basis for calculating the inference risk.
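The naive and regression flags in the call above suggest two variations on the attack: a baseline in which the attacker guesses at random, and a success check suited to numeric secrets. The sketch below only illustrates what such flags could control. It is not anonymeter's code, and the helper names (is_correct, naive_guess), the tolerance parameter, and its default value are hypothetical.

```python
import random
from typing import List


def is_correct(guess, truth, regression: bool, tolerance: float = 0.05) -> bool:
    """Hypothetical success check for a single guess.

    For a numeric (regression-type) secret, a guess counts as correct when
    its relative error is within `tolerance`; otherwise an exact match is
    required.
    """
    if regression:
        return abs(guess - truth) <= tolerance * abs(truth)
    return guess == truth


def naive_guess(synthetic_secrets: List, rng: random.Random):
    """Hypothetical naive attack: pick a secret value at random from the
    synthetic data, ignoring the target's non-sensitive attributes."""
    return rng.choice(synthetic_secrets)


# A near miss counts for a numeric secret but not for a categorical one.
near_miss_numeric = is_correct(51_000, 50_000, regression=True)
near_miss_categorical = is_correct("51K", "50K", regression=False)

# The random baseline draws from whatever secret values the synthetic data holds.
rng = random.Random(0)
baseline_guess = naive_guess(["low", "mid", "high"], rng)

print(near_miss_numeric, near_miss_categorical, baseline_guess)
```

Comparing the success rate of the nearest-neighbor attack with that of a random baseline shows how much of the attacker's success actually comes from the synthetic data rather than from chance.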