
Understanding Singling Out Risk

Singling out risk is a privacy metric that assesses the likelihood of an attacker identifying a specific individual from the original dataset using only the synthetic dataset. A high singling out risk suggests that the synthetic data retains unique characteristics or outliers that could be exploited to pinpoint individuals from the original data, thus jeopardizing their privacy.

Pseudo-code Implementation

Generate Queries: Create a set of queries (e.g., "age < 25 AND gender == 'Female'") based on the synthetic dataset, targeting unique combinations of attributes.
Evaluate Queries: Run these queries against the original dataset.
Calculate Risk: Compute the proportion of successful queries, meaning those that return exactly one record from the original data. In this simplified view, that proportion is the singling out risk.
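The three steps above can be sketched in plain Python. This is a minimal illustration, not the anonymeter implementation: the function names, the restriction to simple equality queries on values that are unique in the synthetic data, and the toy `city`/`job` records are all assumptions made for the example.

```python
from collections import Counter


def generate_queries(synthetic):
    """Step 1: one (column, value) equality query per value that is unique in the synthetic data."""
    queries = []
    for col in synthetic[0].keys():
        counts = Counter(row[col] for row in synthetic)
        for value, n in counts.items():
            if n == 1:
                queries.append((col, value))
    return queries


def evaluate_queries(original, queries):
    """Step 2: keep only the queries that match exactly one original record."""
    successful = []
    for col, value in queries:
        n_matches = sum(1 for row in original if row[col] == value)
        if n_matches == 1:
            successful.append((col, value))
    return successful


def singling_out_rate(original, synthetic):
    """Step 3: share of generated queries that single out an original record."""
    queries = generate_queries(synthetic)
    if not queries:
        return 0.0
    return len(evaluate_queries(original, queries)) / len(queries)


# Made-up toy records, used only to exercise the functions.
synthetic = [
    {"city": "Berlin", "job": "nurse"},
    {"city": "Berlin", "job": "teacher"},
    {"city": "Hamburg", "job": "teacher"},
]
original = [
    {"city": "Berlin", "job": "nurse"},
    {"city": "Hamburg", "job": "teacher"},
    {"city": "Hamburg", "job": "clerk"},
]

queries = generate_queries(synthetic)
rate = singling_out_rate(original, synthetic)
print(queries)
print(rate)
```

With this toy data, two queries are generated (`city == Hamburg` and `job == nurse`). Only the second matches exactly one original record, so the rate is 0.5.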

Overview of the Process

Example with Mocked Results

Original Dataset

age	gender
25	Male
30	Female
28	Male
40	Male

Synthetic Dataset

age	gender
24	Male
31	Female
29	Male
41	Male

Queries Generated from Synthetic Data

"age <= 25"
"gender == 'Female'"

Evaluation on Original Data

"age <= 25" → Matches 1 record (age: 25, gender: Male)
"gender == 'Female'" → Matches 1 record (age: 30, gender: Female)

Result Interpretation

In this scenario, both queries, each built from the synthetic data alone, match exactly one record in the original dataset. The singling out risk is therefore 2/2 = 1.0 (100%). This signifies a high privacy risk, since every tested query isolates an individual in the original data.
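The counts in this example can be checked with a few lines of plain Python. This is only a sketch: the queries are written as hand-made predicates rather than generated automatically, and the age threshold is expressed as `age <= 25` so that it includes the original record with age 25 (a strict `age < 25` would match no original record and would count as a failed query).

```python
# Original dataset from the example above.
original = [
    {"age": 25, "gender": "Male"},
    {"age": 30, "gender": "Female"},
    {"age": 28, "gender": "Male"},
    {"age": 40, "gender": "Male"},
]

# Each query is a label plus a predicate over one record.
queries = {
    "age <= 25": lambda row: row["age"] <= 25,
    "gender == 'Female'": lambda row: row["gender"] == "Female",
}

# Count how many original records each query matches.
match_counts = {
    label: sum(1 for row in original if predicate(row))
    for label, predicate in queries.items()
}

# A query is successful when it matches exactly one record.
successful = [label for label, n in match_counts.items() if n == 1]
risk = len(successful) / len(queries)

print(match_counts)
print(risk)
```

Both queries match exactly one record, so `risk` evaluates to 1.0.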

Deeper Walkthrough with Code Reference

The SinglingOutEvaluator class in the anonymeter library handles the assessment of singling out risk. Upon initialization, it requires the original and synthetic datasets. The evaluate() method then performs the risk assessment. The core logic of generating and evaluating singling out queries is encapsulated within this method.

def evaluate(self, mode: str = "multivariate") -> "SinglingOutEvaluator":
    # ... (rest of the code) ...

    queries = _generate_singling_out_queries(
        df=self._syn,  # Using the synthetic dataset
        n_attacks=self._n_attacks,  # Number of attack attempts
        n_cols=self._n_cols,  # Number of columns used in queries
        mode=mode,  # Mode of generating queries
        max_attempts=self._max_attempts,  # Maximum attempts to generate queries
    )

    self._queries = _evaluate_queries(df=self._ori, queries=queries)  # Run queries against original data
    # ... (rest of the code) ...

The _generate_singling_out_queries function (called in the snippet above) creates the singling out queries. Depending on the mode parameter, it uses either a "univariate" or a "multivariate" approach. In "univariate" mode, each query targets a single attribute, looking for rare or unique values. In "multivariate" mode, each query combines conditions on several attributes to single out a record.
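To make the multivariate idea concrete, the sketch below builds one conjunctive query from a randomly chosen synthetic record. It is an illustration under stated assumptions, not the library's code: the helper names (`multivariate_query`, `count_matches`) are hypothetical, and the rule for numeric columns (use `>=` when the value is at or above the column median, otherwise `<=`) is just one plausible heuristic for steering the query toward a sparse region of the data.

```python
import random
from statistics import median

# Comparison operators supported by this sketch.
OPS = {
    "==": lambda a, b: a == b,
    "<=": lambda a, b: a <= b,
    ">=": lambda a, b: a >= b,
}


def multivariate_query(rows, n_cols, rng):
    """Build a list of (column, operator, value) conditions from one random record."""
    target = rng.choice(rows)
    columns = rng.sample(sorted(target), n_cols)
    conditions = []
    for col in columns:
        value = target[col]
        if isinstance(value, (int, float)):
            # Assumed heuristic: point the inequality away from the median.
            col_median = median(row[col] for row in rows)
            op = ">=" if value >= col_median else "<="
        else:
            op = "=="
        conditions.append((col, op, value))
    return conditions


def count_matches(rows, conditions):
    """Number of records that satisfy every condition of the query."""
    return sum(
        all(OPS[op](row[col], value) for col, op, value in conditions)
        for row in rows
    )


# Synthetic dataset from the mocked example.
synthetic = [
    {"age": 24, "gender": "Male"},
    {"age": 31, "gender": "Female"},
    {"age": 29, "gender": "Male"},
    {"age": 41, "gender": "Male"},
]

rng = random.Random(0)
query = multivariate_query(synthetic, n_cols=2, rng=rng)
n_syn = count_matches(synthetic, query)
print(query, n_syn)
```

A generator along these lines would keep the query only if `n_syn == 1`, that is, if it singles out a record in the synthetic data, and would otherwise draw a new record and try again. A univariate generator differs only in that each query holds a single condition.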

After the queries are generated, the _evaluate_queries function runs them against the original dataset. It returns the list of successful queries, which is then used to compute the singling out risk.
