Understanding Singling Out Risk
Singling out risk is a privacy metric that assesses the likelihood of an attacker identifying a specific individual from the original dataset using only the synthetic dataset. A high singling out risk suggests that the synthetic data retains unique characteristics or outliers that could be exploited to pinpoint individuals from the original data, thus jeopardizing their privacy.
Pseudo-code Implementation
- Generate Queries: Create a set of queries (e.g., "age < 25 AND gender == 'Female'") based on the synthetic dataset, targeting unique combinations of attributes.
- Evaluate Queries: Run these queries against the original dataset.
- Calculate Risk: Determine the proportion of successful queries (those returning a single record from the original data). This proportion represents the singling out risk.
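The three steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: queries are exact-match conditions on attribute combinations that occur once in the synthetic data, and the helper names (`generate_queries`, `count_matches`, `singling_out_risk`) and the toy records are hypothetical, not part of any library.

```python
from itertools import combinations


def generate_queries(synthetic, n_cols=2):
    """Step 1: build equality queries from attribute combinations
    that occur exactly once in the synthetic data."""
    columns = sorted(synthetic[0])
    queries = []
    for cols in combinations(columns, n_cols):
        counts = {}
        for row in synthetic:
            key = tuple(row[c] for c in cols)
            counts[key] = counts.get(key, 0) + 1
        for key, count in counts.items():
            if count == 1:  # unique combination in the synthetic data
                queries.append(dict(zip(cols, key)))
    return queries


def count_matches(records, query):
    """Step 2: count the records that satisfy every condition of a query."""
    return sum(all(row[c] == v for c, v in query.items()) for row in records)


def singling_out_risk(original, queries):
    """Step 3: share of queries that match exactly one original record."""
    if not queries:
        return 0.0
    successes = sum(count_matches(original, q) == 1 for q in queries)
    return successes / len(queries)


# Toy data (hypothetical), with ages already grouped into bands.
synthetic = [
    {"age_band": "20-29", "gender": "Female"},
    {"age_band": "20-29", "gender": "Male"},
    {"age_band": "20-29", "gender": "Male"},
    {"age_band": "30-39", "gender": "Male"},
]
original = [
    {"age_band": "20-29", "gender": "Female"},
    {"age_band": "20-29", "gender": "Male"},
    {"age_band": "30-39", "gender": "Male"},
    {"age_band": "30-39", "gender": "Male"},
]

queries = generate_queries(synthetic)
risk = singling_out_risk(original, queries)
print(queries)
print(risk)
```

Here two combinations are unique in the synthetic data, but only one of them matches exactly one original record; the other matches two, so it does not count as a success.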
Example with Mocked Results
Original Dataset
| age | gender |
|---|---|
| 25 | Male |
| 30 | Female |
| 28 | Male |
| 40 | Male |
Synthetic Dataset
| age | gender |
|---|---|
| 24 | Male |
| 31 | Female |
| 29 | Male |
| 41 | Male |
Queries Generated from Synthetic Data
- "age <= 25"
- "gender == 'Female'"
Evaluation on Original Data
- "age <= 25" → Matches 1 record (age: 25, gender: Male)
- "gender == 'Female'" → Matches 1 record (age: 30, gender: Female)
Result Interpretation
In this scenario, each query matches exactly one record in the original dataset, so both queries, although built only from the synthetic data, single out an individual. The singling out risk is therefore 2/2 = 1.0 (100%). This signals a high privacy risk: every tested query isolated an individual in the original data.
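The arithmetic of this example can be checked with a few lines of plain Python. In this sketch the queries are written as hypothetical predicate functions rather than query strings, and the age query uses an inclusive bound (`age <= 25`), since the youngest original record is exactly 25 and a strict `age < 25` would match nothing.

```python
# Original dataset from the example above.
original = [
    {"age": 25, "gender": "Male"},
    {"age": 30, "gender": "Female"},
    {"age": 28, "gender": "Male"},
    {"age": 40, "gender": "Male"},
]

# Queries as (label -> predicate) pairs; the predicates are an
# illustrative stand-in for string queries.
example_queries = {
    "age <= 25": lambda r: r["age"] <= 25,
    "gender == 'Female'": lambda r: r["gender"] == "Female",
}

# Count how many original records each query matches.
match_counts = {
    label: sum(1 for r in original if predicate(r))
    for label, predicate in example_queries.items()
}

# A query is successful only if it matches exactly one record.
successful = [label for label, n in match_counts.items() if n == 1]
example_risk = len(successful) / len(example_queries)

# For comparison: a strict bound misses the 25-year-old record.
strict_matches = sum(1 for r in original if r["age"] < 25)

print(match_counts)
print(example_risk)
print(strict_matches)
```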
Deeper Walkthrough with Code Reference
The SinglingOutEvaluator class in the anonymeter library handles the assessment of singling out risk. Upon initialization, it requires the original and synthetic datasets. The evaluate() method then performs the risk assessment. The core logic of generating and evaluating singling out queries is encapsulated within this method.
```python
def evaluate(self, mode: str = "multivariate") -> "SinglingOutEvaluator":
    # ... (rest of the code) ...
    queries = _generate_singling_out_queries(
        df=self._syn,  # Using the synthetic dataset
        n_attacks=self._n_attacks,  # Number of attack attempts
        n_cols=self._n_cols,  # Number of columns used in queries
        mode=mode,  # Mode of generating queries
        max_attempts=self._max_attempts,  # Maximum attempts to generate queries
    )
    self._queries = _evaluate_queries(df=self._ori, queries=queries)  # Run queries against original data
    # ... (rest of the code) ...
```
The \_generate_singling_out_queries function, called in the code snippet above, creates the singling out queries. It follows either a "univariate" or a "multivariate" strategy, depending on the mode parameter. In "univariate" mode, each query targets a single attribute, looking for rare or unique values. In "multivariate" mode, each query combines conditions on several attributes (n_cols of them) to single out a record.
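To make the difference between the two modes concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the functions `univariate_queries` and `multivariate_queries` are hypothetical, queries are simplified to exact-match conditions, and the parameters `n_cols`, `n_attacks`, and `max_attempts` only mirror the names seen in the excerpt above.

```python
import random
from collections import Counter


def univariate_queries(rows):
    """Single-attribute queries: values that occur exactly once in a column."""
    queries = []
    for col in rows[0]:
        counts = Counter(row[col] for row in rows)
        queries += [{col: value} for value, n in counts.items() if n == 1]
    return queries


def multivariate_queries(rows, n_cols, n_attacks, max_attempts, seed=0):
    """Multi-attribute queries: copy n_cols values from a random record and
    keep the combination if it matches exactly one record in rows."""
    rng = random.Random(seed)
    columns = list(rows[0])
    queries = []
    for _ in range(max_attempts):
        if len(queries) >= n_attacks:
            break
        row = rng.choice(rows)
        cols = rng.sample(columns, n_cols)
        query = {c: row[c] for c in cols}
        matches = sum(all(r[c] == v for c, v in query.items()) for r in rows)
        if matches == 1 and query not in queries:
            queries.append(query)
    return queries


# Synthetic dataset from the example earlier in this page.
syn = [
    {"age": 24, "gender": "Male"},
    {"age": 31, "gender": "Female"},
    {"age": 29, "gender": "Male"},
    {"age": 41, "gender": "Male"},
]

uni = univariate_queries(syn)
multi = multivariate_queries(syn, n_cols=2, n_attacks=3, max_attempts=50)
print(uni)
print(multi)
```

In the univariate case every age value is unique, as is the single "Female" entry, so each yields a one-attribute query. In the multivariate case the sketch stops once `n_attacks` distinct queries are found or `max_attempts` is exhausted, whichever comes first.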
Following the creation of queries, the \_evaluate_queries function is used to execute these queries against the original dataset. This function returns a list of successful queries, which are then used to compute the singling out risk.