Warning

THIS INFORMATION IS FOR EDUCATIONAL PURPOSES ONLY AND IS NOT LEGAL ADVICE. IT IS NOT DEFINITIVE AND IS MEANT TO BE ILLUSTRATIVE. ALWAYS CONSULT WITH LEGAL, RISK, AND COMPLIANCE EXPERTS FOR GUIDANCE SPECIFIC TO YOUR SITUATION.

The History of Quantifying Re-identification Risk

Historically, obtaining prescriptive guidance on exact, quantified re-identification risk levels has been extremely difficult. The common refrain is that acceptable risk "depends" on numerous factors, making it challenging to provide universal numerical thresholds.

Why Specific Quantifications Are Rare

Several factors contribute to the reluctance to publicly define specific risk thresholds:

  • Legal Liability: Organizations and regulatory bodies may be hesitant to state specific numbers for fear of creating legal liability if those thresholds are not met or are later deemed insufficient. A specific number could be interpreted as a guarantee, even though re-identification risk is probabilistic and context-dependent.
  • Context-Dependency: Re-identification risk is heavily influenced by factors like data sensitivity, dataset size, available technology, potential attackers' capabilities, and the broader data environment. A "one-size-fits-all" number is rarely appropriate.
  • Evolving Technology: The landscape of re-identification techniques is constantly evolving. A risk threshold considered acceptable today might be inadequate tomorrow due to technological advancements.
  • Auxiliary Data: Attackers often combine datasets, making risk assessment complex and situational.
  • Methodological Challenges: Precisely quantifying re-identification risk is methodologically complex. It often involves assumptions about attacker behavior and background knowledge that are difficult to validate.
  • No Universal Metric: The "reasonableness" of re-identification risk varies based on data sensitivity, attacker motivation, and available resources.

Known Resources and Examples

Despite these challenges, several resources provide valuable insights into minimizing re-identification risk:

  • Handbook on Statistical Disclosure Control for Outputs (Hundepool et al., 2010): This seminal work emphasizes minimizing disclosure risk to a "very low" or "negligible" level. It focuses on Statistical Disclosure Control (SDC) methods like k-anonymity, l-diversity, and t-closeness, which, when properly applied, can help achieve very low risk levels.
  • Statistical Disclosure Limitation (SDL): Principles and Practice of Statistical Data Protection (Willenborg & De Waal, 2000): This book highlights minimal risk as a guiding principle for statistical data protection.
  • The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014): This work introduces differential privacy, a technique that provides strong mathematical guarantees about privacy loss, often translating to very low re-identification risks.
  • NISTIR 8053 "De-Identification of Personal Information": Discusses the importance of minimizing re-identification risk and highlights differential privacy as a promising approach.
  • Guidance from Data Protection Authorities: Organizations like the ICO (UK), CNIL (France), and the Article 29 Working Party (pre-GDPR) have provided guidance emphasizing the need for very low re-identification risks, although they typically avoid specific numerical targets.
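Several of the resources above frame risk through k-anonymity: if every record shares its quasi-identifier values with at least k-1 other records, an attacker matching on those values succeeds with probability at most 1/k. A minimal sketch of that calculation (the field names and records here are hypothetical):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same
    quasi-identifier values (the dataset's k)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def max_reidentification_risk(records, quasi_identifiers):
    """Upper bound on per-record matching risk: 1/k."""
    return 1.0 / k_anonymity(records, quasi_identifiers)

records = [
    {"age_range": "40-49", "region": "North", "diagnosis": "diabetes"},
    {"age_range": "40-49", "region": "North", "diagnosis": "asthma"},
    {"age_range": "50-59", "region": "South", "diagnosis": "diabetes"},
    {"age_range": "50-59", "region": "South", "diagnosis": "flu"},
]
print(k_anonymity(records, ["age_range", "region"]))                # 2
print(max_reidentification_risk(records, ["age_range", "region"]))  # 0.5
```

A 1/k bound of 0.5 (50%) would fall far outside any of the acceptable ranges discussed below, which is why published datasets typically require much larger equivalence classes or stronger techniques such as differential privacy.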

Risk Levels: A Practical Framework

While precise quantification is difficult, establishing broad risk categories can be helpful for practical discussions. However, these levels should always be interpreted in consultation with legal, risk, and compliance experts.

Here's a classification of re-identification risk levels:

| Risk Level | Re-identification Probability | Description | Usefulness |
| --- | --- | --- | --- |
| Very High | > 5% (1 in 20) | Unacceptable for any sensitive data. Re-identification is relatively easy. | Generally not useful if data is considered sensitive. |
| High | 1% - 5% (1 in 100 to 1 in 20) | Generally unacceptable for sensitive data. Re-identification risk is significant. | Generally not useful if data is considered sensitive. |
| Medium | 0.01% - 1% (1 in 10,000 to 1 in 100) | May be acceptable for some less sensitive data, with strong justifications and controls. Requires careful risk assessment. | Potentially acceptable, but generally not useful if data is considered sensitive. |
| Very Low | 0.001% - 0.01% (1 in 100,000 to 1 in 10,000) | A generally acceptable target range for sensitive data, especially when robust anonymization techniques are employed. | Acceptable target range if data is sensitive or needs to be anonymized. |
| Extremely Low | < 0.001% (less than 1 in 100,000) | For most practical purposes, data can be considered anonymous, although a non-zero risk may still theoretically exist, albeit infinitesimally small. | For all intents and purposes anonymous. |
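As a sketch, these bands can be expressed as a simple classifier. The thresholds are taken directly from the table, with each boundary value assigned to the band whose range includes it (e.g., exactly 5% falls in "High", since "Very High" is strictly greater than 5%):

```python
def classify_risk(probability):
    """Map a re-identification probability (as a fraction, e.g. 0.05 = 5%)
    to the risk levels in the table above."""
    if probability > 0.05:       # > 5%
        return "Very High"
    if probability >= 0.01:      # 1% - 5%
        return "High"
    if probability >= 0.0001:    # 0.01% - 1%
        return "Medium"
    if probability >= 0.00001:   # 0.001% - 0.01%
        return "Very Low"
    return "Extremely Low"       # < 0.001%

print(classify_risk(0.02))      # High
print(classify_risk(0.00001))   # Very Low
print(classify_risk(0.000001))  # Extremely Low
```

This is an illustration of the framework only; as the surrounding text stresses, the label alone never settles whether a release is acceptable.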

Why Risk Levels Matter

Most data privacy and protection laws define de-identified or anonymous data based on the reasonable likelihood of re-identification. Here's a table summarizing key aspects of relevant regulations:

| Regulation | Definition of De-identified/Anonymous | Specific Attributes to be Removed | Quantified Risk | Practical Examples |
| --- | --- | --- | --- | --- |
| GDPR (EU) | Data where the data subject is not or no longer identifiable, considering all means "reasonably likely to be used." | No specific list. Focuses on the risk of re-identification based on direct and indirect identifiers. | No specific numerical threshold. Emphasizes minimizing risk to a level where re-identification is not reasonably likely. | Singling out, linkability, inference. k-anonymity, ENISA recommendations. |
| CCPA/CPRA (USA) | Information that cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer. | No specific list. Focuses on the "reasonable" risk of re-identification. | No specific numerical threshold. Requires "reasonable measures" to prevent re-association and a public commitment not to re-identify. | CCPA/CPRA Regulations, FPF & TRUSTe "Demystifying De-identification," IAPP resources. |
| PIPEDA (Canada) | Information where it is not reasonably foreseeable in the circumstances that it could be used, either alone or in combination with other information, to identify an individual. | No specific list. Principles-based approach based on the reasonable foreseeability of re-identification. | No specific numerical threshold. "Reasonably foreseeable" implies a serious possibility, not just a remote one. | OPC's "De-identification Guidelines," OPC's guidance on "Anonymization and the Risks of Re-identification." Example: a dataset of survey responses with age ranges, general geographic areas (e.g., province), and general interests might be considered anonymous. |
| PIPL (China) | Personal information handled to make it impossible to identify specific natural persons and impossible to restore. | No specific list. Emphasizes the irreversibility of anonymization. | No specific numerical threshold. Sets a high bar of "impossibility" for anonymization. | Articles by legal experts and firms specializing in Chinese law (e.g., Covington & Burling, Morrison & Foerster). Techniques like strong encryption of identifiers and secure deletion of linking keys might be necessary. |
| HIPAA (USA) | De-identified data: either (1) Expert Determination ("very small" risk) or (2) Safe Harbor (removal of 18 identifiers). | Safe Harbor: 18 specific identifiers (names, geographic subdivisions smaller than a state, elements of dates (except year) related to an individual, phone/fax numbers, email, SSN, medical record numbers, health plan beneficiary numbers, account numbers, certificate/license numbers, vehicle identifiers, device identifiers, URLs, IP addresses, biometric identifiers, full-face photos, any other unique identifying number). | Expert Determination: "very small" risk (no numerical threshold). Safe Harbor: no specific numerical threshold, but the removal of 18 identifiers aims to reduce the risk significantly. | HHS Guidance on De-identification, NIST resources on de-identification (NISTIR 8053), OHDSI methods and tools. Example: a dataset containing patient age ranges (e.g., 40-49), general diagnoses (e.g., diabetes), and state of residence might be considered de-identified under Safe Harbor. |
| FCRA (USA) | Not specifically defined in the context of de-identification. The FCRA focuses on "consumer reports" used for credit, insurance, employment, or other authorized purposes. | No specific list for de-identification. If the data is not used in a way that falls under the definition of a consumer report, the FCRA does not apply. | No specific threshold. The key question is whether the data constitutes a "consumer report." | CFPB guidance, legal interpretations, and case law. Example: a dataset of aggregated credit scores by geographic region, without any information that could be linked back to individual consumers, would likely not be considered a consumer report under the FCRA. |

HIPAA's 18 Identifiers (Safe Harbor):

| # | Identifier |
| --- | --- |
| 1 | Names |
| 2 | All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes (with exceptions for the initial 3 digits of ZIP codes based on population size) |
| 3 | All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age |
| 4 | Telephone numbers |
| 5 | Fax numbers |
| 6 | Email addresses |
| 7 | Social Security numbers |
| 8 | Medical record numbers |
| 9 | Health plan beneficiary numbers |
| 10 | Account numbers |
| 11 | Certificate/license numbers |
| 12 | Vehicle identifiers and serial numbers, including license plate numbers |
| 13 | Device identifiers and serial numbers |
| 14 | Web Universal Resource Locators (URLs) |
| 15 | Internet Protocol (IP) addresses |
| 16 | Biometric identifiers, including finger and voice prints |
| 17 | Full-face photographs and any comparable images |
| 18 | Any other unique identifying number, characteristic, or code |
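As an illustration only, a structured-field scrub along Safe Harbor lines might drop columns corresponding to the categories above. The field names below are hypothetical, and a real Safe Harbor determination must also handle free-text fields and the population-based ZIP code and age-over-89 rules, which cannot be reduced to a column filter:

```python
# Hypothetical column names mapped to Safe Harbor categories; a real review
# must cover all 18 categories wherever they appear, not just known columns.
SAFE_HARBOR_FIELDS = {
    "name", "street_address", "city", "zip_code", "birth_date",
    "phone", "fax", "email", "ssn", "medical_record_number",
    "health_plan_id", "account_number", "license_number",
    "vehicle_id", "device_id", "url", "ip_address",
    "biometric_id", "photo",
}

def drop_safe_harbor_fields(record):
    """Return a copy of the record with identifier fields removed."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}

patient = {"name": "Jane Doe", "zip_code": "90210", "age_range": "40-49",
           "state": "CA", "diagnosis": "diabetes"}
print(drop_safe_harbor_fields(patient))
# {'age_range': '40-49', 'state': 'CA', 'diagnosis': 'diabetes'}
```

The surviving fields match the HIPAA example earlier in this section: age range, general diagnosis, and state of residence.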

Putting Risk Levels in Context

Here's a general assessment of how different risk levels might be viewed under various regulations:

| Risk Level | GDPR | PIPL | HIPAA | CCPA/CPRA | FCRA |
| --- | --- | --- | --- | --- | --- |
| > 5% (Very High) | Highly unlikely to be acceptable. | Unacceptable. | Highly unlikely to be acceptable under Expert Determination. Safe Harbor would not apply. | Highly unlikely to be considered "reasonable." | Likely considered a consumer report if used for FCRA purposes, making anonymization/de-identification insufficient. |
| 1% - 5% (High) | Highly unlikely to be acceptable. | Unacceptable. | Highly unlikely to be acceptable under Expert Determination. Safe Harbor would not apply. | Highly unlikely to be considered "reasonable." | Likely considered a consumer report if used for FCRA purposes, making anonymization/de-identification insufficient. |
| 0.01% - 1% (Medium) | Potentially acceptable with strong justifications, controls, and robust anonymization techniques. | Unlikely to be acceptable. | Might be acceptable under Expert Determination, depending on the specific data and context. | Might be considered "reasonable" for less sensitive data with strong justifications and controls. | May or may not be a consumer report, depending on the specific use. Anonymization/de-identification likely necessary but may be sufficient. |
| 0.001% - 0.01% (Very Low) | Generally acceptable target range, especially for sensitive data. | Acceptable target range, approaching the "impossibility" standard. | Likely acceptable under Expert Determination. | Likely considered "reasonable." | May or may not be a consumer report, depending on the specific use. Anonymization/de-identification likely necessary but may be sufficient. |
| < 0.001% (Extremely Low) | Acceptable. For all intents and purposes should be considered anonymous. | Acceptable. For all intents and purposes should be considered anonymous. | Acceptable under Expert Determination. | Acceptable. For all intents and purposes should be considered anonymous. | Most likely not a consumer report. Anonymization/de-identification highly likely to be sufficient. |

Examples:

  • 0.0000000125% (1 in 8 billion): This is roughly the probability of picking one specific individual at random from the entire global population. Risk at this level would generally be considered anonymous for all practical purposes.
  • 0.001% (1 in 100,000): This risk means that, on average, one person could potentially be re-identified in a dataset of 100,000 individuals. Whether this is acceptable depends on data sensitivity, context, and applicable regulations.
  • Contextual Evaluation: These numerical examples must be considered within the specific context of the data and its use. Factors like population size, data sensitivity, and potential harm from re-identification all play a role in determining acceptable risk. A seemingly low risk for a large population might still be unacceptable for a smaller, more vulnerable population.
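The arithmetic behind these examples is straightforward: under the simplifying assumption of a uniform, independent per-record risk, the expected number of re-identified individuals is the per-record risk multiplied by the dataset size. A quick sketch:

```python
def expected_reidentifications(per_record_risk, dataset_size):
    # Expected count under a uniform, independent per-record risk
    # (a simplification: real risk varies across records and contexts).
    return per_record_risk * dataset_size

# 0.001% risk across 100,000 records -> roughly one expected re-identification
print(expected_reidentifications(0.00001, 100_000))
# 1-in-8-billion risk across 8 billion people -> still roughly one
print(expected_reidentifications(1 / 8_000_000_000, 8_000_000_000))
```

This is why, as the contextual-evaluation bullet notes, a "low" percentage can still be unacceptable: the expected count scales with population size, and the harm depends on who those individuals are.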

Conclusion:

Quantifying re-identification risk is a complex but crucial aspect of data protection. While specific numerical thresholds are rarely prescribed, aiming for a very low risk level (0.01% to 0.001% or lower) is generally a good benchmark for complying with regulations and upholding ethical standards. A holistic approach that combines robust anonymization techniques, strict access controls, strong security measures, and ongoing risk assessment is essential for protecting data.