Anonymous Data

Anonymous data is information stripped of every element that could identify a natural person. Unlike pseudonymized data, which GDPR still treats as personal data, truly anonymous data falls outside GDPR, provided the process is irreversible and re-identification is not reasonably likely by any means an attacker could plausibly use. The line between pseudonymization and true anonymization matters for any analytics system claiming compliance.

Definition

Anonymous data does not relate to an identified or identifiable natural person. It is personal data transformed so the subject can no longer be picked out.

Three Criteria

The Article 29 Working Party, in Opinion 05/2014 on Anonymisation Techniques, defines three tests for true anonymity: singling out, linkability, and inference.

Singling Out

Definition:

Cannot isolate records that identify a specific person
No unique combinations of attributes point to one individual

Example:

A single record about an 85-year-old woman in a specific postal code can be singled out

Linkability

Definition:

Cannot link two records about the same subject or group
Different datasets about one person cannot be correlated

Implementation:

Strip temporal behavioral patterns
Remove stable identifiers
Break links between sources

Inference

Definition:

Cannot infer attribute values with high probability
Analysis cannot deduce subject information

Methods:

Statistical noise
Generalization of sensitive attributes
Reduced granularity
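
The singling-out test can be approximated mechanically by counting how many records share each combination of attributes. The sketch below is illustrative only: the field names (age, gender, postal_code) and the sample records are assumptions, not data from any real system.

```python
from collections import Counter

def singled_out(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination is unique."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]

# Hypothetical sample: two records share a combination, one stands alone.
records = [
    {"age": 85, "gender": "F", "postal_code": "10115", "visits": 3},
    {"age": 34, "gender": "M", "postal_code": "10115", "visits": 7},
    {"age": 34, "gender": "M", "postal_code": "10115", "visits": 2},
]
unique = singled_out(records, ["age", "gender", "postal_code"])
print(unique)
```

Only the record for the 85-year-old woman is returned, which mirrors the example above: a rare attribute combination isolates one person even though no name is stored.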

Re-identification Reality

A 2019 study published in Nature Communications (Rocher et al.) estimated that 99.98% of Americans could be correctly re-identified in any dataset using just 15 demographic characteristics, such as age, gender, and marital status. True anonymity in large datasets is hard.

Classical Techniques

K-Anonymity

K-anonymity guarantees each person cannot be distinguished from at least k-1 others on quasi-identifiers.

Core concepts:

Quasi-identifiers: indirect identifiers like age, gender, postal code
Sensitive attributes: the values being protected (for example, a diagnosis), which must remain in the dataset for analysis
Equivalence classes: groups of records sharing identical quasi-identifiers

Methods:

Generalization: reduce precision (exact age becomes age range)
Suppression: remove specific values
Anatomization: separate quasi-identifiers from sensitive attributes

graph TD
    A[Original Data] --> B[Identify Attribute Types]
    B --> C[Identifiers]
    B --> D[Quasi-identifiers]
    B --> E[Sensitive Attributes]
    C --> F[Removal]
    D --> G[Generalization/Suppression]
    E --> H[Preservation]
    F --> I[K-Anonymous Dataset]
    G --> I
    H --> I
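
The flow above can be sketched in a few lines: drop direct identifiers, generalize quasi-identifiers, keep the sensitive attribute, then measure the k the result achieves. This is a minimal sketch under assumed field names and an assumed generalization rule (10-year age bands, 3-character postal prefixes); real tools search for the least lossy generalization instead of hard-coding one.

```python
from collections import Counter

def generalize(record):
    """Drop the direct identifier and coarsen the quasi-identifiers."""
    low = (record["age"] // 10) * 10
    return {
        "age": f"{low}-{low + 9}",                        # exact age -> 10-year band
        "gender": record["gender"],
        "postal_code": record["postal_code"][:3] + "**",  # keep a 3-character prefix
        "disease": record["disease"],                     # sensitive attribute preserved
    }

def k_of(records, quasi_identifiers):
    """Size of the smallest equivalence class, i.e. the k the table achieves."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

QI = ["age", "gender", "postal_code"]

# Hypothetical raw table; "name" is the direct identifier.
raw = [
    {"name": "A", "age": 31, "gender": "F", "postal_code": "10115", "disease": "flu"},
    {"name": "B", "age": 36, "gender": "F", "postal_code": "10117", "disease": "asthma"},
    {"name": "C", "age": 52, "gender": "M", "postal_code": "20095", "disease": "flu"},
    {"name": "D", "age": 58, "gender": "M", "postal_code": "20097", "disease": "diabetes"},
]
released = [generalize(r) for r in raw]
print(k_of(raw, QI), k_of(released, QI))
```

On this sample the raw table has k = 1 (every row is unique on its quasi-identifiers) and the generalized table has k = 2.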

L-Diversity

L-diversity extends k-anonymity by requiring diversity in sensitive fields. Each equivalence class must contain at least l different values for each sensitive attribute.

Variants:

Distinct l-diversity:

Simplest form, requires l different values

Entropy l-diversity:

Based on entropy of the distribution
Stricter than distinct l-diversity

Recursive (c,l)-diversity:

Caps how often the most frequent value appears

L-diversity Example

In a medical database with quasi-identifiers (age, gender, city) and sensitive attribute (disease), l-diversity with l=3 means every group sharing age, gender, and city must contain at least 3 different diseases.
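
A check for distinct l-diversity follows directly from the definition. The sketch below uses assumed column names and invented rows; it reports the smallest number of distinct sensitive values found in any equivalence class.

```python
from collections import defaultdict

def distinct_l(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any equivalence class."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values())

# Hypothetical table: both groups are 3-anonymous, but only one is diverse.
rows = [
    {"age": "30-39", "gender": "F", "city": "Berlin", "disease": "flu"},
    {"age": "30-39", "gender": "F", "city": "Berlin", "disease": "asthma"},
    {"age": "30-39", "gender": "F", "city": "Berlin", "disease": "diabetes"},
    {"age": "50-59", "gender": "M", "city": "Hamburg", "disease": "flu"},
    {"age": "50-59", "gender": "M", "city": "Hamburg", "disease": "flu"},
    {"age": "50-59", "gender": "M", "city": "Hamburg", "disease": "flu"},
]
l = distinct_l(rows, ["age", "gender", "city"], "disease")
print(l)
```

The result here is 1: the second group satisfies k-anonymity with k = 3, yet everyone in it has the same disease, so knowing a person belongs to that group reveals their diagnosis. This is the gap l-diversity is meant to close.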

T-Closeness

T-closeness keeps the distribution of a sensitive attribute within an equivalence class close to its distribution across the whole dataset.

How it works:

Earth Mover's Distance measures distance between distributions
Local distribution stays close to global distribution
Defends against attacks based on knowing the global distribution

Strengths:

Defeats similarity attacks
Accounts for how semantically close sensitive values are to each other
Stronger guarantees than l-diversity
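
As a rough illustration of the distance check, the sketch below computes the t that a table achieves for a categorical sensitive attribute. It assumes every pair of categories is equally far apart, in which case Earth Mover's Distance reduces to half the L1 distance between the two distributions; ordered or hierarchical attributes need a different ground distance. The sample classes are invented.

```python
from collections import Counter

def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def emd_equal_distance(p, q):
    """EMD when all categories are at distance 1 from each other:
    this equals half the sum of absolute probability differences."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

def achieved_t(classes):
    """Largest distance between any class distribution and the overall one."""
    overall = distribution([v for values in classes for v in values])
    return max(emd_equal_distance(distribution(c), overall) for c in classes)

# Hypothetical equivalence classes, each listing its sensitive values.
classes = [
    ["flu", "flu", "asthma", "diabetes"],
    ["flu", "flu", "flu", "flu"],
]
t = achieved_t(classes)
print(t)
```

A table satisfies t-closeness for a chosen threshold when this value does not exceed it; on the sample above the achieved value is 0.25.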

Modern Methods

Differential Privacy

Differential Privacy provides mathematical guarantees by adding controlled noise to query results.

Formal definition:

A randomized mechanism M provides ε-differential privacy if, for all datasets D1 and D2 differing in one record, and for every set of possible outputs S:

Pr[M(D1) ∈ S] ≤ e^ε × Pr[M(D2) ∈ S]
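
One common way to meet this definition for a counting query is the Laplace mechanism: adding or removing one record changes a count by at most 1, so Laplace noise with scale 1/ε is enough. The sketch below uses only the standard library; the function names and the sample data are assumptions made for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Noisy count. A counting query has sensitivity 1, so scale = 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(0)  # seeded only to make the sketch repeatable
ages = [23, 35, 41, 52, 67, 71, 29, 38]  # hypothetical data
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(noisy)
```

A smaller ε means a larger noise scale and stronger privacy at the cost of accuracy. Each released answer also spends part of a privacy budget, which this sketch does not track.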

Variants:

Centralized DP (CDP):

Trusted server adds noise to aggregates
Better accuracy
Requires central trust

Local DP (LDP):

Noise added on the client before transmission
Stronger trust model: the server never sees raw values
Lower accuracy

Shuffle Model:

Hybrid of CDP and LDP
Anonymization through shuffling
Balanced tradeoff
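
The local model can be illustrated with randomized response, a classic mechanism for a single yes/no value. Each client reports the truth with probability e^ε / (1 + e^ε) and the opposite otherwise, and the server corrects for the known flip rate. This is a sketch with invented data and hypothetical function names, not a description of any particular product.

```python
import math
import random

def randomize(bit, epsilon, rng):
    """Client side: keep the true bit with probability e^eps / (1 + e^eps)."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def estimate_rate(reports, epsilon):
    """Server side: remove the bias introduced by the random flips."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

rng = random.Random(42)  # seeded only to make the sketch repeatable
true_bits = [1] * 3000 + [0] * 7000  # hypothetical population, true rate 0.3
reports = [randomize(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = estimate_rate(reports, epsilon=1.0)
print(estimate)
```

The server only ever sees the noisy reports, yet the estimate lands close to the true rate. The price is variance: the same accuracy needs far more users than in the centralized model.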

Synthetic Data

Synthetic data generation creates new datasets that preserve statistical properties without retaining information about specific people.

Methods:

Generative Adversarial Networks (GANs):

Train generative models on real data
Produce samples with similar properties
No direct copying of records, which lowers (but does not eliminate) re-identification risk, since a model can memorize training examples

Variational Autoencoders (VAEs):

Encode data into latent space
Generate new points from the latent distribution
Tunable similarity vs privacy

Federated Learning

Federated learning trains models across devices without centralizing raw data.

How it works:

Local training on user devices
Only model parameters travel to the server
Updates are aggregated without raw data access

Reinforcements:

Secure Aggregation: cryptographic protection of parameters
Differential Privacy: noise on local updates
Homomorphic Encryption: compute on encrypted data
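
The training loop can be sketched with a one-parameter model. Each client takes a gradient step on its own data, and the server averages the returned parameters weighted by sample count, in the style of federated averaging. The model (y = w × x), the learning rate, and the client data are all assumptions for illustration, and none of the reinforcements listed above are included.

```python
def local_update(weights, data, lr=0.1):
    """Client side: one least-squares gradient step for y = w * x on local data."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return [w - lr * grad]

def federated_average(updates):
    """Server side: average parameters weighted by each client's sample count."""
    total = sum(n for _, n in updates)
    size = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(size)]

# Hypothetical clients; the raw (x, y) pairs never leave these lists.
clients = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.9)],
    [(1.5, 3.0), (2.5, 5.2), (0.5, 0.9)],
]

global_weights = [0.0]
for _ in range(50):
    updates = [(local_update(global_weights, data), len(data)) for data in clients]
    global_weights = federated_average(updates)
print(global_weights)
```

Only the parameter lists reach the server. Parameters can still leak information about the training data, which is why the reinforcements above are usually layered on top.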

Homomorphic Encryption

Homomorphic encryption performs arithmetic on ciphertext without decrypting.

Types:

Partially Homomorphic Encryption (PHE):

One operation (addition or multiplication)
High performance
Limited

Somewhat Homomorphic Encryption (SWHE):

Limited operation count
Tradeoff between function and performance

Fully Homomorphic Encryption (FHE):

Unlimited computation
Data stays encrypted for the entire computation
Heavy compute cost
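
To make the partially homomorphic case concrete, the sketch below implements a toy version of the Paillier scheme, which supports addition on ciphertexts. It is for illustration only: the primes are far too small to be secure, the function names are hypothetical, and production code should use a vetted library. It needs Python 3.8 or later for modular inverses via pow.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier keys from two small primes (NOT a secure key size)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to n + 1
    return (n,), (n, lam, mu)

def encrypt(public_key, m, rng):
    """c = (n + 1)^m * r^n mod n^2, with random r coprime to n."""
    (n,) = public_key
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(private_key, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    n, lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(public_key, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    (n,) = public_key
    return (c1 * c2) % (n * n)

public_key, private_key = keygen(1009, 1013)
rng = random.Random(7)  # seeded only to make the sketch repeatable
c1 = encrypt(public_key, 120, rng)
c2 = encrypt(public_key, 45, rng)
total = decrypt(private_key, add_encrypted(public_key, c1, c2))
print(total)
```

The party that multiplies c1 and c2 never sees 120 or 45, yet the key holder decrypts the product to their sum, 165.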

Attacks and Countermeasures

Attack Types

Linkage Attacks:

External sources cross-referenced for re-identification
Correlation across anonymized datasets
Temporal pattern correlation

Homogeneity Attacks:

Exploit lack of diversity in sensitive attributes
Infer details from groups with similar traits

Background Knowledge Attacks:

Combine prior knowledge with anonymized data
Pair public and anonymized records
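
A linkage attack needs nothing more than a join on shared quasi-identifiers. The sketch below uses two invented datasets: a "de-identified" medical extract and a public register that includes names. All field names and values are hypothetical.

```python
def link(anonymized, public, quasi_identifiers):
    """Re-identify rows whose quasi-identifiers match exactly one public record."""
    index = {}
    for person in public:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    hits = {}
    for row in anonymized:
        matches = index.get(tuple(row[q] for q in quasi_identifiers), [])
        if len(matches) == 1:  # an unambiguous match exposes the row's owner
            hits[matches[0]] = row["diagnosis"]
    return hits

QI = ["birth_year", "gender", "postal_code"]

# Names were removed from the medical extract, but quasi-identifiers remain.
medical = [
    {"birth_year": 1940, "gender": "F", "postal_code": "10115", "diagnosis": "diabetes"},
    {"birth_year": 1990, "gender": "M", "postal_code": "10115", "diagnosis": "asthma"},
]
register = [
    {"name": "Person A", "birth_year": 1940, "gender": "F", "postal_code": "10115"},
    {"name": "Person B", "birth_year": 1990, "gender": "M", "postal_code": "10115"},
    {"name": "Person C", "birth_year": 1990, "gender": "M", "postal_code": "10115"},
]
exposed = link(medical, register, QI)
print(exposed)
```

Person A is re-identified because her combination is unique in both sources. The second medical row matches two register entries, so it stays ambiguous, which is exactly the protection k-anonymity aims for.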

Arms Race

Re-identification gets easier:

ML for pattern matching
More public datasets
Better correlation analysis

Defenses evolve:

Privacy-enhancing technologies (PETs)
Stronger mathematical guarantees
Updated standards

Recommendations

Choosing a Method

Statistical reporting:

Differential Privacy for aggregates
Synthetic data for detailed analysis
Federated Learning for distributed compute

Machine learning:

Federated Learning with differential privacy
Homomorphic encryption for sensitive compute
Privacy-preserving synthetic data

Research:

Combine techniques
Reassess re-identification risk regularly
Apply data minimization

Evaluating Effectiveness

Motivated Intruder Test

Ask whether a motivated, reasonably competent person without specialist skills could re-identify individuals using the resources available to them. If yes, the data is not truly anonymous.

Risk factors:

Rarity of attribute combinations
Available external sources
Attacker capabilities
Stability over time

Continuous monitoring:

Reassess re-identification risks
Track new attack techniques
Update methods

Anonymization is a moving target. Traditional de-identification often falls short against ML-based re-identification. Real anonymization combines multiple techniques, regular risk assessment, and privacy-by-design.

Statable researches advanced anonymization for analytics: differential privacy for statistical reporting, federated learning for distributed analysis, and synthetic data for detailed work, all aligned with international data protection standards.

