Anonymous Data
Anonymous data is information stripped of every element that could identify a natural person. Unlike pseudonymized data, truly anonymous data falls outside the scope of GDPR, provided the process is irreversible and the person can no longer be identified by any means reasonably likely to be used (Recital 26). The line between pseudonymization and true anonymization matters for any analytics system claiming compliance.
Definition
Anonymous data does not relate to an identified or identifiable natural person. It is personal data transformed so the subject can no longer be picked out.
Three Criteria
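The Article 29 Working Party (Opinion 05/2014 on Anonymisation Techniques) defines three tests for true anonymity: singling out, linkability, and inference.
Singling Out
Definition: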
- Cannot isolate records that identify a specific person
- No unique combinations of attributes point to one individual
Example:
- A single record about an 85-year-old woman in a specific postal code can be singled out
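To make the singling-out test concrete, here is a minimal Python sketch that flags records whose combination of quasi-identifiers is unique in a table. The field names and sample records are illustrative assumptions, not data from a real system.

```python
from collections import Counter

def singled_out(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination occurs exactly once."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]

# Hypothetical sample table
records = [
    {"age": 85, "gender": "F", "postal_code": "10115"},
    {"age": 34, "gender": "M", "postal_code": "10115"},
    {"age": 34, "gender": "M", "postal_code": "10115"},
]

unique = singled_out(records, ["age", "gender", "postal_code"])
print(unique)  # only the 85-year-old woman's record is unique here
```

Any record returned by such a check fails the singling-out criterion and needs generalization or suppression before release.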
Linkability
Definition:
- Cannot link two records about the same subject or group
- Different datasets about one person cannot be correlated
Implementation:
- Strip temporal behavioral patterns
- Remove stable identifiers
- Break links between sources
Inference
Definition:
- Cannot infer attribute values with high probability
- Analysis cannot deduce subject information
Methods:
- Statistical noise
- Generalization of sensitive attributes
- Reduced granularity
Re-identification Reality
A 2019 study in Nature Communications (Rocher et al.) estimated that 99.98% of Americans could be correctly re-identified in any dataset using just 15 demographic attributes, such as age, gender, and marital status. True anonymity in large datasets is hard.
Classical Techniques
K-Anonymity
K-anonymity guarantees each person cannot be distinguished from at least k-1 others on quasi-identifiers.
Core concepts:
- Quasi-identifiers: indirect identifiers like age, gender, postal code
- Sensitive attributes: protected information that must remain in the dataset for analysis (for example, a diagnosis)
- Equivalence classes: groups of records sharing identical quasi-identifiers
Methods:
- Generalization: reduce precision (exact age becomes age range)
- Suppression: remove specific values
- Anatomization: separate quasi-identifiers from sensitive attributes
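As a rough illustration of generalization, the Python sketch below measures the k a table achieves (the size of its smallest equivalence class) before and after coarsening the quasi-identifiers. The column names, the 10-year age bands, and the 3-digit postal prefix are assumptions chosen for the example.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def k_of(records, quasi_identifiers):
    """Size of the smallest equivalence class, i.e. the k the table achieves."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

# Hypothetical raw table: every quasi-identifier combination is unique
raw = [
    {"age": 31, "postal_code": "10115", "disease": "flu"},
    {"age": 34, "postal_code": "10117", "disease": "asthma"},
    {"age": 38, "postal_code": "10119", "disease": "diabetes"},
    {"age": 52, "postal_code": "20095", "disease": "flu"},
    {"age": 57, "postal_code": "20099", "disease": "angina"},
]

# Generalize both quasi-identifiers; the sensitive attribute is preserved
generalized = [
    {**r, "age": generalize_age(r["age"]), "postal_code": r["postal_code"][:3] + "**"}
    for r in raw
]

qi = ["age", "postal_code"]
k_before = k_of(raw, qi)          # 1: every record stands alone
k_after = k_of(generalized, qi)   # 2: the smallest class now holds two records
print(k_before, k_after)
```

If a required k is still not met after generalization, the remaining small classes are candidates for suppression.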
graph TD
A[Original Data] --> B[Identify Attribute Types]
B --> C[Identifiers]
B --> D[Quasi-identifiers]
B --> E[Sensitive Attributes]
C --> F[Removal]
D --> G[Generalization/Suppression]
E --> H[Preservation]
F --> I[K-Anonymous Dataset]
G --> I
H --> I
L-Diversity
L-diversity extends k-anonymity by requiring diversity in sensitive fields. Each equivalence class must contain at least l different values for each sensitive attribute.
Variants:
Distinct l-diversity:
- Simplest form, requires l different values
Entropy l-diversity:
- Based on entropy of the distribution
- Stricter
Recursive (c,l)-diversity:
- Caps how often the most frequent value appears
L-diversity Example
In a medical database with quasi-identifiers (age, gender, city) and sensitive attribute (disease), l-diversity with l=3 means every group sharing age, gender, and city must contain at least 3 different diseases.
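The medical example can be checked mechanically. The sketch below, under assumed column names and made-up records, tests distinct l-diversity and, as a stricter alternative, entropy l-diversity (the class entropy must be at least log(l), which is the same as exp(entropy) >= l).

```python
import math
from collections import Counter, defaultdict

def equivalence_classes(records, quasi_identifiers, sensitive):
    """Group sensitive values by their quasi-identifier combination."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return groups

def distinct_l(values):
    """Number of different sensitive values in a class."""
    return len(set(values))

def entropy_l(values):
    """exp(entropy): the effective number of equally likely sensitive values."""
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return math.exp(entropy)

def is_l_diverse(records, quasi_identifiers, sensitive, l, measure=distinct_l):
    groups = equivalence_classes(records, quasi_identifiers, sensitive)
    # Small tolerance guards against floating-point rounding in entropy_l
    return all(measure(values) >= l - 1e-9 for values in groups.values())

def row(age, disease):
    return {"age": age, "gender": "F", "city": "Berlin", "disease": disease}

# Hypothetical table: one skewed class and one evenly spread class
records = (
    [row("30-39", d) for d in ["flu", "flu", "flu", "asthma", "diabetes"]]
    + [row("40-49", d) for d in ["flu", "asthma", "diabetes"]]
)

qi = ["age", "gender", "city"]
distinct_ok = is_l_diverse(records, qi, "disease", 3)
entropy_ok = is_l_diverse(records, qi, "disease", 3, measure=entropy_l)
print(distinct_ok, entropy_ok)
```

The skewed class contains three different diseases, so it passes the distinct test, but because "flu" dominates it fails the entropy test.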
T-Closeness
T-closeness keeps the distribution of a sensitive attribute within an equivalence class close to its distribution across the whole dataset.
How it works:
- Earth Mover's Distance measures distance between distributions
- Local distribution stays close to global distribution
- Defends against attacks based on knowing the global distribution
Strengths:
- Defeats similarity attacks
- Accounts for semantic meaning
- Stronger guarantees than l-diversity
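A small sketch of the distance check follows. It assumes a categorical sensitive attribute where every pair of values is equally far apart; under that assumption the Earth Mover's Distance reduces to half the sum of absolute differences between the two distributions. Ordered or hierarchical attributes need a different ground distance, which this example does not cover. The classes and the threshold t = 0.3 are made up.

```python
from collections import Counter

def distribution(values, domain):
    """Relative frequency of each domain value, in domain order."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in domain]

def emd_equal_distance(p, q):
    """EMD when all values are equally distant: half the L1 difference."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def max_distance(classes):
    """Largest distance between any class distribution and the overall one."""
    everything = [v for values in classes for v in values]
    domain = sorted(set(everything))
    overall = distribution(everything, domain)
    return max(
        emd_equal_distance(distribution(values, domain), overall)
        for values in classes
    )

# Hypothetical sensitive values, already grouped into equivalence classes
classes = [
    ["flu", "flu", "flu", "asthma"],
    ["asthma", "asthma", "flu", "diabetes"],
]

t = 0.3
worst = max_distance(classes)
satisfies_t_closeness = worst <= t
print(worst, satisfies_t_closeness)
```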
Modern Methods
Differential Privacy
Differential Privacy provides mathematical guarantees by adding controlled noise to query results.
Formal definition:
A mechanism M provides ε-differential privacy if, for all datasets D1 and D2 differing in one record and for every set of possible outputs S:
Pr[M(D1) ∈ S] ≤ e^ε × Pr[M(D2) ∈ S]
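One common way to satisfy this definition for a counting query is the Laplace mechanism: add noise drawn from a Laplace distribution with scale sensitivity/ε. The sketch below is illustrative only. The function names and sample data are assumptions, and naive floating-point sampling like this is known to be unsuitable for production use.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, built as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Noisy count of the records that match the predicate."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical data: ages of eight people, three of them 65 or older
ages = [23, 35, 41, 52, 67, 71, 29, 80]
rng = random.Random()
noisy = private_count(ages, lambda age: age >= 65, epsilon=0.5, rng=rng)
print(noisy)  # varies per run, centered on the true count of 3
```

A smaller ε means a larger noise scale: stronger privacy at the cost of accuracy.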
Variants:
Central Differential Privacy (CDP):
- Trusted server adds noise to aggregates
- Better accuracy
- Requires central trust
Local Differential Privacy (LDP):
- Noise added on the client before transmission
- Stronger guarantees, since no trusted server is needed
- Lower accuracy
Shuffle Model:
- Hybrid of CDP and LDP
- Anonymization through shuffling
- Balanced tradeoff
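For the variant where noise is added on the client before transmission, randomized response is a classic minimal example. Each client reports its true yes/no answer with probability e^ε / (1 + e^ε) and the opposite answer otherwise; the server then corrects for the known flip rate. The function names and the simulated population below are assumptions for illustration.

```python
import math
import random

def truth_probability(epsilon):
    """Probability of reporting the true bit under epsilon-local DP."""
    return math.exp(epsilon) / (1 + math.exp(epsilon))

def randomize(bit, epsilon, rng):
    """Client side: keep the true bit or flip it before sending."""
    return bit if rng.random() < truth_probability(epsilon) else 1 - bit

def estimate_rate(reports, epsilon):
    """Server side: unbiased estimate of the true share of 1s."""
    p = truth_probability(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

# Simulated population: 30% of 50,000 clients hold the value 1
rng = random.Random()
true_bits = [1] * 15000 + [0] * 35000
reports = [randomize(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = estimate_rate(reports, epsilon=1.0)
print(estimate)  # close to 0.3, with sampling error
```

The server never sees a reliable individual answer, which is why accuracy is lower than in the central model for the same ε.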
Synthetic Data
Synthetic data generation creates new datasets that preserve statistical properties without retaining information about specific people.
Methods:
Generative Adversarial Networks (GANs):
- Train generative models on real data
- Produce samples with similar properties
- No direct copying, lower re-identification risk
Variational Autoencoders (VAEs):
- Encode data into latent space
- Generate new points from the latent distribution
- Tunable similarity vs privacy
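Training a GAN or VAE is beyond a short example, so the sketch below swaps in a far simpler stand-in: fit an independent distribution per column and sample new rows from it. It is neither of the methods listed above, it discards correlations between columns, and it carries no formal privacy guarantee. It only illustrates the idea of releasing generated rows instead of real ones. All names and data are hypothetical.

```python
import random
import statistics
from collections import Counter

def fit(records):
    """Fit one independent model per column: Gaussian or categorical."""
    model = {}
    for col in records[0]:
        values = [r[col] for r in records]
        if all(isinstance(v, (int, float)) for v in values):
            model[col] = ("gaussian", statistics.mean(values), statistics.stdev(values))
        else:
            counts = Counter(values)
            model[col] = ("categorical", list(counts.keys()), list(counts.values()))
    return model

def sample(model, n, rng):
    """Draw n synthetic rows, each column sampled independently."""
    rows = []
    for _ in range(n):
        row = {}
        for col, spec in model.items():
            if spec[0] == "gaussian":
                row[col] = rng.gauss(spec[1], spec[2])
            else:
                row[col] = rng.choices(spec[1], weights=spec[2])[0]
        rows.append(row)
    return rows

# Hypothetical real data
real = [
    {"age": 23, "city": "Berlin"}, {"age": 35, "city": "Hamburg"},
    {"age": 41, "city": "Berlin"}, {"age": 52, "city": "Munich"},
    {"age": 29, "city": "Berlin"}, {"age": 60, "city": "Hamburg"},
]

model = fit(real)
synthetic = sample(model, 1000, random.Random())
print(synthetic[0])
```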
Federated Learning
Federated learning trains models across devices without centralizing raw data.
How it works:
- Local training on user devices
- Only model parameters travel to the server
- Updates are aggregated without raw data access
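The three steps above can be sketched with federated averaging on a deliberately tiny model, y = w * x with a single weight. Each client runs a few gradient steps on its own data and sends back only the resulting weight; the server averages the weights by client sample count. The data, learning rate, and round count are assumptions for the example.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Client side: a few gradient descent steps on local (x, y) pairs."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, clients):
    """Server side: average client weights, weighted by local sample count."""
    updates = [(local_update(w_global, data), len(data)) for data in clients]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Hypothetical client datasets, all following y = 2x; they never leave the client
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(1.5, 3.0), (2.5, 5.0), (0.5, 1.0)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
print(w)  # approaches 2.0
```

Only `w` crosses the network in this sketch, but model parameters can still leak information about training data, which is what the reinforcements below address.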
Reinforcements:
- Secure Aggregation: cryptographic protection of parameters
- Differential Privacy: noise on local updates
- Homomorphic Encryption: compute on encrypted data
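To give a feel for secure aggregation, the sketch below uses pairwise additive masking: every pair of clients shares a random mask that one adds and the other subtracts, so the masks cancel in the sum while each individual update looks random to the server. In a real protocol each pair would derive its mask through key agreement and the scheme would handle dropouts; here a single function simulates all clients, and the updates are assumed to be already quantized to integers.

```python
import random

MODULUS = 2**32

def mask_updates(updates, rng, modulus=MODULUS):
    """Apply cancelling pairwise masks to integer updates (simulated centrally)."""
    masked = list(updates)
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = rng.randrange(modulus)
            masked[i] = (masked[i] + mask) % modulus  # client i adds the shared mask
            masked[j] = (masked[j] - mask) % modulus  # client j subtracts it
    return masked

# Hypothetical quantized model updates from three clients
updates = [12, 7, 30]
masked = mask_updates(updates, random.Random())

# The server only sees masked values, yet their sum is the true aggregate
aggregate = sum(masked) % MODULUS
print(masked, aggregate)
```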
Homomorphic Encryption
Homomorphic encryption performs arithmetic on ciphertext without decrypting.
Types:
Partially Homomorphic Encryption (PHE):
- One operation (addition or multiplication)
- High performance
- Limited
Somewhat Homomorphic Encryption (SWHE):
- Limited operation count
- Tradeoff between function and performance
Fully Homomorphic Encryption (FHE):
- Unlimited computation
- Data stays encrypted throughout the computation
- Heavy compute cost
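As an illustration of the partially homomorphic case, here is a toy version of the Paillier cryptosystem, which supports addition on ciphertexts. The primes are tiny and hard-coded, so this is insecure by construction and meant only to show the mechanics; the function names are assumptions, and it needs Python 3.8+ for the modular inverse via `pow`.

```python
import math
import random

def keygen(p=61, q=53):
    """Toy key generation with tiny primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, message, rng):
    """Encrypt an integer 0 <= message < n under public key n."""
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    l_value = (pow(ciphertext, lam, n * n) - 1) // n
    return (l_value * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)

rng = random.Random()
n, private_key = keygen()
c_sum = add_encrypted(n, encrypt(n, 17, rng), encrypt(n, 25, rng))
total = decrypt(n, private_key, c_sum)
print(total)  # 42, computed without decrypting the inputs
```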
Attacks and Countermeasures
Attack Types
Linkage Attacks:
- External sources cross-referenced for re-identification
- Correlation across anonymized datasets
- Temporal pattern correlation
Homogeneity Attacks:
- Exploit lack of diversity in sensitive attributes
- Infer details from groups with similar traits
Background Knowledge Attacks:
- Combine prior knowledge with anonymized data
- Pair public and anonymized records
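A linkage attack of the kind listed above can be surprisingly short. The sketch below joins a "de-identified" table with a public register on shared quasi-identifiers and keeps the rows that match exactly one named person. Both tables, the names, and the column choices are invented for the example.

```python
def link(anonymized, public, quasi_identifiers):
    """Return (name, row) pairs where a row matches exactly one public record."""
    index = {}
    for person in public:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person)

    matches = []
    for row in anonymized:
        candidates = index.get(tuple(row[q] for q in quasi_identifiers), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the row
            matches.append((candidates[0]["name"], row))
    return matches

# Hypothetical released table: names removed, quasi-identifiers kept
anonymized = [
    {"birth_year": 1939, "gender": "F", "postal_code": "10115", "diagnosis": "angina"},
    {"birth_year": 1990, "gender": "M", "postal_code": "10117", "diagnosis": "asthma"},
]

# Hypothetical public register containing the same quasi-identifiers
public = [
    {"name": "A. Example", "birth_year": 1939, "gender": "F", "postal_code": "10115"},
    {"name": "B. Sample", "birth_year": 1990, "gender": "M", "postal_code": "10117"},
    {"name": "C. Placeholder", "birth_year": 1990, "gender": "M", "postal_code": "10117"},
]

matches = link(anonymized, public, ["birth_year", "gender", "postal_code"])
reidentified = [(name, row["diagnosis"]) for name, row in matches]
print(reidentified)
```

The second row stays ambiguous only because two people share its quasi-identifiers, which is the protection k-anonymity tries to guarantee.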
Arms Race
Re-identification gets easier:
- ML for pattern matching
- More public datasets
- Better correlation analysis
Defenses evolve:
- Privacy-enhancing technologies (PETs)
- Stronger mathematical guarantees
- Updated standards
Recommendations
Choosing a Method
Statistical reporting:
- Differential Privacy for aggregates
- Synthetic data for detailed analysis
- Federated Learning for distributed compute
Machine learning:
- Federated Learning with differential privacy
- Homomorphic encryption for sensitive compute
- Privacy-preserving synthetic data
Research:
- Combine techniques
- Reassess re-identification risk regularly
- Apply data minimization
Evaluating Effectiveness
Motivated Intruder Test
Ask whether a reasonably informed person could re-identify individuals using available resources. If yes, the data is not truly anonymous.
Risk factors:
- Rarity of attribute combinations
- Available external sources
- Attacker capabilities
- Stability over time
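The first risk factor, rarity of attribute combinations, can be turned into a rough score. The sketch below assigns each record a risk of 1 divided by the size of its equivalence class, the chance of picking the right record for an attacker who knows the target is in the dataset (often called prosecutor risk). It ignores the other factors in the list, and the records are hypothetical.

```python
from collections import Counter

def record_risks(records, quasi_identifiers):
    """Per-record risk: 1 / size of the record's equivalence class."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    sizes = Counter(key(r) for r in records)
    return [1 / sizes[key(r)] for r in records]

# Hypothetical generalized table with classes of size 1, 2, and 3
records = (
    [{"age": "80-89", "gender": "F"}]
    + [{"age": "30-39", "gender": "M"}] * 2
    + [{"age": "40-49", "gender": "F"}] * 3
)

risks = record_risks(records, ["age", "gender"])
max_risk = max(risks)                 # worst case across all records
mean_risk = sum(risks) / len(risks)   # average over the table
print(max_risk, mean_risk)
```

A maximum risk of 1.0 means at least one person is unique on the chosen quasi-identifiers, so the dataset would not pass a motivated intruder test as it stands.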
Continuous monitoring:
- Reassess re-identification risks
- Track new attack techniques
- Update methods
Anonymization is a moving target. Traditional de-identification often falls short against ML-based re-identification. Real anonymization combines multiple techniques, regular risk assessment, and privacy-by-design.
Statable researches advanced anonymization methods for analytics: differential privacy for statistical reporting, federated learning for distributed analysis, and synthetic data for detailed work, all aligned with international data protection standards.