K-Anonymity, L-Diversity, T-Closeness
These three models are the backbone of group-based anonymization. They protect privacy by grouping similar records, blocking individual identification, and preserving analytical value.
K-Anonymity
K-anonymity is a property: each record is indistinguishable from at least k-1 other records on a set of quasi-identifiers. It is the foundation for stronger models.
Components
Attribute Classification
Data Types
Direct Identifiers:
- Names, social security numbers
- Email addresses, phone numbers
- Must be removed before anonymization
Quasi-identifiers:
- Age, gender, postal code
- Birth date, occupation
- Can identify indirectly
Sensitive Attributes:
- Medical info, income
- Political preferences
- Not used for grouping but protected
Methods
Generalization
Replace specific values with broader categories.
Examples:
- Age 28 becomes range 25-30
- City "Almelo" becomes province "Overijssel"
- Exact date becomes month and year
Hierarchies:
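A minimal sketch of how such generalization steps might look in Python. The function names and the `CITY_TO_PROVINCE` mapping are assumptions made for illustration, not part of any particular library. Note that this sketch uses non-overlapping five-year bins, so 28 maps to 25-29 rather than 25-30.

```python
from datetime import date

# Hypothetical one-level hierarchy: city -> province (illustrative values only).
CITY_TO_PROVINCE = {
    "Almelo": "Overijssel",
    "Enschede": "Overijssel",
    "Hengelo": "Overijssel",
}

def generalize_age(age: int, width: int = 5) -> str:
    """Map an exact age to a fixed-width, non-overlapping range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_city(city: str) -> str:
    """Climb one level in the hierarchy; unknown cities become '*'."""
    return CITY_TO_PROVINCE.get(city, "*")

def generalize_date(d: date) -> str:
    """Keep only the year and month of a date."""
    return d.strftime("%Y-%m")

print(generalize_age(28), generalize_city("Almelo"), generalize_date(date(2024, 3, 15)))
```

Non-overlapping bins keep every value in exactly one range, which makes equivalence classes unambiguous.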
Suppression
Remove values or replace them with "*" so that every remaining equivalence class has at least k records.
Strategies:
- Cell-level suppression
- Whole record removal
- Rare value suppression
Selection criteria:
- Uniqueness
- Information loss cost
- Effect on equivalence class size
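Whole-record removal can be sketched as follows: any record whose quasi-identifier combination occurs fewer than k times is dropped. The function name, the record layout (a list of dicts), and the sample rows are assumptions for illustration.

```python
from collections import Counter

def suppress_small_classes(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier combination occurs fewer than k times.

    Returns the kept records and the number of suppressed records.
    """
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    kept = [r for r in records if counts[key(r)] >= k]
    return kept, len(records) - len(kept)

# Hypothetical, already generalized records.
rows = [
    {"age": "20-29", "region": "Overijssel"},
    {"age": "20-29", "region": "Overijssel"},
    {"age": "30-39", "region": "Overijssel"},
]
kept, removed = suppress_small_classes(rows, ["age", "region"], k=2)
print(len(kept), removed)
```

The count of suppressed records is returned because it is a direct measure of the information loss this strategy causes.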
Web Data k-anonymization
Original:
| User ID | Age | City | Country | Page Views |
|---|---|---|---|---|
| 1 | 23 | Almelo | Netherlands | 45 |
| 2 | 25 | Enschede | Netherlands | 67 |
| 3 | 28 | Hengelo | Netherlands | 23 |
| 4 | 31 | Almelo | Netherlands | 89 |
After 2-anonymization:
| Age Range | Region | Country | Page Views |
|---|---|---|---|
| 23-25 | Overijssel | Netherlands | 45 |
| 23-25 | Overijssel | Netherlands | 67 |
| 28-31 | Overijssel | Netherlands | 23 |
| 28-31 | Overijssel | Netherlands | 89 |
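The released table can be checked mechanically. The sketch below groups rows by their quasi-identifiers and reports the smallest equivalence class. The column keys and function names are assumptions chosen for this example.

```python
from collections import Counter

def smallest_class_size(rows, quasi_identifiers):
    """Size of the smallest equivalence class (0 for an empty table)."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values(), default=0)

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_class_size(rows, quasi_identifiers) >= k

released = [
    {"age_range": "23-25", "region": "Overijssel", "country": "Netherlands", "page_views": 45},
    {"age_range": "23-25", "region": "Overijssel", "country": "Netherlands", "page_views": 67},
    {"age_range": "28-31", "region": "Overijssel", "country": "Netherlands", "page_views": 23},
    {"age_range": "28-31", "region": "Overijssel", "country": "Netherlands", "page_views": 89},
]
QUASI_IDENTIFIERS = ["age_range", "region", "country"]

print(is_k_anonymous(released, QUASI_IDENTIFIERS, 2))
print(is_k_anonymous(released, QUASI_IDENTIFIERS, 3))
```

Page views are left out of the grouping key because they are the attribute being analyzed, not a quasi-identifier.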
Limitations
Homogeneity Attack
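If every record in an equivalence class shares the same sensitive value, k-anonymity does not help: anyone who can place a person in that class learns the value without singling out a record.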
Background Knowledge Attack
Attackers narrow possible sensitive values using outside information.
graph TD
A[Raw Data] --> B[Remove Direct Identifiers]
B --> C[Select Quasi-identifiers]
C --> D[Group into Equivalence Classes]
D --> E{Each Class Size ≥ k?}
E -->|No| F[Additional Generalization/Suppression]
F --> D
E -->|Yes| G[K-anonymous Data]
L-Diversity
L-diversity extends k-anonymity by requiring at least l different sensitive-attribute values in each equivalence class.
Variants
Distinct l-diversity
Each class contains at least l distinct sensitive values.
Pros:
- Simple
- Effective against homogeneity attacks
Cons:
- Ignores frequency
- Weak with skewed distributions
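Counting distinct sensitive values per class takes only a few lines. This is a sketch under assumed names (`min_distinct_sensitive_values`, `is_distinct_l_diverse`) with hypothetical records. The first class below is homogeneous, so the table is 2-anonymous but not 2-diverse.

```python
from collections import defaultdict

def min_distinct_sensitive_values(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any equivalence class."""
    classes = defaultdict(set)
    for row in rows:
        classes[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return min((len(values) for values in classes.values()), default=0)

def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every class holds at least l distinct sensitive values."""
    return min_distinct_sensitive_values(rows, quasi_identifiers, sensitive) >= l

# Hypothetical records: the "23-25" class is homogeneous.
rows = [
    {"age": "23-25", "condition": "flu"},
    {"age": "23-25", "condition": "flu"},
    {"age": "28-31", "condition": "flu"},
    {"age": "28-31", "condition": "asthma"},
]
print(min_distinct_sensitive_values(rows, ["age"], "condition"))
```

Because only distinct values are counted, a class with 99 "flu" records and one "asthma" record would still pass for l = 2, which is the frequency weakness listed above.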
Entropy l-diversity
Uses entropy to measure diversity.
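Formula:
$$H(S) = -\sum_{s \in S} p(s) \log p(s)$$
where S is the set of sensitive values in the class and p(s) is the fraction of records in the class with value s.
Requirement: H(S) ≥ log(l)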
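The entropy requirement can be checked per class as sketched below. The function names are assumptions, the natural logarithm is used on both sides of the inequality, and a small tolerance absorbs floating-point rounding.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (natural log) of the sensitive values in one class."""
    total = len(values)
    return -sum((n / total) * math.log(n / total) for n in Counter(values).values())

def is_entropy_l_diverse(values, l, tolerance=1e-12):
    """True if H(S) >= log(l), up to a small floating-point tolerance."""
    return entropy(values) >= math.log(l) - tolerance

balanced = ["flu", "asthma"] * 5      # two values, evenly spread
skewed = ["flu"] * 9 + ["asthma"]     # two values, one dominant

print(is_entropy_l_diverse(balanced, 2))
print(is_entropy_l_diverse(skewed, 2))
```

Both classes contain two distinct values, but only the balanced one reaches the entropy of two equally likely values.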
Recursive (c, l)-diversity
Compromise between simple and entropy variants.
Principle: The most frequent value cannot dominate; less frequent values cannot vanish.
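One common formalization of this principle is recursive (c, l)-diversity: with the value counts in a class sorted in descending order as r_1 ≥ r_2 ≥ ... ≥ r_m, the class qualifies if r_1 < c · (r_l + r_{l+1} + ... + r_m). The sketch below assumes that definition. The function name and sample data are hypothetical.

```python
from collections import Counter

def is_recursive_cl_diverse(values, c, l):
    """Check r_1 < c * (r_l + ... + r_m) on descending value counts."""
    counts = sorted(Counter(values).values(), reverse=True)
    if len(counts) < l:
        return False  # fewer than l distinct values can never qualify
    return counts[0] < c * sum(counts[l - 1:])

# Hypothetical class: 5 x flu, 3 x cold, 2 x asthma.
mixed = ["flu"] * 5 + ["cold"] * 3 + ["asthma"] * 2
dominated = ["flu"] * 9 + ["cold"]

print(is_recursive_cl_diverse(mixed, c=2, l=2))
print(is_recursive_cl_diverse(dominated, c=2, l=2))
```

A larger c tolerates a more dominant top value, while a larger l requires the tail of rarer values to carry more weight.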
L-diversity Limits
Skewness Attack
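A class can satisfy l-diversity and still leak information when the overall distribution is skewed. For example, if a condition affects 1% of all users but 50% of the records in one class, membership in that class sharply raises the inferred likelihood for everyone in it.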
Similarity Attack
Distinct values with similar meaning can leak information.
T-Closeness
T-closeness requires the sensitive-attribute distribution in any class to stay close to the global distribution.
Distance Metrics
Hellinger Distance
Earth Mover's Distance (EMD)
Minimum work to transform one distribution into another.
Variational Distance
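Simple metric for categorical attributes:
$$D(P, Q) = \frac{1}{2} \sum_{i} |p_i - q_i|$$
where P is the sensitive-attribute distribution within a class and Q is the overall distribution.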
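The three metrics can be compared side by side. This is a sketch with assumed function names. Distributions are lists of probabilities over the same categories, and the EMD variant shown is the one for ordered categories, normalized by the number of steps between the first and last category.

```python
import math

def variational_distance(p, q):
    """Half the sum of absolute differences; ignores category order."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hellinger_distance(p, q):
    """Distance between the square roots of the two distributions, scaled to [0, 1]."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def emd_ordered(p, q):
    """EMD for ordered categories: accumulated surplus moved between neighbours."""
    cumulative, work = 0.0, 0.0
    for a, b in zip(p, q):
        cumulative += a - b
        work += abs(cumulative)
    return work / (len(p) - 1)

low_only = [1, 0, 0]
medium_only = [0, 1, 0]
high_only = [0, 0, 1]

# Variational distance cannot tell "one step away" from "two steps away"; EMD can.
print(variational_distance(low_only, medium_only), variational_distance(low_only, high_only))
print(emd_ordered(low_only, medium_only), emd_ordered(low_only, high_only))
```

For ordered attributes such as activity level, EMD reflects that confusing "low" with "medium" reveals less than confusing "low" with "high".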
Web Analytics Application
Activity categories:
- Low activity (< 10 pages/session)
- Medium activity (10-50 pages/session)
- High activity (> 50 pages/session)
t-closeness: the distribution of activity categories within each group stays close to the overall distribution.
Region distribution:
Each group should reflect overall geographic mix.
Session Data
Overall session duration:
- Short (< 5 min): 60%
- Medium (5-30 min): 30%
- Long (> 30 min): 10%
User group:
- Short: 55% (5% deviation)
- Medium: 35% (5% deviation)
- Long: 10% (0% deviation)
Using variational distance, D = ½ × (0.05 + 0.05 + 0) = 0.05, so for t = 0.1 the group satisfies t-closeness.
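The session example can be reproduced from raw labels. The sketch below builds the group distribution from a hypothetical list of 100 session labels that matches the percentages above and compares it with the overall distribution using variational distance. All names are assumptions.

```python
from collections import Counter

CATEGORIES = ["short", "medium", "long"]

def distribution(labels):
    """Share of each category among the given labels."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in CATEGORIES]

def satisfies_t_closeness(group_labels, overall, t):
    """Return (verdict, distance) using variational distance."""
    group = distribution(group_labels)
    distance = 0.5 * sum(abs(g - o) for g, o in zip(group, overall))
    return distance <= t, distance

overall = [0.60, 0.30, 0.10]
group_labels = ["short"] * 55 + ["medium"] * 35 + ["long"] * 10

ok, distance = satisfies_t_closeness(group_labels, overall, t=0.1)
print(ok, round(distance, 3))
```

Returning the distance alongside the verdict makes it easier to see how much headroom a group has before it violates the threshold.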
Comparison
| Criteria | K-anonymity | L-diversity | T-closeness |
|---|---|---|---|
| ID Protection | Basic | Enhanced | Advanced |
| Sensitive Attribute Protection | Weak | Good | Excellent |
| Attack Resistance | Limited | Medium | High |
| Implementation Complexity | Low | Medium | High |
| Information Loss | Minimal | Moderate | Significant |
Combined Use
Best practice applies all three in sequence:
- k-anonymity for ID protection
- l-diversity for sensitive attribute protection
- t-closeness for distribution control
graph LR
A[Raw Data] --> B[K-anonymity<br/>k=3]
B --> C[L-diversity<br/>l=2]
C --> D[T-closeness<br/>t=0.2]
D --> E[Fully Protected Data]
Implementation
Algorithms
Incognito
Features:
- Bottom-up generalization lattice
- Search for minimal k-anonymous generalization
- Efficient pruning
Example:
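def incognito_algorithm(data, k, quasi_identifiers):
    """Simplified bottom-up search over the generalization lattice.

    build_generalization_lattice, apply_generalization and is_k_anonymous
    are helper functions assumed to be defined elsewhere.
    """
    # Build lattice of all possible generalizations
    lattice = build_generalization_lattice(quasi_identifiers)
    # Bottom-up traversal: least generalized levels first
    for level in lattice.levels():
        for node in level:
            generalized_data = apply_generalization(data, node)
            if is_k_anonymous(generalized_data, k):
                return generalized_data, node
    return None  # k-anonymity unachievable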
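The function above leaves its helpers undefined. A self-contained variant might look like the sketch below. It is a plain breadth-first walk over the lattice and omits the subset-based pruning that makes Incognito efficient. The `HIERARCHIES` table, the sample rows, and all function names are assumptions for illustration.

```python
from collections import Counter
from itertools import product

CITY_TO_PROVINCE = {"Almelo": "Overijssel", "Enschede": "Overijssel", "Hengelo": "Overijssel"}

# Hypothetical hierarchies: list index = generalization level, level 0 keeps the value.
HIERARCHIES = {
    "age": [
        lambda a: str(a),
        lambda a: f"{a // 10 * 10}-{a // 10 * 10 + 9}",
        lambda a: "*",
    ],
    "city": [
        lambda c: c,
        lambda c: CITY_TO_PROVINCE.get(c, "*"),
        lambda c: "*",
    ],
}

def apply_generalization(rows, node):
    """node maps each quasi-identifier to the generalization level to apply."""
    return [
        {**row, **{q: HIERARCHIES[q][level](row[q]) for q, level in node.items()}}
        for row in rows
    ]

def is_k_anonymous(rows, quasi_identifiers, k):
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(n >= k for n in counts.values())

def lattice_search(rows, k, quasi_identifiers):
    """Visit nodes from least to most generalized; return the first k-anonymous one."""
    ranges = [range(len(HIERARCHIES[q])) for q in quasi_identifiers]
    for levels in sorted(product(*ranges), key=sum):
        node = dict(zip(quasi_identifiers, levels))
        generalized = apply_generalization(rows, node)
        if is_k_anonymous(generalized, quasi_identifiers, k):
            return generalized, node
    return None  # not reachable even at the top of the lattice

# Hypothetical input rows.
rows = [
    {"age": 23, "city": "Almelo"},
    {"age": 25, "city": "Enschede"},
    {"age": 28, "city": "Hengelo"},
    {"age": 29, "city": "Almelo"},
]
result = lattice_search(rows, k=2, quasi_identifiers=["age", "city"])
print(result[1])
```

Sorting nodes by the sum of their levels is a simple stand-in for "height in the lattice"; it finds a low node, though not necessarily the one with the least information loss.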
Mondrian
How it works:
- Recursive attribute-space partitioning
- Multi-dimensional generalization
- Information loss optimization
Pseudocode:
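A minimal Python sketch of recursive median partitioning, assuming numeric quasi-identifiers. It picks the attribute with the widest raw value range (a full implementation would normalize ranges first) and stops splitting when a cut would leave fewer than k records on either side. Function names and sample data are assumptions.

```python
def mondrian(rows, quasi_identifiers, k):
    """Recursively split rows at the median; return partitions of size >= k."""
    def spread(q):
        values = [r[q] for r in rows]
        return max(values) - min(values)

    # Try attributes from widest to narrowest value range.
    for q in sorted(quasi_identifiers, key=spread, reverse=True):
        ordered = sorted(rows, key=lambda r: r[q])
        median = ordered[len(ordered) // 2][q]
        left = [r for r in ordered if r[q] < median]
        right = [r for r in ordered if r[q] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, quasi_identifiers, k) + mondrian(right, quasi_identifiers, k)
    return [rows]  # no allowable cut: this partition becomes one equivalence class

def summarize(partition, quasi_identifiers):
    """Describe a partition by the (min, max) range of each quasi-identifier."""
    return {
        q: (min(r[q] for r in partition), max(r[q] for r in partition))
        for q in quasi_identifiers
    }

rows = [{"age": 23}, {"age": 25}, {"age": 28}, {"age": 31}]
partitions = mondrian(rows, ["age"], k=2)
print([summarize(p, ["age"]) for p in partitions])
```

On the four ages from the web data example, the sketch yields the same two groups as the 2-anonymized table: 23-25 and 28-31.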
Quality Metrics
Information Loss
- General Loss Metric (GLM): accuracy loss during generalization
- Discernibility Metric (DM): equivalence class sizes
- Normalized Certainty Penalty (NCP): normalized loss
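Two of these metrics are easy to compute by hand. The sketch below assumes the usual definitions: the discernibility metric as the sum of squared class sizes, and NCP for a numeric attribute as the interval width divided by the full attribute range, averaged over records. Function names are assumptions, and the inputs come from the 2-anonymized age ranges shown earlier.

```python
from collections import Counter

def discernibility(class_keys):
    """Sum of squared equivalence-class sizes; lower means finer-grained classes."""
    return sum(n * n for n in Counter(class_keys).values())

def ncp_numeric(intervals, global_min, global_max):
    """Average interval width relative to the full range (0 = exact, 1 = fully generalized)."""
    span = global_max - global_min
    return sum((high - low) / span for low, high in intervals) / len(intervals)

# One entry per record of the 2-anonymized table.
class_keys = ["23-25", "23-25", "28-31", "28-31"]
intervals = [(23, 25), (23, 25), (28, 31), (28, 31)]

print(discernibility(class_keys))
print(ncp_numeric(intervals, global_min=23, global_max=31))
```

The two numbers answer different questions: discernibility penalizes large classes, while NCP penalizes wide generalized ranges.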
Privacy
- Privacy Level: minimum k across all classes
- Diversity Measure: number of distinct sensitive values
- Closeness Index: maximum deviation from overall distribution
Parameter Selection
Recommended for web analytics:
- k ≥ 5 for basic user data protection
- l ≥ 2 for behavioral attributes
- t ≤ 0.3 for critical metrics
Selection factors:
- Dataset size
- Quasi-identifier count
- Information sensitivity
- Accuracy requirements
Limitations
Scalability
Big Data
Finding an optimal k-anonymization is NP-hard, and even heuristic algorithms become costly on large datasets.
High Dimensions
More quasi-identifiers means higher information loss to reach the same privacy level.
Dynamic Data
Streaming
Traditional methods need adaptation and tradeoffs.
Updates
New records can break anonymity, requiring the dataset to be re-anonymized.
Modern Attacks
Downcoding Attack
Exploits deterministic anonymization to recover original data.
Composition Attacks
Combines multiple anonymized releases to recover personal information.
Statable studied these models for analytics. Combined k-anonymity, l-diversity, and t-closeness gives the best balance of privacy and analytical value.
graph TD
A[Web User Data] --> B[Attribute Classification]
B --> C[Apply K-anonymity]
C --> D[Check L-diversity]
D --> E[Control T-closeness]
E --> F[Anonymized Data]
F --> G[Safe Analytics]
Group-based anonymization protects individual privacy while keeping data useful for analysis. Properly applied, it's a core tool in the privacy toolkit.