K-Anonymity, L-Diversity, T-Closeness

These three models are the backbone of group-based anonymization. They protect privacy by grouping similar records so that no individual can be singled out, while preserving as much analytical value as possible.

K-Anonymity

K-anonymity is a property: each record is indistinguishable from at least k-1 other records on a set of quasi-identifiers. It is the foundation for stronger models.

Components

Attribute Classification

Data Types

Direct Identifiers:

Names, social security numbers
Email addresses, phone numbers
Must be removed before anonymization

Quasi-identifiers:

Age, gender, postal code
Birth date, occupation
Can identify indirectly

Sensitive Attributes:

Medical info, income
Political preferences
Not used for grouping but protected

Methods

Generalization and Suppression

Generalization: replace specific values with broader categories.

Examples:

Age 28 becomes range 25-30
City "Almelo" becomes province "Overijssel"
Exact date becomes month and year

Hierarchies:

Specific Address
↓ (generalization)
Street
↓ (generalization)
District
↓ (generalization)
City
↓ (generalization)
Province
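The generalization steps above can be sketched in Python. This is a minimal illustration, assuming a hand-written lookup table for the city-to-province step and fixed-width, non-overlapping age bins (so age 28 lands in 25-29 rather than 25-30). The names `CITY_TO_PROVINCE`, `generalize_age`, `generalize_city`, and `generalize_date` are hypothetical, not part of any library.

```python
# Hypothetical lookup table for one level of the location hierarchy.
CITY_TO_PROVINCE = {
    "Almelo": "Overijssel",
    "Enschede": "Overijssel",
    "Hengelo": "Overijssel",
}

def generalize_age(age: int, width: int = 5) -> str:
    """Map an exact age to a fixed-width range such as '25-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_city(city: str) -> str:
    """Map a city to its province; unknown cities are replaced with '*'."""
    return CITY_TO_PROVINCE.get(city, "*")

def generalize_date(iso_date: str) -> str:
    """Keep only year and month from an ISO date string (YYYY-MM-DD)."""
    return iso_date[:7]

print(generalize_age(28), generalize_city("Almelo"), generalize_date("2024-03-17"))
```

Wider bins or a higher level in the hierarchy give larger groups at the cost of more information loss.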

Suppression: remove values or replace them with "*" so that every remaining equivalence class contains at least k records.

Strategies:

Cell-level suppression
Whole record removal
Rare value suppression

Selection criteria:

Uniqueness
Information loss cost
Effect on equivalence class size
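As a rough sketch of whole-record removal, the function below drops every record whose equivalence class is smaller than k and reports how many records were lost. The function name and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def suppress_small_classes(rows, quasi_identifiers, k):
    """Drop every record whose equivalence class has fewer than k members.

    Returns the kept records and the number of suppressed records.
    """
    def key(row):
        return tuple(row[attr] for attr in quasi_identifiers)

    sizes = Counter(key(row) for row in rows)
    kept = [row for row in rows if sizes[key(row)] >= k]
    return kept, len(rows) - len(kept)

# Illustrative data: the last record forms a class of size 1.
rows = [
    {"age": "20-29", "region": "Overijssel"},
    {"age": "20-29", "region": "Overijssel"},
    {"age": "30-39", "region": "Overijssel"},
]
kept, removed = suppress_small_classes(rows, ["age", "region"], k=2)
print(len(kept), removed)
```

The count of removed records is one simple way to track the information loss cost listed above.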

Web Data k-anonymization

Original:

User ID	Age	City	Country	Page Views
1	23	Almelo	Netherlands	45
2	25	Enschede	Netherlands	67
3	28	Hengelo	Netherlands	23
4	31	Almelo	Netherlands	89

After 2-anonymization:

Age Range	Region	Country	Page Views
23-25	Overijssel	Netherlands	45
23-25	Overijssel	Netherlands	67
28-31	Overijssel	Netherlands	23
28-31	Overijssel	Netherlands	89
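The two tables can be checked programmatically. The sketch below counts equivalence class sizes and tests whether each reaches k; the column names (`age`, `city`, `region`, and so on) are assumed spellings of the table headers, and the helper names are hypothetical.

```python
from collections import Counter

def equivalence_class_sizes(rows, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(row[attr] for attr in quasi_identifiers) for row in rows)

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    sizes = equivalence_class_sizes(rows, quasi_identifiers)
    return all(size >= k for size in sizes.values())

original = [
    {"age": 23, "city": "Almelo", "country": "Netherlands", "page_views": 45},
    {"age": 25, "city": "Enschede", "country": "Netherlands", "page_views": 67},
    {"age": 28, "city": "Hengelo", "country": "Netherlands", "page_views": 23},
    {"age": 31, "city": "Almelo", "country": "Netherlands", "page_views": 89},
]
anonymized = [
    {"age": "23-25", "region": "Overijssel", "country": "Netherlands", "page_views": 45},
    {"age": "23-25", "region": "Overijssel", "country": "Netherlands", "page_views": 67},
    {"age": "28-31", "region": "Overijssel", "country": "Netherlands", "page_views": 23},
    {"age": "28-31", "region": "Overijssel", "country": "Netherlands", "page_views": 89},
]

print(is_k_anonymous(original, ["age", "city", "country"], k=2))      # False
print(is_k_anonymous(anonymized, ["age", "region", "country"], k=2))  # True
```

Page views are treated here as the sensitive attribute, so they are not part of the grouping key.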

Limitations

Homogeneity Attack

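If every record in an equivalence class shares the same sensitive value, k-anonymity does not help: an attacker who can place a target in that class learns the value without identifying the exact record.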

Background Knowledge Attack

Attackers narrow possible sensitive values using outside information.

graph TD
    A[Raw Data] --> B[Remove Direct Identifiers]
    B --> C[Select Quasi-identifiers]
    C --> D[Group into Equivalence Classes]
    D --> E{Each Class Size ≥ k?}
    E -->|No| F[Additional Generalization/Suppression]
    F --> D
    E -->|Yes| G[K-anonymous Data]

L-Diversity

L-diversity extends k-anonymity by requiring at least l different sensitive-attribute values in each equivalence class.

Variants

Simple l-diversity, Entropy l-diversity, Recursive (c,l)-diversity

Simple l-diversity: each class contains at least l distinct sensitive values.

Pros:

Simple
Effective against homogeneity attacks

Cons:

Ignores frequency
Weak with skewed distributions
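A minimal sketch of the distinct-value check follows. It assumes records are dictionaries and uses a hypothetical sensitive attribute named `interest`; the data is illustrative only.

```python
from collections import defaultdict

def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every equivalence class holds at least l distinct sensitive values."""
    values_per_class = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in quasi_identifiers)
        values_per_class[key].add(row[sensitive])
    return all(len(values) >= l for values in values_per_class.values())

rows = [
    {"age": "23-25", "region": "Overijssel", "interest": "health"},
    {"age": "23-25", "region": "Overijssel", "interest": "sports"},
    {"age": "28-31", "region": "Overijssel", "interest": "health"},
    {"age": "28-31", "region": "Overijssel", "interest": "health"},  # homogeneous class
]
print(is_distinct_l_diverse(rows, ["age", "region"], "interest", l=2))  # False
```

The table is 2-anonymous, yet the second class is homogeneous, which is the case simple l-diversity is meant to catch.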

Entropy l-diversity: measures diversity with the entropy of the sensitive values in each class.

Formula:

H(S) = -Σ p(s) × log(p(s))

where S is the set of sensitive values in the class and p(s) is the fraction of records in the class that have value s.

Requirement: H(S) ≥ log(l)
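The entropy requirement can be sketched for a single equivalence class as follows. The natural logarithm is used on both sides of the inequality, and a small tolerance is added (an assumption of this sketch) so that a perfectly balanced class is not rejected by floating-point rounding.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (natural log) of the sensitive values in one class."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def is_entropy_l_diverse(values, l, tolerance=1e-12):
    """True if H(S) >= log(l) for this class, within a small numeric tolerance."""
    return entropy(values) >= math.log(l) - tolerance

skewed = ["health"] * 9 + ["sports"]   # two distinct values, but one dominates
balanced = ["health", "sports"]

print(is_entropy_l_diverse(skewed, l=2))    # False
print(is_entropy_l_diverse(balanced, l=2))  # True
```

The skewed class passes simple 2-diversity but fails the entropy version, because frequency now matters.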

Recursive (c,l)-diversity: a compromise between the simple and entropy variants.

Principle: The most frequent value cannot dominate; less frequent values cannot vanish.
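One way to make this principle concrete is the usual formal condition for recursive (c,l)-diversity: with sensitive-value counts sorted in descending order r_1 ≥ r_2 ≥ ... ≥ r_m, a class qualifies when r_1 < c × (r_l + r_(l+1) + ... + r_m). The sketch below checks that condition for one class; the function name and sample values are illustrative.

```python
from collections import Counter

def is_recursive_cl_diverse(values, c, l):
    """Check r_1 < c * (r_l + ... + r_m) for the sensitive values of one class."""
    counts = sorted(Counter(values).values(), reverse=True)
    if len(counts) < l:
        return False  # fewer than l distinct values can never qualify
    return counts[0] < c * sum(counts[l - 1:])

mixed = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
skewed = ["a"] * 9 + ["b"]

print(is_recursive_cl_diverse(mixed, c=2, l=2))   # True:  5 < 2 * (3 + 2)
print(is_recursive_cl_diverse(skewed, c=2, l=2))  # False: 9 < 2 * 1 does not hold
```

A larger c tolerates a more dominant top value; a larger l demands more mass in the tail.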

L-diversity Limits

Skewness Attack

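If the overall distribution of a sensitive attribute is skewed, a class can satisfy l-diversity and still differ sharply from it. For example, if 1% of all users have a sensitive value but 50% of one class does, membership in that class reveals a much higher likelihood.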

Similarity Attack

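Distinct values with similar meaning can leak information: if all values in a class belong to the same broader category, the attacker learns that category even though the class is formally diverse.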

T-Closeness

T-closeness requires that the distance between the sensitive-attribute distribution in every equivalence class and the distribution in the whole dataset is at most a threshold t.

Distance Metrics

Hellinger Distance

d(P,Q) = (1/√2) × √(Σ(√p_i - √q_i)²)

Earth Mover's Distance (EMD)

Minimum work to transform one distribution into another.
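For an ordered attribute with m equally spaced values, EMD has a simple closed form: sum the absolute cumulative differences between the two distributions and divide by m - 1. The sketch below assumes that ordered case only; a general EMD needs a ground-distance matrix and a transport solver, which are not shown.

```python
def ordered_emd(p, q):
    """EMD between two distributions over the same m ordered values.

    Uses the closed form (1 / (m - 1)) * sum_i |sum_{j<=i} (p_j - q_j)|.
    """
    if len(p) != len(q) or len(p) < 2:
        raise ValueError("p and q must have the same length, at least 2")
    running = 0.0  # cumulative surplus that still has to be moved right
    work = 0.0
    for p_i, q_i in zip(p, q):
        running += p_i - q_i
        work += abs(running)
    return work / (len(p) - 1)

print(ordered_emd([1, 0, 0], [0, 1, 0]))  # 0.5: mass moves one step
print(ordered_emd([1, 0, 0], [0, 0, 1]))  # 1.0: mass moves two steps
```

Unlike the other two metrics, this one grows with how far the probability mass has to travel.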

Variational Distance

Simple metric for categorical attributes:

d(P,Q) = (1/2) × Σ|p_i - q_i|
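The Hellinger and variational formulas translate directly into code. This sketch assumes both distributions are given as equal-length sequences of probabilities in the same category order.

```python
import math

def hellinger_distance(p, q):
    """(1 / sqrt(2)) * sqrt(sum((sqrt(p_i) - sqrt(q_i))^2))."""
    squared = sum((math.sqrt(p_i) - math.sqrt(q_i)) ** 2 for p_i, q_i in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

def variational_distance(p, q):
    """(1 / 2) * sum(|p_i - q_i|)."""
    return 0.5 * sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))

same = [0.5, 0.5]
disjoint_p, disjoint_q = [1.0, 0.0], [0.0, 1.0]

print(hellinger_distance(same, same), variational_distance(same, same))
print(hellinger_distance(disjoint_p, disjoint_q), variational_distance(disjoint_p, disjoint_q))
```

Both metrics range from 0 for identical distributions to 1 for distributions with no overlap.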

Web Analytics Application

Behavioral Data and Geographic Data

Behavioral data, activity categories:

Low activity (< 10 pages/session)
Medium activity (10-50 pages/session)
High activity (> 50 pages/session)

t-closeness: the activity distribution in each group stays close to the overall distribution.

Geographic data, region distribution:

Each group should reflect the overall geographic mix.

Session Data

Overall session duration:

Short (< 5 min): 60%
Medium (5-30 min): 30%
Long (> 30 min): 10%

User group:

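Short: 55% (5 percentage points below overall)
Medium: 35% (5 percentage points above overall)
Long: 10% (no deviation)

With the variational distance, d = (1/2) × (0.05 + 0.05 + 0) = 0.05. If t = 0.1, the group satisfies t-closeness.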
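The session example can be reproduced with a short check. This sketch assumes the variational distance as the metric and represents each distribution as a dictionary from category to share; the function names are hypothetical.

```python
def variational_distance(p, q):
    """Half the sum of absolute differences over all categories in either distribution."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

def satisfies_t_closeness(group, overall, t):
    """True if the group's distribution is within distance t of the overall one."""
    return variational_distance(group, overall) <= t

overall = {"short": 0.60, "medium": 0.30, "long": 0.10}
group = {"short": 0.55, "medium": 0.35, "long": 0.10}

distance = variational_distance(group, overall)
print(round(distance, 4), satisfies_t_closeness(group, overall, t=0.1))
```

The distance comes out at roughly 0.05, so the group passes for t = 0.1 but would fail a stricter threshold such as t = 0.01.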

Comparison

Criteria	K-anonymity	L-diversity	T-closeness
ID Protection	Basic	Enhanced	Advanced
Sensitive Attribute Protection	Weak	Good	Excellent
Attack Resistance	Limited	Medium	High
Implementation Complexity	Low	Medium	High
Information Loss	Minimal	Moderate	Significant

Combined Use

Best practice applies all three in sequence:

k-anonymity for ID protection
l-diversity for sensitive attribute protection
t-closeness for distribution control

graph LR
    A[Raw Data] --> B[K-anonymity<br/>k=3]
    B --> C[L-diversity<br/>l=2]
    C --> D[T-closeness<br/>t=0.2]
    D --> E[Fully Protected Data]

Implementation

Algorithms

Incognito Algorithm and Mondrian Algorithm

Incognito features:

Bottom-up generalization lattice
Search for minimal k-anonymous generalization
Efficient pruning

Example:

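from collections import Counter
from itertools import product

def incognito_algorithm(data, k, hierarchies):
    """Simplified bottom-up lattice search; full Incognito adds subset-based pruning.

    data: list of dicts. hierarchies: {quasi_identifier: [f0, f1, ...]}, where
    f_i maps a raw value to its level-i generalization (f0 is the identity).
    """
    attrs = list(hierarchies)
    # Build lattice of all possible generalizations: one level per quasi-identifier
    lattice = product(*(range(len(hierarchies[a])) for a in attrs))

    # Bottom-up lattice traversal: least-generalized nodes first
    for node in sorted(lattice, key=sum):
        generalized_data = [
            {**row, **{a: hierarchies[a][lvl](row[a]) for a, lvl in zip(attrs, node)}}
            for row in data
        ]
        classes = Counter(tuple(row[a] for a in attrs) for row in generalized_data)
        if all(size >= k for size in classes.values()):
            return generalized_data, node

    return None, None  # k-anonymity unachievable with these hierarchies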

How Mondrian works:

Recursive attribute-space partitioning
Multi-dimensional generalization
Information loss optimization

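Sketch (a relaxed median-split variant for numeric quasi-identifiers):

def mondrian(data, k, quasi_identifiers):
    """Split records (dicts) into classes of at least k; assumes k >= 1 and len(data) >= k."""
    def value_range(rows, attr):
        values = [row[attr] for row in rows]
        return min(values), max(values)

    if len(data) < 2 * k:
        # Too small to split further: replace each quasi-identifier with its (min, max) range
        ranges = {a: value_range(data, a) for a in quasi_identifiers}
        return [{**row, **ranges} for row in data]

    # Choose attribute for splitting: the one with the widest value range
    def width(attr):
        low, high = value_range(data, attr)
        return high - low
    split_attr = max(quasi_identifiers, key=width)

    # Split data at the median; each half keeps at least k records
    ordered = sorted(data, key=lambda row: row[split_attr])
    left, right = ordered[: len(ordered) // 2], ordered[len(ordered) // 2 :]

    # Recursive processing
    return mondrian(left, k, quasi_identifiers) + mondrian(right, k, quasi_identifiers)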

Quality Metrics

Information Loss

General Loss Metric (GLM): accuracy lost when values are generalized
Discernibility Metric (DM): penalizes each record by the size of its equivalence class
Normalized Certainty Penalty (NCP): width of each generalized range relative to the attribute's full range
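Two of these metrics are easy to compute directly. The sketch below assumes the common definitions: DM is the sum of squared equivalence class sizes (ignoring suppressed records), and NCP for one numeric attribute is the class's value range divided by the attribute's full range. The function names and sample data are illustrative.

```python
from collections import Counter

def discernibility_metric(rows, quasi_identifiers):
    """Sum of squared equivalence class sizes; lower means finer-grained data."""
    sizes = Counter(tuple(row[attr] for attr in quasi_identifiers) for row in rows)
    return sum(size ** 2 for size in sizes.values())

def numeric_ncp(class_values, all_values):
    """Range covered by one class divided by the attribute's full range (0 to 1)."""
    full_range = max(all_values) - min(all_values)
    if full_range == 0:
        return 0.0
    return (max(class_values) - min(class_values)) / full_range

rows = [
    {"age": "23-25", "region": "Overijssel"},
    {"age": "23-25", "region": "Overijssel"},
    {"age": "28-31", "region": "Overijssel"},
    {"age": "28-31", "region": "Overijssel"},
]
ages = [23, 25, 28, 31]

print(discernibility_metric(rows, ["age", "region"]))  # 8 = 2**2 + 2**2
print(numeric_ncp([23, 25], ages))                     # 0.25 = 2 / 8
print(numeric_ncp([28, 31], ages))                     # 0.375 = 3 / 8
```

Putting all four records into a single class would raise DM to 16 and NCP to 1.0, which shows how both metrics grow with coarser grouping.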

Privacy

Privacy Level: minimum k across all classes
Diversity Measure: number of distinct sensitive values
Closeness Index: maximum deviation from overall distribution

Parameter Selection

Recommended for web analytics:

k ≥ 5 for basic user data protection
l ≥ 2 for behavioral attributes
t ≤ 0.3 for critical metrics

Selection factors:

Dataset size
Quasi-identifier count
Information sensitivity
Accuracy requirements

Limitations

Scalability

Big Data

High computational complexity with large datasets.

High Dimensions

More quasi-identifiers mean higher information loss to reach the same privacy level.

Dynamic Data

Streaming

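Traditional methods assume a complete, static dataset, so streaming data requires adapted algorithms and a tradeoff between delay and information loss.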

Updates

New records can break anonymity, requiring the dataset to be re-anonymized in full.

Modern Attacks

Downcoding Attack

Exploits deterministic anonymization to recover original data.

Composition Attacks

Combines multiple anonymous releases to recover personal information.
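A toy illustration of a composition attack: two releases cover the same user but are generalized differently, and the attacker intersects the sensitive values that remain possible in each. All data here is invented for the example, and `interest` is a hypothetical sensitive attribute.

```python
def candidate_values(release, matches_target, sensitive):
    """Sensitive values found in the records that could belong to the target."""
    return {row[sensitive] for row in release if matches_target(row)}

# The attacker knows the target is 28 years old and lives in Almelo.
release_a = [
    {"age": "25-29", "region": "Overijssel", "interest": "health"},
    {"age": "25-29", "region": "Overijssel", "interest": "finance"},
    {"age": "30-34", "region": "Overijssel", "interest": "sports"},
    {"age": "30-34", "region": "Overijssel", "interest": "travel"},
]
release_b = [
    {"age": "20-29", "city": "Almelo", "interest": "health"},
    {"age": "20-29", "city": "Almelo", "interest": "travel"},
    {"age": "20-29", "city": "Enschede", "interest": "finance"},
    {"age": "20-29", "city": "Enschede", "interest": "sports"},
]

from_a = candidate_values(release_a, lambda r: r["age"] == "25-29", "interest")
from_b = candidate_values(
    release_b, lambda r: r["age"] == "20-29" and r["city"] == "Almelo", "interest"
)
print(from_a & from_b)  # {'health'}
```

Each release on its own is 2-anonymous and 2-diverse, yet the intersection leaves a single candidate value.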

Statable studied these models for web analytics and found that combining k-anonymity, l-diversity, and t-closeness gave the best balance of privacy and analytical value.

graph TD
    A[Web User Data] --> B[Attribute Classification]
    B --> C[Apply K-anonymity]
    C --> D[Check L-diversity]
    D --> E[Control T-closeness]
    E --> F[Anonymized Data]
    F --> G[Safe Analytics]

Group-based anonymization protects individual privacy while keeping data useful for analysis. Properly applied, it's a core tool in the privacy toolkit.

