K-Anonymity, L-Diversity, T-Closeness

These three models are the backbone of group-based anonymization. They protect privacy by grouping similar records, blocking individual identification, and preserving analytical value.

K-Anonymity

K-anonymity is a property: each record is indistinguishable from at least k-1 other records on a set of quasi-identifiers. It is the foundation for stronger models.

Components

Attribute Classification

Data Types

Direct Identifiers:

  • Names, social security numbers
  • Email addresses, phone numbers
  • Must be removed before anonymization

Quasi-identifiers:

  • Age, gender, postal code
  • Birth date, occupation
  • Can identify indirectly

Sensitive Attributes:

  • Medical info, income
  • Political preferences
  • Not used for grouping but protected

Methods

Generalization: replace specific values with broader categories.

Examples:

  • Age 28 becomes range 25-30
  • City "Almelo" becomes province "Overijssel"
  • Exact date becomes month and year

Hierarchies:

Specific Address
↓ (generalization)
Street
↓ (generalization)
District
↓ (generalization)
City
↓ (generalization)
Province
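The generalization examples above can be sketched in a few lines of Python. This is only an illustration: the function names, the 5-year bucket width, and the small city lookup table are assumptions made for the example, not part of any particular library.

```python
from datetime import date

# Hypothetical city-to-province lookup; a real one would cover every city in the data.
CITY_TO_PROVINCE = {
    "Almelo": "Overijssel",
    "Enschede": "Overijssel",
    "Hengelo": "Overijssel",
}

def generalize_age(age: int, width: int = 5) -> str:
    """Map an exact age to a range label; the upper bound is exclusive."""
    low = (age // width) * width
    return f"{low}-{low + width}"

def generalize_city(city: str) -> str:
    """Replace a city with its province, or "*" if the city is not in the lookup."""
    return CITY_TO_PROVINCE.get(city, "*")

def generalize_date(d: date) -> str:
    """Keep only the year and month of an exact date."""
    return d.strftime("%Y-%m")

print(generalize_age(28))                   # 25-30
print(generalize_city("Almelo"))            # Overijssel
print(generalize_date(date(2024, 3, 17)))   # 2024-03
```

Each function moves a value one step up its hierarchy; applying a coarser function (for example, a wider age bucket) trades accuracy for larger equivalence classes.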

Suppression: remove values or replace them with "*" so that undersized groups reach the required size.

Strategies:

  • Cell-level suppression
  • Whole record removal
  • Rare value suppression

Selection criteria:

  • Uniqueness
  • Information loss cost
  • Effect on equivalence class size
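Two of the strategies above, rare value suppression and whole record removal, might look like this in Python. The function names, the record layout (a list of dicts), and the sample rows are assumptions for illustration.

```python
from collections import Counter

def suppress_rare_values(records, attribute, min_count):
    """Cell-level suppression: replace values seen fewer than min_count times with "*"."""
    counts = Counter(r[attribute] for r in records)
    return [
        {**r, attribute: r[attribute] if counts[r[attribute]] >= min_count else "*"}
        for r in records
    ]

def drop_small_classes(records, quasi_identifiers, k):
    """Whole record removal: drop records whose equivalence class has fewer than k members."""
    def key(r):
        return tuple(r[a] for a in quasi_identifiers)
    sizes = Counter(key(r) for r in records)
    return [r for r in records if sizes[key(r)] >= k]

rows = [
    {"city": "Almelo", "age": "25-30"},
    {"city": "Almelo", "age": "25-30"},
    {"city": "Hengelo", "age": "25-30"},
]

masked = suppress_rare_values(rows, "city", min_count=2)
kept = drop_small_classes(rows, ["city", "age"], k=2)
print([r["city"] for r in masked])  # ['Almelo', 'Almelo', '*']
print(len(kept))                    # 2
```

Cell-level suppression keeps the record but loses one value; record removal keeps the remaining values exact but shrinks the dataset. Which costs less depends on the analysis the data is meant to support.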

Web Data k-anonymization

Original:

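| User ID | Age | City | Country | Page Views |
|---|---|---|---|---|
| 1 | 23 | Almelo | Netherlands | 45 |
| 2 | 25 | Enschede | Netherlands | 67 |
| 3 | 28 | Hengelo | Netherlands | 23 |
| 4 | 31 | Almelo | Netherlands | 89 |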

After 2-anonymization:

| Age Range | Region | Country | Page Views |
|---|---|---|---|
| 23-25 | Overijssel | Netherlands | 45 |
| 23-25 | Overijssel | Netherlands | 67 |
| 28-31 | Overijssel | Netherlands | 23 |
| 28-31 | Overijssel | Netherlands | 89 |
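A quick way to confirm the result is to count how many records share each combination of quasi-identifier values. The sketch below reproduces the two tables as lists of dicts; the helper names and key names are assumptions for this example.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count the records in each equivalence class (same quasi-identifier values)."""
    return Counter(tuple(r[a] for a in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(n >= k for n in sizes.values())

original = [
    {"age": 23, "city": "Almelo", "country": "Netherlands", "page_views": 45},
    {"age": 25, "city": "Enschede", "country": "Netherlands", "page_views": 67},
    {"age": 28, "city": "Hengelo", "country": "Netherlands", "page_views": 23},
    {"age": 31, "city": "Almelo", "country": "Netherlands", "page_views": 89},
]
anonymized = [
    {"age_range": "23-25", "region": "Overijssel", "country": "Netherlands", "page_views": 45},
    {"age_range": "23-25", "region": "Overijssel", "country": "Netherlands", "page_views": 67},
    {"age_range": "28-31", "region": "Overijssel", "country": "Netherlands", "page_views": 23},
    {"age_range": "28-31", "region": "Overijssel", "country": "Netherlands", "page_views": 89},
]

print(is_k_anonymous(original, ["age", "city", "country"], k=2))               # False
print(is_k_anonymous(anonymized, ["age_range", "region", "country"], k=2))     # True
```

In the original table every record is unique on its quasi-identifiers; after generalization each record shares its quasi-identifier values with one other record.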

Limitations

Homogeneity Attack

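If every record in an equivalence class shares the same sensitive value, an attacker who can place a person in that class learns the value without identifying the exact record, so k-anonymity alone offers no protection.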

Background Knowledge Attack

Attackers use outside information to rule out some of the sensitive values in a class and narrow down the remaining candidates.

graph TD
    A[Raw Data] --> B[Remove Direct Identifiers]
    B --> C[Select Quasi-identifiers]
    C --> D[Group into Equivalence Classes]
    D --> E{Each Class Size ≥ k?}
    E -->|No| F[Additional Generalization/Suppression]
    F --> D
    E -->|Yes| G[K-anonymous Data]

L-Diversity

L-diversity extends k-anonymity by requiring at least l different sensitive-attribute values in each equivalence class.

Variants

Distinct l-diversity: each class contains at least l distinct sensitive values.

Pros:

  • Simple
  • Effective against homogeneity attacks

Cons:

  • Ignores frequency
  • Weak with skewed distributions
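A minimal sketch of the distinct-values check, which also shows why ignoring frequency is a weakness. The function name and the sample data are assumptions for illustration.

```python
from collections import defaultdict

def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class holds at least l distinct sensitive values."""
    values_by_class = defaultdict(set)
    for r in records:
        values_by_class[tuple(r[a] for a in quasi_identifiers)].add(r[sensitive])
    return all(len(values) >= l for values in values_by_class.values())

# One class where every record has the same sensitive value (homogeneous).
homogeneous = [{"age": "20-30", "condition": "flu"} for _ in range(3)]

# One class with 99 records of one value and a single record of another.
skewed = [{"age": "20-30", "condition": "flu"} for _ in range(99)]
skewed.append({"age": "20-30", "condition": "asthma"})

print(is_distinct_l_diverse(homogeneous, ["age"], "condition", l=2))  # False
print(is_distinct_l_diverse(skewed, ["age"], "condition", l=2))       # True
```

The second class passes with l = 2 even though 99% of its records share one value, which is the situation the frequency-aware variants are meant to catch.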

Entropy l-diversity: uses the entropy of the sensitive values in each class to measure diversity.

Formula:

H(S) = -Σ_{s ∈ S} p(s) × log(p(s))
where S is the set of sensitive values in the class and p(s) is the fraction of records in the class with sensitive value s.

Requirement: H(S) ≥ log(l)
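The entropy requirement can be checked per class as follows. This is a sketch using the natural logarithm on both sides of the inequality; the function names are assumptions.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (natural log) of the sensitive values in one class."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def is_entropy_l_diverse(values, l):
    """True if the class entropy is at least log(l)."""
    return entropy(values) >= math.log(l)

skewed = ["flu"] * 99 + ["asthma"]
balanced = ["flu", "asthma", "diabetes"]

print(is_entropy_l_diverse(skewed, l=2))    # False
print(is_entropy_l_diverse(balanced, l=2))  # True
```

The skewed class has two distinct values but an entropy of roughly 0.06, far below log(2) ≈ 0.69, so it fails where a count of distinct values alone would accept it.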

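Recursive (c,l)-diversity: a compromise between the distinct and entropy variants.

Principle: The most frequent value cannot dominate; less frequent values cannot vanish. Formally, with the value counts in a class sorted so that r_1 ≥ r_2 ≥ ... ≥ r_m, the class qualifies if r_1 < c × (r_l + r_(l+1) + ... + r_m).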
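As a sketch, the recursive (c,l) condition compares the count of the most frequent value, r_1, with c times the combined count of the l-th most frequent value and everything rarer: r_1 < c × (r_l + ... + r_m). The function name and sample data below are assumptions for illustration.

```python
from collections import Counter

def is_recursive_cl_diverse(values, c, l):
    """Check r_1 < c * (r_l + ... + r_m) for the sorted value counts of one class."""
    counts = sorted(Counter(values).values(), reverse=True)
    if len(counts) < l:
        return False  # fewer than l distinct values can never qualify
    return counts[0] < c * sum(counts[l - 1:])

values = ["flu"] * 5 + ["asthma"] * 3 + ["diabetes"] * 2

print(is_recursive_cl_diverse(values, c=2, l=2))  # True:  5 < 2 * (3 + 2)
print(is_recursive_cl_diverse(values, c=1, l=3))  # False: 5 < 1 * 2 does not hold
```

A larger c tolerates a more dominant top value; a larger l demands more weight in the tail of the distribution.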

L-diversity Limits

Skewness Attack

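If the overall distribution is skewed, a class can satisfy l-diversity while its own distribution differs sharply from the overall one, letting an attacker infer a sensitive value with much higher probability than the baseline.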

Similarity Attack

Distinct values with similar meaning can leak information.

T-Closeness

T-closeness requires the sensitive-attribute distribution in any class to stay close to the global distribution.

Distance Metrics

Hellinger Distance

d(P,Q) = (1/√2) × √(Σ(√p_i - √q_i)²)
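A direct translation of the Hellinger formula, as a sketch. It assumes both distributions are given as equal-length sequences of probabilities over the same categories in the same order; the function name is an assumption.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions; ranges from 0 to 1."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

identical = hellinger_distance([0.6, 0.3, 0.1], [0.6, 0.3, 0.1])
disjoint = hellinger_distance([1.0, 0.0], [0.0, 1.0])
print(identical)  # 0.0
print(disjoint)
```

Identical distributions give 0, and distributions with no overlap give the maximum distance of 1.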

Earth Mover's Distance (EMD)

Minimum work to transform one distribution into another.
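For an ordered attribute with m categories, where moving probability mass between neighboring categories costs 1/(m-1), the EMD reduces to a sum of absolute cumulative differences. The sketch below assumes that ordered-distance setting and equal-length probability lists; the function name is an assumption.

```python
def emd_ordered(p, q):
    """EMD between two distributions over the same m ordered categories (m >= 2)."""
    m = len(p)
    cumulative = 0.0  # mass that still has to be carried to the next category
    total = 0.0
    for pi, qi in zip(p[:-1], q[:-1]):
        cumulative += pi - qi
        total += abs(cumulative)
    return total / (m - 1)

near = emd_ordered([0.55, 0.35, 0.10], [0.60, 0.30, 0.10])
far = emd_ordered([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
print(near)  # about 0.025
print(far)   # 1.0
```

Unlike metrics that treat categories as unrelated, this one charges more for moving mass across several categories than for shifting it to a neighbor.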

Variational Distance

Simple metric for categorical attributes:

d(P,Q) = (1/2) × Σ|p_i - q_i|

Web Analytics Application

Activity categories:

  • Low activity (< 10 pages/session)
  • Medium activity (10-50 pages/session)
  • High activity (> 50 pages/session)

t-closeness: the distribution of activity categories within each group stays close to the overall distribution.

Region distribution:

Each group should reflect the overall geographic mix.

Session Data

Overall session duration:

  • Short (< 5 min): 60%
  • Medium (5-30 min): 30%
  • Long (> 30 min): 10%

User group:

  • Short: 55% (5% deviation)
  • Medium: 35% (5% deviation)
  • Long: 10% (0% deviation)

The variational distance is (0.05 + 0.05 + 0) / 2 = 0.05, so for t = 0.1 the group satisfies t-closeness.
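The session example can be checked numerically with the variational distance. This is a sketch; the function name and the list ordering (short, medium, long) are assumptions.

```python
def variational_distance(p, q):
    """Half the sum of absolute differences between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

overall = [0.60, 0.30, 0.10]  # short, medium, long
group = [0.55, 0.35, 0.10]

distance = variational_distance(group, overall)
t = 0.1
satisfies_t_closeness = distance <= t

print(round(distance, 4))      # 0.05
print(satisfies_t_closeness)   # True
```

A stricter threshold such as t = 0.03 would reject the same group, which is why the choice of t matters as much as the choice of metric.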

Comparison

| Criteria | K-anonymity | L-diversity | T-closeness |
|---|---|---|---|
| ID Protection | Basic | Enhanced | Advanced |
| Sensitive Attribute Protection | Weak | Good | Excellent |
| Attack Resistance | Limited | Medium | High |
| Implementation Complexity | Low | Medium | High |
| Information Loss | Minimal | Moderate | Significant |

Combined Use

Best practice applies all three in sequence:

  1. k-anonymity for ID protection
  2. l-diversity for sensitive attribute protection
  3. t-closeness for distribution control

graph LR
    A[Raw Data] --> B[K-anonymity<br/>k=3]
    B --> C[L-diversity<br/>l=2]
    C --> D[T-closeness<br/>t=0.2]
    D --> E[Fully Protected Data]

Implementation

Algorithms

Incognito features:

  • Bottom-up generalization lattice
  • Search for minimal k-anonymous generalization
  • Efficient pruning

Example:

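from collections import Counter
from itertools import product

def incognito_algorithm(data, k, quasi_identifiers, hierarchies):
    """Simplified Incognito-style search (exhaustive, without Incognito's subset pruning).

    Assumes hierarchies[attr] is a list of functions, one per level, with level 0 the identity.
    """
    # Build lattice of all possible generalizations, lowest total level first
    levels = (range(len(hierarchies[a])) for a in quasi_identifiers)
    lattice = sorted(product(*levels), key=sum)

    # Bottom-up lattice traversal
    for node in lattice:
        generalized_data = [
            {**row, **{a: hierarchies[a][lvl](row[a]) for a, lvl in zip(quasi_identifiers, node)}}
            for row in data
        ]
        sizes = Counter(tuple(row[a] for a in quasi_identifiers) for row in generalized_data)
        if all(n >= k for n in sizes.values()):
            return generalized_data, node

    return None, None  # k-anonymity unachievable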

How Mondrian works:

  • Recursive attribute-space partitioning
  • Multi-dimensional generalization
  • Information loss optimization

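Simplified sketch (assumes numeric quasi-identifiers and records stored as dicts):

def mondrian(data, k, quasi_identifiers):
    """Return a list of partitions; each one becomes an equivalence class whose
    quasi-identifier values are then generalized to the range they span."""
    # Choose attribute for splitting: try those with the most distinct values first
    for split_attr in sorted(quasi_identifiers, key=lambda a: -len({r[a] for r in data})):
        # Split data at the median of the chosen attribute
        values = sorted(r[split_attr] for r in data)
        median = values[len(values) // 2]
        left = [r for r in data if r[split_attr] < median]
        right = [r for r in data if r[split_attr] >= median]

        # Accept the cut only if both sides keep at least k records
        if len(left) >= k and len(right) >= k:
            # Recursive processing
            return mondrian(left, k, quasi_identifiers) + mondrian(right, k, quasi_identifiers)

    # No allowable cut remains: stop splitting
    return [data]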

Quality Metrics

Information Loss

  • General Loss Metric (GLM): accuracy loss during generalization
  • Discernibility Metric (DM): equivalence class sizes
  • Normalized Certainty Penalty (NCP): normalized loss
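Two of these metrics are simple to compute. The sketch below uses the common definitions: the discernibility metric as the sum of squared equivalence class sizes, and the NCP of a numeric attribute as the width of the generalized range divided by the width of the attribute's full domain. The function names are assumptions, and the data mirrors the 2-anonymized web table.

```python
from collections import Counter

def discernibility_metric(records, quasi_identifiers):
    """Sum of squared equivalence class sizes; lower means less information loss."""
    sizes = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return sum(n * n for n in sizes.values())

def ncp_numeric(class_low, class_high, domain_low, domain_high):
    """Normalized width of a generalized numeric range (0 = exact, 1 = whole domain)."""
    return (class_high - class_low) / (domain_high - domain_low)

anonymized = [
    {"age_range": "23-25", "region": "Overijssel"},
    {"age_range": "23-25", "region": "Overijssel"},
    {"age_range": "28-31", "region": "Overijssel"},
    {"age_range": "28-31", "region": "Overijssel"},
]

print(discernibility_metric(anonymized, ["age_range", "region"]))  # 8
print(ncp_numeric(23, 25, 23, 31))                                 # 0.25
```

For comparison, putting all four records into a single class would raise the discernibility metric to 16 and the age NCP to 1.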

Privacy

  • Privacy Level: minimum k across all classes
  • Diversity Measure: number of distinct sensitive values
  • Closeness Index: maximum deviation from overall distribution
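The three privacy measures can be computed in one pass over the equivalence classes. In this sketch the closeness index is taken to be the largest variational distance between a class and the overall distribution, which is one possible reading of "maximum deviation"; the function name, dictionary keys, and sample data are assumptions.

```python
from collections import Counter, defaultdict

def privacy_metrics(records, quasi_identifiers, sensitive):
    """Return the minimum class size, minimum distinct values, and maximum deviation."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[a] for a in quasi_identifiers)].append(r[sensitive])

    overall = Counter(r[sensitive] for r in records)
    n = len(records)

    def deviation(values):
        # Variational distance between this class and the overall distribution.
        local = Counter(values)
        return 0.5 * sum(abs(local[s] / len(values) - overall[s] / n) for s in overall)

    return {
        "privacy_level": min(len(v) for v in classes.values()),
        "diversity": min(len(set(v)) for v in classes.values()),
        "closeness": max(deviation(v) for v in classes.values()),
    }

records = [
    {"age_range": "23-25", "activity": "low"},
    {"age_range": "23-25", "activity": "high"},
    {"age_range": "28-31", "activity": "low"},
    {"age_range": "28-31", "activity": "low"},
]

metrics = privacy_metrics(records, ["age_range"], "activity")
print(metrics)  # {'privacy_level': 2, 'diversity': 1, 'closeness': 0.25}
```

Here the data is 2-anonymous, but the second class is homogeneous (diversity 1), so it would fail any l-diversity requirement with l ≥ 2.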

Parameter Selection

Recommended for web analytics:

  • k ≥ 5 for basic user data protection
  • l ≥ 2 for behavioral attributes
  • t ≤ 0.3 for critical metrics

Selection factors:

  • Dataset size
  • Quasi-identifier count
  • Information sensitivity
  • Accuracy requirements

Limitations

Scalability

Big Data

Finding an optimal k-anonymization is NP-hard, and even heuristic algorithms become computationally expensive on large datasets.

High Dimensions

More quasi-identifiers means higher information loss to reach the same privacy level.

Dynamic Data

Streaming

Traditional methods need adaptation and tradeoffs.

Updates

New records can break anonymity, requiring the whole dataset to be re-anonymized.

Modern Attacks

Downcoding Attack

Exploit deterministic anonymization to recover original data.

Composition Attacks

Combine multiple anonymous releases to recover personal information.
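A toy illustration of a composition attack: if the same person appears in two independently anonymized releases and the attacker knows which equivalence class they fall into in each, intersecting the candidate sensitive values can narrow them to one. All data, attribute names, and the helper function below are hypothetical.

```python
def candidate_values(release, target_class, sensitive):
    """Sensitive values the target could have, given the class they fall into."""
    return {
        r[sensitive]
        for r in release
        if all(r[a] == v for a, v in target_class.items())
    }

release_1 = [
    {"age": "20-30", "region": "Overijssel", "interest": "health"},
    {"age": "20-30", "region": "Overijssel", "interest": "finance"},
    {"age": "20-30", "region": "Overijssel", "interest": "sports"},
]
release_2 = [
    {"age": "25-35", "region": "Netherlands", "interest": "health"},
    {"age": "25-35", "region": "Netherlands", "interest": "travel"},
    {"age": "25-35", "region": "Netherlands", "interest": "news"},
]

in_first = candidate_values(release_1, {"age": "20-30", "region": "Overijssel"}, "interest")
in_second = candidate_values(release_2, {"age": "25-35", "region": "Netherlands"}, "interest")

print(in_first & in_second)  # {'health'}
```

Each release is 3-anonymous and 3-diverse on its own, yet together they leave a single candidate value for the target.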

Statable studied these models for analytics. In our assessment, combining k-anonymity, l-diversity, and t-closeness gives the best balance of privacy and analytical value.

graph TD
    A[Web User Data] --> B[Attribute Classification]
    B --> C[Apply K-anonymity]
    C --> D[Check L-diversity]
    D --> E[Control T-closeness]
    E --> F[Anonymized Data]
    F --> G[Safe Analytics]

Group-based anonymization protects individual privacy while keeping data useful for analysis. Properly applied, it's a core tool in the privacy toolkit.
