Skip to content

Data Lineage, Catalog, Validation, and Schema

Modern organizations manage petabytes across hundreds of systems. Without the right tools this becomes chaos. Nobody knows where data came from, how it changed, or whether it's trustworthy. Lineage, catalogs, validation, and schema management create a transparent, manageable ecosystem.

Data Lineage

Lineage tracks the full data lifecycle. Source to final use. Every transformation, every move between systems.

Levels of Granularity

Three Lineage Levels

System-level

  • Movement between systems
  • ETL/ELT processes
  • Sources and sinks
  • Pipeline dependencies

Table-level

  • Relationships between tables
  • JOIN operations and aggregations
  • Schema versioning
  • Change history

Column-level

  • Field transformations
  • Business logic
  • Attribute mapping
  • Formulas and calculations

Visualization

graph TB
    subgraph "Source Systems"
        S1[CRM Database]
        S2[ERP System]
        S3[Web Analytics]
    end

    subgraph "Data Lake"
        R1[Raw Zone]
        R2[Cleaned Zone]
        R3[Curated Zone]
    end

    subgraph "Data Warehouse"
        D1[Staging]
        D2[Core DWH]
        D3[Data Marts]
    end

    subgraph "Analytics"
        A1[BI Reports]
        A2[ML Models]
        A3[Dashboards]
    end

    S1 --> R1
    S2 --> R1
    S3 --> R1
    R1 --> R2
    R2 --> R3
    R3 --> D1
    D1 --> D2
    D2 --> D3
    D3 --> A1
    D3 --> A2
    D3 --> A3

    style S1 fill:#f9f,stroke:#333,stroke-width:2px
    style A3 fill:#9f9,stroke:#333,stroke-width:2px

Collection Methods

SQL parsing:

-- Analyzing queries to extract dependencies
CREATE TABLE sales_summary AS
SELECT 
    c.customer_id,
    c.customer_name,
    SUM(o.amount) as total_sales
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name

Result: - Sources: customers, orders - Target: sales_summary - Column mappings: - customers.customer_idsales_summary.customer_id - orders.amountsales_summary.total_sales

Execution interception:

  • Query log monitoring
  • Execution plan analysis
  • Data flow tracing
  • Real-time metadata capture

Pros:

  • 100% accuracy
  • Covers dynamic queries
  • Actual data paths
  • Performance metrics

Explicit definitions:

lineage:
  source:
    - table: raw.customer_events
      columns: [user_id, event_type, timestamp]
  transformation:
    - type: filter
      condition: "event_type = 'purchase'"
    - type: aggregate
      group_by: [user_id]
      metrics: [count(*) as purchase_count]
  target:
    table: analytics.user_purchases
    columns: [user_id, purchase_count]

Use Cases

Impact analysis. What breaks when sources change?

ScenarioQuestionLineage Answer
Schema changeWhat breaks if we drop a column?All dependent objects
System migrationWhich reports use this DB?Graph of consumers
IncidentWhich data is affected?Problem propagation path
ComplianceWhere is PII stored?Full PII map

Root cause analysis. Trace issues back to their source.

Data Catalog

A catalog is a centralized metadata repository. It powers search, understanding, and governance.

Architecture

Core Components

Metadata store:

  • Technical (schemas, types, statistics)
  • Business (descriptions, owners, tags)
  • Operational (usage, quality, SLA)
  • Social (ratings, comments, FAQ)

Discovery layer:

  • Full-text search
  • Faceted navigation
  • Recommendations
  • Semantic search

Governance layer:

  • Classification
  • Access policies
  • Retention rules
  • Compliance tracking

Automated Population

Metadata harvesting:

# Example of automatic metadata collection
def harvest_metadata(connection):
    metadata = {
        'tables': [],
        'columns': [],
        'statistics': {}
    }

    # Extract schema
    tables = connection.get_tables()
    for table in tables:
        metadata['tables'].append({
            'name': table.name,
            'schema': table.schema,
            'row_count': table.count(),
            'size_mb': table.size_mb(),
            'last_modified': table.last_modified
        })

        # Profile columns
        for column in table.columns:
            metadata['columns'].append({
                'table': table.name,
                'column': column.name,
                'data_type': column.type,
                'nullable': column.nullable,
                'unique_values': column.distinct_count(),
                'null_percentage': column.null_ratio()
            })

    return metadata

Business Glossary

Map technical terms to business meaning:

Technical TermBusiness TermDefinitionOwner
cust_idCustomer IDUnique customer identifierSales Dept
arr_valueAnnual Recurring RevenueYearly recurring revenueFinance
churn_flagCustomer ChurnCustomer attrition indicatorMarketing
ltv_scoreLifetime ValuePredicted customer valueAnalytics

Modern Capabilities

Search features:

  • Natural language queries
  • Metadata filters
  • Lineage search
  • Similarity search

Example:

Query: "customer revenue last quarter"

Results:
1. analytics.quarterly_revenue (table)
   Schema: customer_id, quarter, revenue
   Update: daily
   Owner: Finance Team

2. reports.customer_revenue_q4 (dashboard)
   Source: analytics.quarterly_revenue
   Last update: 2 hours ago
   Users: 127

Social features:

  • Comments and discussions
  • Q&A
  • Quality ratings
  • Change subscriptions
  • Knowledge sharing

Crowdsourced curation:

  • User tags
  • Business descriptions
  • Usage examples
  • Best practices

AI/ML capabilities:

  • Auto-classification
  • PII detection
  • Similar dataset suggestions
  • Metadata anomalies
  • Usage prediction

Data Validation

Validation ensures data meets the rules before use.

Validation Levels

graph TD
    A[Incoming Data] --> B{Syntactic Validation}
    B -->|Pass| C{Semantic Validation}
    B -->|Fail| E[Rejection]
    C -->|Pass| D{Business Rules}
    C -->|Fail| E
    D -->|Pass| F[Accepted Data]
    D -->|Fail| G[Quarantine]

    style F fill:#9f9,stroke:#333,stroke-width:2px
    style E fill:#f99,stroke:#333,stroke-width:2px
    style G fill:#ff9,stroke:#333,stroke-width:2px

Check Types

Validation Hierarchy

Structural:

  • Schema compliance
  • Data types
  • Required fields
  • Value formats

Content:

  • Value ranges
  • Patterns and regex
  • Uniqueness
  • Referential integrity

Statistical:

  • Distribution
  • Outliers and anomalies
  • Temporal patterns
  • Correlations

Implementation

Declarative with Great Expectations:

# Define expectations for dataset
expectations = {
    "expect_table_row_count_to_be_between": {
        "min_value": 1000,
        "max_value": 1000000
    },
    "expect_column_values_to_not_be_null": {
        "column": "customer_id"
    },
    "expect_column_values_to_be_between": {
        "column": "age",
        "min_value": 0,
        "max_value": 120
    },
    "expect_column_values_to_match_regex": {
        "column": "email",
        "regex": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    },
    "expect_column_pair_values_to_be_equal": {
        "column_A": "debit_amount",
        "column_B": "credit_amount"
    }
}

SQL-based with dbt tests:

-- tests/assert_positive_amounts.sql
SELECT *
FROM {{ ref('transactions') }}
WHERE amount < 0
  AND transaction_type = 'sale'

-- Test fails if query returns records

Handling Invalid Data

StrategyDescriptionWhen to Use
RejectDrop the recordCritical errors
QuarantineIsolate for manual reviewSuspicious data
DefaultReplace with default valueNon-critical gaps
ImputeStatistical recoveryAnalytical datasets
FlagMark and loadQuality monitoring
TransformAuto-correctKnown patterns

Schema Management

Schema management controls data structure and evolution. It keeps systems compatible as things change.

Schema Registry

Centralized schema storage with versioning:

Schema Registry Architecture

graph TB
    subgraph "Producers"
        P1[Service A]
        P2[Service B]
        P3[Service C]
    end

    subgraph "Schema Registry"
        SR[Registry Service]
        SV[Version Store]
        SC[Compatibility Checker]
    end

    subgraph "Consumers"
        C1[Analytics]
        C2[ML Pipeline]
        C3[Real-time App]
    end

    P1 -->|Register v1| SR
    P2 -->|Register v2| SR
    P3 -->|Check compatibility| SC

    SR --> SV
    SC --> SV

    C1 -->|Get schema| SR
    C2 -->|Get schema| SR
    C3 -->|Get schema| SR

Compatibility Strategies

New version reads old data:

// Version 1
{
  "user_id": 123,
  "name": "John Doe"
}

// Version 2 (added optional field)
{
  "user_id": 123,
  "name": "John Doe",
  "email": "[email protected]"  // optional
}

Allowed: - Add optional fields - Remove required fields

Old version reads new data:

// Version 1 (with extra field)
{
  "user_id": 123,
  "name": "John Doe",
  "phone": "+1234567890"
}

// Version 2 (removed field)
{
  "user_id": 123,
  "name": "John Doe"
}

Allowed: - Remove optional fields - Add required fields

Bidirectional:

  • Backward + Forward
  • Only add or remove optional fields
  • Strictest strategy
  • Any version works

Schema Evolution

Migrations:

-- Migration with versioning
-- Version 1.0
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(50) NOT NULL
);

-- Version 1.1: Add email
ALTER TABLE users 
ADD COLUMN email VARCHAR(100);

-- Version 1.2: Rename column
ALTER TABLE users 
RENAME COLUMN username TO user_name;

-- Version 2.0: Add partitioning
ALTER TABLE users 
PARTITION BY RANGE (created_date);

Automatic migration:

class SchemaMigration:
    def __init__(self, from_version, to_version):
        self.from_version = from_version
        self.to_version = to_version

    def migrate(self, data):
        migrations = {
            ('1.0', '1.1'): self.add_default_email,
            ('1.1', '1.2'): self.rename_username,
            ('1.2', '2.0'): self.add_partitioning
        }

        migration_func = migrations.get(
            (self.from_version, self.to_version)
        )

        if migration_func:
            return migration_func(data)
        else:
            raise ValueError(f"No migration path from {self.from_version} to {self.to_version}")

Putting It Together

Unified Governance Platform

graph TB
    subgraph "Data Sources"
        DS1[Databases]
        DS2[APIs]
        DS3[Files]
        DS4[Streams]
    end

    subgraph "Metadata Platform"
        CAT[Data Catalog]
        LIN[Lineage Engine]
        VAL[Validation Framework]
        SCH[Schema Registry]

        CAT <--> LIN
        CAT <--> SCH
        LIN <--> VAL
        SCH <--> VAL
    end

    subgraph "Capabilities"
        DIS[Discovery]
        GOV[Governance]
        QUA[Quality]
        COM[Compliance]
    end

    DS1 --> CAT
    DS2 --> CAT
    DS3 --> CAT
    DS4 --> SCH

    CAT --> DIS
    LIN --> GOV
    VAL --> QUA
    SCH --> COM

CI/CD Automation

DataOps Pipeline

Continuous integration for data:

  1. Commit schema changes

    git add schema/users_v2.0.sql
    git commit -m "Add email field to users table"
    git push
    

  2. Automatic checks

  3. Compatibility check
  4. Impact analysis via lineage
  5. Validation rules update
  6. Catalog metadata refresh

  7. Deployment

  8. Schema migration
  9. Downstream notifications
  10. Documentation update
  11. Monitoring activation

Performance Metrics

Governance KPIs:

MetricFormulaTargetWhy
Data Discovery TimeAverage time to find a dataset< 5 minAnalyst productivity
Lineage CoverageLineage-covered tables / All tables> 90%Transparency
Validation Pass RateSuccessful checks / All checks> 95%Data quality
Schema DriftUnplanned changes / month< 5System stability
Catalog AdoptionActive users / All users> 80%Platform usage

Tools and Platforms

Open Source

Apache Atlas: - Comprehensive metadata management - Hadoop ecosystem integration - REST API and Kafka hooks - Classification and tagging

OpenMetadata: - Modern architecture - 90+ connectors - Built-in lineage - Collaboration features

DataHub (LinkedIn): - Scalable platform - Real-time metadata - GraphQL API - Stream-based architecture

Amundsen (Lyft): - Focus on data discovery - Neo4j metadata storage - Airflow integration - Simple UI

Enterprise

PlatformStrengthsLimitations
CollibraGovernance workflows, broad functionalityComplex implementation, high cost
Informatica EDCAI-driven discovery, deep automationVendor lock-in, ecosystem dependent
AlationUser-friendly UI, collaborationLimited technical lineage
IBM Watson Knowledge CatalogAI capabilities, IBM stack integrationHard outside IBM environments
Microsoft PurviewAzure native, automatic lineageAzure lock-in

Choosing a Platform

Selection Criteria

Technical:

  • Data source support
  • Scalability and performance
  • API and integrations
  • Automation vs manual work

Organizational:

  • Governance maturity
  • Team size and expertise
  • Budget and TCO
  • Vendor strategy

Functional:

  • Discovery vs governance
  • Technical vs business users
  • Real-time vs batch
  • Cloud vs on-premise

Practical Implementation

Roadmap

Phase 1: Foundation (2-3 months)

  • Inventory data sources
  • Pick and deploy a platform
  • Pilot critical datasets
  • Basic validation rules
  • Train key users

Phase 2: Expansion (3-6 months)

  • Automated metadata collection
  • End-to-end lineage
  • Expanded validation rules
  • CI/CD integration
  • Onboard all teams

Phase 3: Maturity (6-12 months)

  • Full catalog coverage
  • Column-level lineage
  • Predictive data quality
  • Self-service governance
  • Continuous improvement

Best Practices

Organizational:

  • Executive sponsorship
  • Cross-functional governance committee
  • Clear ownership
  • Regular training
  • Success metrics

Technical:

  • Automation first
  • Incremental rollout
  • Version control everything
  • Monitoring and alerting
  • Regular audits

Cultural:

  • Data literacy programs
  • Gamification for adoption
  • Recognition for stewardship
  • Process transparency
  • Feedback loops

What's Next

Active metadata: - Real-time updates - Proactive recommendations - Automated actions - Context-aware assistance

AI-powered governance: - Auto-classification - Anomaly detection - Natural language queries - Predictive lineage

Federated architecture: - Decentralized governance - Data mesh principles - Domain ownership - Self-serve infrastructure

Modern Solutions

ChallengeOld WayNew Way
Data scaleManual catalogingML-driven discovery
Change velocityQuarterly reviewsReal-time governance
ComplexityCentralized teamFederated ownership
ComplianceReactive auditsContinuous monitoring
User adoptionMandatory trainingIntuitive UX + automation

About AI participation in writing articles

This article, like many others on our site, was created, written and proofread by a team of developers. Of course, not without the participation of AI assistants. We don't hide this and believe that modern systems are already quite good at handling simple tasks and, relatively speaking, writing an article about Viewport yourself is quite strange. It won't come out significantly better and will take a lot of time. But providing basic understanding to beginner webmasters is necessary. Of course, after the article is written by assistants - there's always proofreading, and this is where not one or two people participate, and only after that the article is published.

Ready to Build a Managed Data Ecosystem?

Try Statable free. Built-in lineage, automatic cataloging, validation, and schema management. Full transparency from source to insight.


Ready to take control of your web analytics? Try Statable free for 30 days — no credit card required, full feature access, GDPR-compliant by default. Start your free trial or view a live demo.