Data Lineage, Catalog, Validation, and Schema
Modern organizations manage petabytes across hundreds of systems. Without the right tools this becomes chaos. Nobody knows where data came from, how it changed, or whether it's trustworthy. Lineage, catalogs, validation, and schema management create a transparent, manageable ecosystem.
Data Lineage
Lineage tracks the full data lifecycle. Source to final use. Every transformation, every move between systems.
Levels of Granularity
Three Lineage Levels
System-level
- Movement between systems
- ETL/ELT processes
- Sources and sinks
- Pipeline dependencies
Table-level
- Relationships between tables
- JOIN operations and aggregations
- Schema versioning
- Change history
Column-level
- Field transformations
- Business logic
- Attribute mapping
- Formulas and calculations
Visualization
graph TB
subgraph "Source Systems"
S1[CRM Database]
S2[ERP System]
S3[Web Analytics]
end
subgraph "Data Lake"
R1[Raw Zone]
R2[Cleaned Zone]
R3[Curated Zone]
end
subgraph "Data Warehouse"
D1[Staging]
D2[Core DWH]
D3[Data Marts]
end
subgraph "Analytics"
A1[BI Reports]
A2[ML Models]
A3[Dashboards]
end
S1 --> R1
S2 --> R1
S3 --> R1
R1 --> R2
R2 --> R3
R3 --> D1
D1 --> D2
D2 --> D3
D3 --> A1
D3 --> A2
D3 --> A3
style S1 fill:#f9f,stroke:#333,stroke-width:2px
style A3 fill:#9f9,stroke:#333,stroke-width:2pxCollection Methods
SQL parsing:
-- Analyzing queries to extract dependencies
CREATE TABLE sales_summary AS
SELECT
c.customer_id,
c.customer_name,
SUM(o.amount) as total_sales
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name
Result: - Sources: customers, orders - Target: sales_summary - Column mappings: - customers.customer_id → sales_summary.customer_id - orders.amount → sales_summary.total_sales
Execution interception:
- Query log monitoring
- Execution plan analysis
- Data flow tracing
- Real-time metadata capture
Pros:
- 100% accuracy
- Covers dynamic queries
- Actual data paths
- Performance metrics
Explicit definitions:
lineage:
source:
- table: raw.customer_events
columns: [user_id, event_type, timestamp]
transformation:
- type: filter
condition: "event_type = 'purchase'"
- type: aggregate
group_by: [user_id]
metrics: [count(*) as purchase_count]
target:
table: analytics.user_purchases
columns: [user_id, purchase_count]
Use Cases
Impact analysis. What breaks when sources change?
| Scenario | Question | Lineage Answer |
|---|---|---|
| Schema change | What breaks if we drop a column? | All dependent objects |
| System migration | Which reports use this DB? | Graph of consumers |
| Incident | Which data is affected? | Problem propagation path |
| Compliance | Where is PII stored? | Full PII map |
Root cause analysis. Trace issues back to their source.
Data Catalog
A catalog is a centralized metadata repository. It powers search, understanding, and governance.
Architecture
Core Components
Metadata store:
- Technical (schemas, types, statistics)
- Business (descriptions, owners, tags)
- Operational (usage, quality, SLA)
- Social (ratings, comments, FAQ)
Discovery layer:
- Full-text search
- Faceted navigation
- Recommendations
- Semantic search
Governance layer:
- Classification
- Access policies
- Retention rules
- Compliance tracking
Automated Population
Metadata harvesting:
# Example of automatic metadata collection
def harvest_metadata(connection):
metadata = {
'tables': [],
'columns': [],
'statistics': {}
}
# Extract schema
tables = connection.get_tables()
for table in tables:
metadata['tables'].append({
'name': table.name,
'schema': table.schema,
'row_count': table.count(),
'size_mb': table.size_mb(),
'last_modified': table.last_modified
})
# Profile columns
for column in table.columns:
metadata['columns'].append({
'table': table.name,
'column': column.name,
'data_type': column.type,
'nullable': column.nullable,
'unique_values': column.distinct_count(),
'null_percentage': column.null_ratio()
})
return metadata
Business Glossary
Map technical terms to business meaning:
| Technical Term | Business Term | Definition | Owner |
|---|---|---|---|
cust_id | Customer ID | Unique customer identifier | Sales Dept |
arr_value | Annual Recurring Revenue | Yearly recurring revenue | Finance |
churn_flag | Customer Churn | Customer attrition indicator | Marketing |
ltv_score | Lifetime Value | Predicted customer value | Analytics |
Modern Capabilities
Search features:
- Natural language queries
- Metadata filters
- Lineage search
- Similarity search
Example:
Social features:
- Comments and discussions
- Q&A
- Quality ratings
- Change subscriptions
- Knowledge sharing
Crowdsourced curation:
- User tags
- Business descriptions
- Usage examples
- Best practices
AI/ML capabilities:
- Auto-classification
- PII detection
- Similar dataset suggestions
- Metadata anomalies
- Usage prediction
Data Validation
Validation ensures data meets the rules before use.
Validation Levels
graph TD
A[Incoming Data] --> B{Syntactic Validation}
B -->|Pass| C{Semantic Validation}
B -->|Fail| E[Rejection]
C -->|Pass| D{Business Rules}
C -->|Fail| E
D -->|Pass| F[Accepted Data]
D -->|Fail| G[Quarantine]
style F fill:#9f9,stroke:#333,stroke-width:2px
style E fill:#f99,stroke:#333,stroke-width:2px
style G fill:#ff9,stroke:#333,stroke-width:2pxCheck Types
Validation Hierarchy
Structural:
- Schema compliance
- Data types
- Required fields
- Value formats
Content:
- Value ranges
- Patterns and regex
- Uniqueness
- Referential integrity
Statistical:
- Distribution
- Outliers and anomalies
- Temporal patterns
- Correlations
Implementation
Declarative with Great Expectations:
# Define expectations for dataset
expectations = {
"expect_table_row_count_to_be_between": {
"min_value": 1000,
"max_value": 1000000
},
"expect_column_values_to_not_be_null": {
"column": "customer_id"
},
"expect_column_values_to_be_between": {
"column": "age",
"min_value": 0,
"max_value": 120
},
"expect_column_values_to_match_regex": {
"column": "email",
"regex": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
},
"expect_column_pair_values_to_be_equal": {
"column_A": "debit_amount",
"column_B": "credit_amount"
}
}
SQL-based with dbt tests:
-- tests/assert_positive_amounts.sql
SELECT *
FROM {{ ref('transactions') }}
WHERE amount < 0
AND transaction_type = 'sale'
-- Test fails if query returns records
Handling Invalid Data
| Strategy | Description | When to Use |
|---|---|---|
| Reject | Drop the record | Critical errors |
| Quarantine | Isolate for manual review | Suspicious data |
| Default | Replace with default value | Non-critical gaps |
| Impute | Statistical recovery | Analytical datasets |
| Flag | Mark and load | Quality monitoring |
| Transform | Auto-correct | Known patterns |
Schema Management
Schema management controls data structure and evolution. It keeps systems compatible as things change.
Schema Registry
Centralized schema storage with versioning:
Schema Registry Architecture
graph TB
subgraph "Producers"
P1[Service A]
P2[Service B]
P3[Service C]
end
subgraph "Schema Registry"
SR[Registry Service]
SV[Version Store]
SC[Compatibility Checker]
end
subgraph "Consumers"
C1[Analytics]
C2[ML Pipeline]
C3[Real-time App]
end
P1 -->|Register v1| SR
P2 -->|Register v2| SR
P3 -->|Check compatibility| SC
SR --> SV
SC --> SV
C1 -->|Get schema| SR
C2 -->|Get schema| SR
C3 -->|Get schema| SRCompatibility Strategies
New version reads old data:
// Version 1
{
"user_id": 123,
"name": "John Doe"
}
// Version 2 (added optional field)
{
"user_id": 123,
"name": "John Doe",
"email": "[email protected]" // optional
}
Allowed: - Add optional fields - Remove required fields
Old version reads new data:
// Version 1 (with extra field)
{
"user_id": 123,
"name": "John Doe",
"phone": "+1234567890"
}
// Version 2 (removed field)
{
"user_id": 123,
"name": "John Doe"
}
Allowed: - Remove optional fields - Add required fields
Bidirectional:
- Backward + Forward
- Only add or remove optional fields
- Strictest strategy
- Any version works
Schema Evolution
Migrations:
-- Migration with versioning
-- Version 1.0
CREATE TABLE users (
user_id INT PRIMARY KEY,
username VARCHAR(50) NOT NULL
);
-- Version 1.1: Add email
ALTER TABLE users
ADD COLUMN email VARCHAR(100);
-- Version 1.2: Rename column
ALTER TABLE users
RENAME COLUMN username TO user_name;
-- Version 2.0: Add partitioning
ALTER TABLE users
PARTITION BY RANGE (created_date);
Automatic migration:
class SchemaMigration:
def __init__(self, from_version, to_version):
self.from_version = from_version
self.to_version = to_version
def migrate(self, data):
migrations = {
('1.0', '1.1'): self.add_default_email,
('1.1', '1.2'): self.rename_username,
('1.2', '2.0'): self.add_partitioning
}
migration_func = migrations.get(
(self.from_version, self.to_version)
)
if migration_func:
return migration_func(data)
else:
raise ValueError(f"No migration path from {self.from_version} to {self.to_version}")
Putting It Together
Unified Governance Platform
graph TB
subgraph "Data Sources"
DS1[Databases]
DS2[APIs]
DS3[Files]
DS4[Streams]
end
subgraph "Metadata Platform"
CAT[Data Catalog]
LIN[Lineage Engine]
VAL[Validation Framework]
SCH[Schema Registry]
CAT <--> LIN
CAT <--> SCH
LIN <--> VAL
SCH <--> VAL
end
subgraph "Capabilities"
DIS[Discovery]
GOV[Governance]
QUA[Quality]
COM[Compliance]
end
DS1 --> CAT
DS2 --> CAT
DS3 --> CAT
DS4 --> SCH
CAT --> DIS
LIN --> GOV
VAL --> QUA
SCH --> COMCI/CD Automation
DataOps Pipeline
Continuous integration for data:
Commit schema changes
Automatic checks
- Compatibility check
- Impact analysis via lineage
- Validation rules update
Catalog metadata refresh
Deployment
- Schema migration
- Downstream notifications
- Documentation update
- Monitoring activation
Performance Metrics
Governance KPIs:
| Metric | Formula | Target | Why |
|---|---|---|---|
| Data Discovery Time | Average time to find a dataset | < 5 min | Analyst productivity |
| Lineage Coverage | Lineage-covered tables / All tables | > 90% | Transparency |
| Validation Pass Rate | Successful checks / All checks | > 95% | Data quality |
| Schema Drift | Unplanned changes / month | < 5 | System stability |
| Catalog Adoption | Active users / All users | > 80% | Platform usage |
Tools and Platforms
Open Source
Apache Atlas: - Comprehensive metadata management - Hadoop ecosystem integration - REST API and Kafka hooks - Classification and tagging
OpenMetadata: - Modern architecture - 90+ connectors - Built-in lineage - Collaboration features
DataHub (LinkedIn): - Scalable platform - Real-time metadata - GraphQL API - Stream-based architecture
Amundsen (Lyft): - Focus on data discovery - Neo4j metadata storage - Airflow integration - Simple UI
Enterprise
| Platform | Strengths | Limitations |
|---|---|---|
| Collibra | Governance workflows, broad functionality | Complex implementation, high cost |
| Informatica EDC | AI-driven discovery, deep automation | Vendor lock-in, ecosystem dependent |
| Alation | User-friendly UI, collaboration | Limited technical lineage |
| IBM Watson Knowledge Catalog | AI capabilities, IBM stack integration | Hard outside IBM environments |
| Microsoft Purview | Azure native, automatic lineage | Azure lock-in |
Choosing a Platform
Selection Criteria
Technical:
- Data source support
- Scalability and performance
- API and integrations
- Automation vs manual work
Organizational:
- Governance maturity
- Team size and expertise
- Budget and TCO
- Vendor strategy
Functional:
- Discovery vs governance
- Technical vs business users
- Real-time vs batch
- Cloud vs on-premise
Practical Implementation
Roadmap
Phase 1: Foundation (2-3 months)
- Inventory data sources
- Pick and deploy a platform
- Pilot critical datasets
- Basic validation rules
- Train key users
Phase 2: Expansion (3-6 months)
- Automated metadata collection
- End-to-end lineage
- Expanded validation rules
- CI/CD integration
- Onboard all teams
Phase 3: Maturity (6-12 months)
- Full catalog coverage
- Column-level lineage
- Predictive data quality
- Self-service governance
- Continuous improvement
Best Practices
Organizational:
- Executive sponsorship
- Cross-functional governance committee
- Clear ownership
- Regular training
- Success metrics
Technical:
- Automation first
- Incremental rollout
- Version control everything
- Monitoring and alerting
- Regular audits
Cultural:
- Data literacy programs
- Gamification for adoption
- Recognition for stewardship
- Process transparency
- Feedback loops
What's Next
Trends
Active metadata: - Real-time updates - Proactive recommendations - Automated actions - Context-aware assistance
AI-powered governance: - Auto-classification - Anomaly detection - Natural language queries - Predictive lineage
Federated architecture: - Decentralized governance - Data mesh principles - Domain ownership - Self-serve infrastructure
Modern Solutions
| Challenge | Old Way | New Way |
|---|---|---|
| Data scale | Manual cataloging | ML-driven discovery |
| Change velocity | Quarterly reviews | Real-time governance |
| Complexity | Centralized team | Federated ownership |
| Compliance | Reactive audits | Continuous monitoring |
| User adoption | Mandatory training | Intuitive UX + automation |
About AI participation in writing articles
This article, like many others on our site, was created, written and proofread by a team of developers. Of course, not without the participation of AI assistants. We don't hide this and believe that modern systems are already quite good at handling simple tasks and, relatively speaking, writing an article about Viewport yourself is quite strange. It won't come out significantly better and will take a lot of time. But providing basic understanding to beginner webmasters is necessary. Of course, after the article is written by assistants - there's always proofreading, and this is where not one or two people participate, and only after that the article is published.
Ready to Build a Managed Data Ecosystem?
Try Statable free. Built-in lineage, automatic cataloging, validation, and schema management. Full transparency from source to insight.
Ready to take control of your web analytics? Try Statable free for 30 days — no credit card required, full feature access, GDPR-compliant by default. Start your free trial or view a live demo.