Data Quality Fundamentals

Bad data leads to bad decisions and lost money. Companies lose 15-25% of revenue to data quality issues. Wrong addresses, duplicate records, stale prices, mismatched accounting.

What Data Quality Means

Data quality is fitness for purpose. Data is quality if it accurately represents reality and supports the decisions you need to make.

Definition

Data quality is relative, not absolute. Data perfect for one task may fail another.

Why It Matters

Financial:

Marketing campaigns fail on bad customer data
Error correction costs add up
Regulatory fines
Lost revenue from wrong forecasts

Operational:

Slower processes
Longer decision cycles
Duplicated work
Lost trust between teams

Strategic:

Wrong strategies
Missed opportunities
Lost competitive edge
Reputation damage

Four Quality Dimensions

1. Freshness

How current is the data when you use it?

What Affects Freshness

Update frequency
Lag between event and data reflection
Match to business requirements
Impact of stale data on decisions

Measuring Freshness

graph TD
    A[Event Occurs] -->|Collection Lag| B[Data Collected]
    B -->|Processing Lag| C[Data Processed]
    C -->|Loading Lag| D[Data Available]
    D -->|Usage Lag| E[Data Used]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#9f9,stroke:#333,stroke-width:2px

Metrics:

Metric	Description	Formula	Target
Data Age	Age of data	`Current Time - Record Creation Time`	< 24 hours
Update Frequency	Update rate	`Number of Updates / Time Period`	Depends on SLA
Latency	Data delay	`Availability Time - Event Time`	< 1 hour
Currency Rate	Percent of current data	`(Current Records / All Records) × 100%`	> 95%

Freshness Strategies

Real-time ProcessingNear Real-timeBatch Processing

Streaming:

Apache Kafka for event collection
Stream processing
Microservice architecture
Event-driven architecture

Use cases:

Financial transactions
System monitoring
Fraud detection

Micro-batching:

Process every 5-15 minutes
Balance speed and cost
Change Data Capture (CDC)
Incremental loads

Use cases:

Operational dashboards
Inventory management
Marketing campaigns

Periodic:

Nightly loads
ETL processes
Full refreshes
Scheduled jobs

Use cases:

Financial reporting
Analytics reports
Historical data

2. Completeness

All necessary records present, all required fields populated.

Three Levels

Completeness Levels

Schema:

All needed tables?
All required columns?
All relationships defined?

Record:

All needed records loaded?
Missing transactions?
All time periods covered?

Attribute:

Required fields filled?
Percent of NULLs?
Default values present?

Calculations

Attribute-level:

Completeness = (Non-null Values / Total Records) × 100%

Record-level:

Record Completeness = (Records with All Required Fields / All Records) × 100%

Example:

Field	Required	Populated	% Complete
Customer ID	Yes	10,000	100%
Email	Yes	9,850	98.5%
Phone	Yes	8,200	82%
Middle Name	No	3,500	35%
Social Media	No	2,100	21%

Improving Completeness

Prevention:

Validate at entry
Required fields in forms
Load-time checks
Reject incomplete records

Correction:

Enrich from external sources
Recover missing values
Use defaults
Statistical imputation

3. Accuracy

How well does data match reality?

Types

Syntactic AccuracySemantic Accuracy

Format compliance:

Correct date formats
Valid data types
Pattern matching
Valid ranges

Example checks:

-- Email format check
WHERE email REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}$'

-- Age range check
WHERE age BETWEEN 0 AND 120

Match to reality:

Correct attribute values
Current information
Accurate calculations
Precise aggregations

Verification:

Cross-check with reference sources
Cross-validate between systems
Expert review
Statistical outlier analysis

Measuring Accuracy

Categorical metrics:

Accuracy = (Correct Values / All Values) × 100%
Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Numerical metrics:

MAE (Mean Absolute Error) = Σ|Actual - Expected| / n
RMSE (Root Mean Square Error) = √(Σ(Actual - Expected)² / n)
MAPE (Mean Absolute Percentage Error) = (Σ|Error / Actual| / n) × 100%

Sources of Inaccuracy

Common Causes

Human:

Entry errors
Misinterpretation
Typos

System:

ETL bugs
Wrong transformations
Encoding problems

Temporal:

Outdated reference data
Changed business rules
Unsynchronized updates

4. Consistency

No contradictions, within or across systems.

Types

Internal:

Within a record
Within a table
Between related tables

External:

Between systems
With business rules
With regulations

Temporal:

Across time
Logical event sequence
Chronology

Examples of Violations

Type	Example	Impact
Format	Date: "01.05.2024" vs "2024-05-01"	Processing errors
Semantic	Status: "Active" in CRM, "Closed" in billing	Wrong decisions
Referential	Order without existing customer	Integrity violations
Temporal	Delivery date before order date	Logical errors
Computational	Detail sum ≠ total sum	Financial discrepancies

Ensuring Consistency

graph LR
    A[Source 1] --> D[Reconciliation Rules]
    B[Source 2] --> D
    C[Source 3] --> D
    D --> E[Master Data]
    E --> F[Consistent Data]

    style D fill:#ffd,stroke:#333,stroke-width:2px
    style E fill:#ddf,stroke:#333,stroke-width:2px

Strategies:

ProactiveReactive

Prevent inconsistency:

Master data management (MDM)
Standardize formats
Validate input
Transactional integrity
Synchronous replication

Fix inconsistency:

Regular reconciliation
Auto-normalization
Reconciliation processes
Data cleansing
Source prioritization

How Dimensions Interact

The four dimensions affect each other:

Mutual Impact Matrix

Dimension	Freshness	Completeness	Accuracy	Consistency
Freshness	.	New data may be incomplete	Fast loading reduces accuracy	Async violates consistency
Completeness	Waiting reduces freshness	.	Imputation reduces accuracy	Different completeness across systems
Accuracy	Checking takes time	Accurate data is usually more complete	.	Corrections may break consistency
Consistency	Sync takes time	Reconciliation may delete data	Unification may reduce accuracy	.

Practical Implementation

Quality Control Architecture

graph TB
    subgraph "Data Sources"
        S1[Database]
        S2[API]
        S3[Files]
        S4[Streaming]
    end

    subgraph "Data Quality Layer"
        P[Profiling]
        V[Validation]
        C[Cleansing]
        M[Monitoring]
    end

    subgraph "Storage"
        R[Raw Data]
        Q[Quality Metrics]
        Cl[Clean Data]
    end

    subgraph "Consumers"
        A[Analytics]
        Re[Reports]
        ML[Machine Learning]
    end

    S1 --> P
    S2 --> P
    S3 --> P
    S4 --> P

    P --> V
    V --> C
    C --> M

    P --> R
    V --> Q
    C --> Cl
    M --> Q

    Cl --> A
    Cl --> Re
    Cl --> ML

Automation

DQ Automation Framework

Level 1: Basic checks (daily)

NULL value checks
Data type validation
Duplicate checks
Range controls

Level 2: Business rules (weekly)

Cross-system checks
Business logic validation
Referential integrity
Trend and anomaly analysis

Level 3: Deep analysis (monthly)

Statistical distribution
ML for anomaly detection
Historical pattern comparison
Predictive quality analysis

Tools

Open source:

Great Expectations, validation framework
Apache Griffin, quality platform
Deequ, quality library for Spark
dbt tests, built-in tests in dbt

Commercial:

Informatica Data Quality
IBM InfoSphere QualityStage
Talend Data Quality
Collibra Data Quality

Cloud:

AWS Glue DataBrew
Google Cloud Data Quality
Azure Purview
Snowflake Data Quality

Metrics and KPIs

Composite Quality Index

DQI = (w₁ × Freshness + w₂ × Completeness + w₃ × Accuracy + w₄ × Consistency) / Σw

Where w is the importance weight for each dimension.

Example:

Dimension	Weight	Score	Weighted Score
Freshness	0.3	85%	25.5
Completeness	0.2	92%	18.4
Accuracy	0.4	88%	35.2
Consistency	0.1	95%	9.5
DQI	1.0	.	88.6%

Dashboard

Key Dashboard Elements

Operational:

Current DQI by system
Critical errors count
Time since last update
Failed check percentage

Trends:

DQI over time
Top 10 problem tables
Error type distribution
Remediation progress

Details:

Drill-down by dimension
Specific issues list
Change history
Owners and SLAs

Roles and Process

Roles

Data Steward:

Set quality standards
Monitor metrics
Coordinate improvements
Escalate issues

Data Engineer:

Implement checks
Automate processes
Fix technical issues
Optimize pipelines

Business Analyst:

Define business rules
Validate results
Prioritize fixes
Communicate with business

Data Consumer:

Report issues
Quality feedback
Help define requirements
Validate fixes

Process

graph TD
    A[Define Requirements] --> B[Profile Data]
    B --> C{Meets Requirements?}
    C -->|Yes| D[Monitor]
    C -->|No| E[Analyze Issues]
    E --> F[Remediation Plan]
    F --> G[Implementation]
    G --> H[Validation]
    H --> C
    D --> I[Reporting]
    I --> J[Continuous Improvement]
    J --> A

Economics

ROI from Quality

Direct benefits:

15-20% lower operational costs
30% faster decisions
40-50% fewer reporting errors
20-25% better marketing effectiveness

Indirect benefits:

Higher data trust
Better reputation
Faster digital transformation
Lower regulatory risk

Cost of Low Quality

CoLQ = Cost of Correction + Cost of Rework + Lost Opportunity + Reputation Loss

Typical costs:

Category	% of Revenue	Examples
Operational losses	5-8%	Duplicated work, manual checks
Lost opportunities	3-5%	Wrong decisions, lost customers
Compliance risks	2-4%	Fines, audits, legal
IT costs	1-3%	Data fixes, system rework

Best Practices

Implementation Checklist

Phase 1: Assessment (1-2 months)

Phase 2: Planning (2-3 weeks)

Phase 3: Implementation (3-6 months)

Phase 4: Optimization (ongoing)

Common Mistakes

Top 5 DQ Mistakes

1. Solving everything at once

Mistake: Cover all systems and data simultaneously
Fix: Start with critical data, expand gradually

2. Technology only

Mistake: Buy expensive tools, skip processes
Fix: 70% process, 30% technology

3. Ignoring business

Mistake: IT-driven without business
Fix: Close collaboration with business users

4. No measurable goals

Mistake: Improve quality "in general"
Fix: Specific KPIs with baselines and targets

5. Underestimating change

Mistake: Quality as one-time project
Fix: Quality as continuous process and culture

What's Next

Trends

AI/ML automation:

Auto anomaly detection
Predictive quality
Self-healing pipelines
Intelligent imputation

Real-time quality:

Streaming checks
Proactive prevention
Continuous profiling
Instant feedback

DataOps:

Quality as Code
CI/CD for data
Automated testing
Data version control

New Technologies

Blockchain:

Immutable audit trail
Decentralized validation
Smart contracts for quality
Trusted sources

Federated Learning:

Quality checks without centralization
Privacy-preserving validation
Distributed metrics
Collaborative improvement

About AI participation in writing articles

This article, like many others on our site, was created, written and proofread by a team of developers. Of course, not without the participation of AI assistants. We don't hide this and believe that modern systems are already quite good at handling simple tasks and, relatively speaking, writing an article about Viewport yourself is quite strange. It won't come out significantly better and will take a lot of time. But providing basic understanding to beginner webmasters is necessary. Of course, after the article is written by assistants - there's always proofreading, and this is where not one or two people participate, and only after that the article is published.

Ready to Improve Your Data Quality?

Try Statable free. Built-in quality control, automated validation, monitoring across all four dimensions, and intelligent recommendations.

Ready to take control of your web analytics? Try Statable free for 30 days — no credit card required, full feature access, GDPR-compliant by default. Start your free trial or view a live demo.