Skip to content

Data Quality Fundamentals

Bad data leads to bad decisions and lost money. Companies lose 15-25% of revenue to data quality issues. Wrong addresses, duplicate records, stale prices, mismatched accounting.

What Data Quality Means

Data quality is fitness for purpose. Data is quality if it accurately represents reality and supports the decisions you need to make.

Definition

Data quality is relative, not absolute. Data perfect for one task may fail another.

Why It Matters

Financial:

  • Marketing campaigns fail on bad customer data
  • Error correction costs add up
  • Regulatory fines
  • Lost revenue from wrong forecasts

Operational:

  • Slower processes
  • Longer decision cycles
  • Duplicated work
  • Lost trust between teams

Strategic:

  • Wrong strategies
  • Missed opportunities
  • Lost competitive edge
  • Reputation damage

Four Quality Dimensions

1. Freshness

How current is the data when you use it?

What Affects Freshness

  • Update frequency
  • Lag between event and data reflection
  • Match to business requirements
  • Impact of stale data on decisions

Measuring Freshness

graph TD
    A[Event Occurs] -->|Collection Lag| B[Data Collected]
    B -->|Processing Lag| C[Data Processed]
    C -->|Loading Lag| D[Data Available]
    D -->|Usage Lag| E[Data Used]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#9f9,stroke:#333,stroke-width:2px

Metrics:

MetricDescriptionFormulaTarget
Data AgeAge of dataCurrent Time - Record Creation Time< 24 hours
Update FrequencyUpdate rateNumber of Updates / Time PeriodDepends on SLA
LatencyData delayAvailability Time - Event Time< 1 hour
Currency RatePercent of current data(Current Records / All Records) × 100%> 95%

Freshness Strategies

Streaming:

  • Apache Kafka for event collection
  • Stream processing
  • Microservice architecture
  • Event-driven architecture

Use cases:

  • Financial transactions
  • System monitoring
  • Fraud detection

Micro-batching:

  • Process every 5-15 minutes
  • Balance speed and cost
  • Change Data Capture (CDC)
  • Incremental loads

Use cases:

  • Operational dashboards
  • Inventory management
  • Marketing campaigns

Periodic:

  • Nightly loads
  • ETL processes
  • Full refreshes
  • Scheduled jobs

Use cases:

  • Financial reporting
  • Analytics reports
  • Historical data

2. Completeness

All necessary records present, all required fields populated.

Three Levels

Completeness Levels

Schema:

  • All needed tables?
  • All required columns?
  • All relationships defined?

Record:

  • All needed records loaded?
  • Missing transactions?
  • All time periods covered?

Attribute:

  • Required fields filled?
  • Percent of NULLs?
  • Default values present?

Calculations

Attribute-level:

Completeness = (Non-null Values / Total Records) × 100%

Record-level:

Record Completeness = (Records with All Required Fields / All Records) × 100%

Example:

FieldRequiredPopulated% Complete
Customer IDYes10,000100%
EmailYes9,85098.5%
PhoneYes8,20082%
Middle NameNo3,50035%
Social MediaNo2,10021%

Improving Completeness

Prevention:

  • Validate at entry
  • Required fields in forms
  • Load-time checks
  • Reject incomplete records

Correction:

  • Enrich from external sources
  • Recover missing values
  • Use defaults
  • Statistical imputation

3. Accuracy

How well does data match reality?

Types

Format compliance:

  • Correct date formats
  • Valid data types
  • Pattern matching
  • Valid ranges

Example checks:

-- Email format check
WHERE email REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}$'

-- Age range check
WHERE age BETWEEN 0 AND 120

Match to reality:

  • Correct attribute values
  • Current information
  • Accurate calculations
  • Precise aggregations

Verification:

  • Cross-check with reference sources
  • Cross-validate between systems
  • Expert review
  • Statistical outlier analysis

Measuring Accuracy

Categorical metrics:

Accuracy = (Correct Values / All Values) × 100%
Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Numerical metrics:

MAE (Mean Absolute Error) = Σ|Actual - Expected| / n
RMSE (Root Mean Square Error) = √(Σ(Actual - Expected)² / n)
MAPE (Mean Absolute Percentage Error) = (Σ|Error / Actual| / n) × 100%

Sources of Inaccuracy

Common Causes

Human:

  • Entry errors
  • Misinterpretation
  • Typos

System:

  • ETL bugs
  • Wrong transformations
  • Encoding problems

Temporal:

  • Outdated reference data
  • Changed business rules
  • Unsynchronized updates

4. Consistency

No contradictions, within or across systems.

Types

Internal:

  • Within a record
  • Within a table
  • Between related tables

External:

  • Between systems
  • With business rules
  • With regulations

Temporal:

  • Across time
  • Logical event sequence
  • Chronology

Examples of Violations

TypeExampleImpact
FormatDate: "01.05.2024" vs "2024-05-01"Processing errors
SemanticStatus: "Active" in CRM, "Closed" in billingWrong decisions
ReferentialOrder without existing customerIntegrity violations
TemporalDelivery date before order dateLogical errors
ComputationalDetail sum ≠ total sumFinancial discrepancies

Ensuring Consistency

graph LR
    A[Source 1] --> D[Reconciliation Rules]
    B[Source 2] --> D
    C[Source 3] --> D
    D --> E[Master Data]
    E --> F[Consistent Data]

    style D fill:#ffd,stroke:#333,stroke-width:2px
    style E fill:#ddf,stroke:#333,stroke-width:2px

Strategies:

Prevent inconsistency:

  • Master data management (MDM)
  • Standardize formats
  • Validate input
  • Transactional integrity
  • Synchronous replication

Fix inconsistency:

  • Regular reconciliation
  • Auto-normalization
  • Reconciliation processes
  • Data cleansing
  • Source prioritization

How Dimensions Interact

The four dimensions affect each other:

Mutual Impact Matrix

DimensionFreshnessCompletenessAccuracyConsistency
Freshness.New data may be incompleteFast loading reduces accuracyAsync violates consistency
CompletenessWaiting reduces freshness.Imputation reduces accuracyDifferent completeness across systems
AccuracyChecking takes timeAccurate data is usually more complete.Corrections may break consistency
ConsistencySync takes timeReconciliation may delete dataUnification may reduce accuracy.

Practical Implementation

Quality Control Architecture

graph TB
    subgraph "Data Sources"
        S1[Database]
        S2[API]
        S3[Files]
        S4[Streaming]
    end

    subgraph "Data Quality Layer"
        P[Profiling]
        V[Validation]
        C[Cleansing]
        M[Monitoring]
    end

    subgraph "Storage"
        R[Raw Data]
        Q[Quality Metrics]
        Cl[Clean Data]
    end

    subgraph "Consumers"
        A[Analytics]
        Re[Reports]
        ML[Machine Learning]
    end

    S1 --> P
    S2 --> P
    S3 --> P
    S4 --> P

    P --> V
    V --> C
    C --> M

    P --> R
    V --> Q
    C --> Cl
    M --> Q

    Cl --> A
    Cl --> Re
    Cl --> ML

Automation

DQ Automation Framework

Level 1: Basic checks (daily)

  • NULL value checks
  • Data type validation
  • Duplicate checks
  • Range controls

Level 2: Business rules (weekly)

  • Cross-system checks
  • Business logic validation
  • Referential integrity
  • Trend and anomaly analysis

Level 3: Deep analysis (monthly)

  • Statistical distribution
  • ML for anomaly detection
  • Historical pattern comparison
  • Predictive quality analysis

Tools

Open source:

  • Great Expectations, validation framework
  • Apache Griffin, quality platform
  • Deequ, quality library for Spark
  • dbt tests, built-in tests in dbt

Commercial:

  • Informatica Data Quality
  • IBM InfoSphere QualityStage
  • Talend Data Quality
  • Collibra Data Quality

Cloud:

  • AWS Glue DataBrew
  • Google Cloud Data Quality
  • Azure Purview
  • Snowflake Data Quality

Metrics and KPIs

Composite Quality Index

DQI = (w₁ × Freshness + w₂ × Completeness + w₃ × Accuracy + w₄ × Consistency) / Σw

Where w is the importance weight for each dimension.

Example:

DimensionWeightScoreWeighted Score
Freshness0.385%25.5
Completeness0.292%18.4
Accuracy0.488%35.2
Consistency0.195%9.5
DQI1.0.88.6%

Dashboard

Key Dashboard Elements

Operational:

  • Current DQI by system
  • Critical errors count
  • Time since last update
  • Failed check percentage

Trends:

  • DQI over time
  • Top 10 problem tables
  • Error type distribution
  • Remediation progress

Details:

  • Drill-down by dimension
  • Specific issues list
  • Change history
  • Owners and SLAs

Roles and Process

Roles

Data Steward:

  • Set quality standards
  • Monitor metrics
  • Coordinate improvements
  • Escalate issues

Data Engineer:

  • Implement checks
  • Automate processes
  • Fix technical issues
  • Optimize pipelines

Business Analyst:

  • Define business rules
  • Validate results
  • Prioritize fixes
  • Communicate with business

Data Consumer:

  • Report issues
  • Quality feedback
  • Help define requirements
  • Validate fixes

Process

graph TD
    A[Define Requirements] --> B[Profile Data]
    B --> C{Meets Requirements?}
    C -->|Yes| D[Monitor]
    C -->|No| E[Analyze Issues]
    E --> F[Remediation Plan]
    F --> G[Implementation]
    G --> H[Validation]
    H --> C
    D --> I[Reporting]
    I --> J[Continuous Improvement]
    J --> A

Economics

ROI from Quality

Direct benefits:

  • 15-20% lower operational costs
  • 30% faster decisions
  • 40-50% fewer reporting errors
  • 20-25% better marketing effectiveness

Indirect benefits:

  • Higher data trust
  • Better reputation
  • Faster digital transformation
  • Lower regulatory risk

Cost of Low Quality

CoLQ = Cost of Correction + Cost of Rework + Lost Opportunity + Reputation Loss

Typical costs:

Category% of RevenueExamples
Operational losses5-8%Duplicated work, manual checks
Lost opportunities3-5%Wrong decisions, lost customers
Compliance risks2-4%Fines, audits, legal
IT costs1-3%Data fixes, system rework

Best Practices

Implementation Checklist

Phase 1: Assessment (1-2 months)

  • Audit current data state
  • Identify critical assets
  • Measure baseline metrics
  • Identify main issues
  • Assess business impact

Phase 2: Planning (2-3 weeks)

  • Set target metrics
  • Develop strategy
  • Assign owners
  • Set priorities
  • Build roadmap

Phase 3: Implementation (3-6 months)

  • Deploy automated checks
  • Set up monitoring and alerts
  • Build remediation processes
  • Train the team
  • Run pilots

Phase 4: Optimization (ongoing)

  • Analyze results
  • Adjust processes
  • Scale to new areas
  • Apply advanced techniques
  • Build quality culture

Common Mistakes

Top 5 DQ Mistakes

1. Solving everything at once

  • Mistake: Cover all systems and data simultaneously
  • Fix: Start with critical data, expand gradually

2. Technology only

  • Mistake: Buy expensive tools, skip processes
  • Fix: 70% process, 30% technology

3. Ignoring business

  • Mistake: IT-driven without business
  • Fix: Close collaboration with business users

4. No measurable goals

  • Mistake: Improve quality "in general"
  • Fix: Specific KPIs with baselines and targets

5. Underestimating change

  • Mistake: Quality as one-time project
  • Fix: Quality as continuous process and culture

What's Next

AI/ML automation:

  • Auto anomaly detection
  • Predictive quality
  • Self-healing pipelines
  • Intelligent imputation

Real-time quality:

  • Streaming checks
  • Proactive prevention
  • Continuous profiling
  • Instant feedback

DataOps:

  • Quality as Code
  • CI/CD for data
  • Automated testing
  • Data version control

New Technologies

Blockchain:

  • Immutable audit trail
  • Decentralized validation
  • Smart contracts for quality
  • Trusted sources

Federated Learning:

  • Quality checks without centralization
  • Privacy-preserving validation
  • Distributed metrics
  • Collaborative improvement

About AI participation in writing articles

This article, like many others on our site, was created, written and proofread by a team of developers. Of course, not without the participation of AI assistants. We don't hide this and believe that modern systems are already quite good at handling simple tasks and, relatively speaking, writing an article about Viewport yourself is quite strange. It won't come out significantly better and will take a lot of time. But providing basic understanding to beginner webmasters is necessary. Of course, after the article is written by assistants - there's always proofreading, and this is where not one or two people participate, and only after that the article is published.

Ready to Improve Your Data Quality?

Try Statable free. Built-in quality control, automated validation, monitoring across all four dimensions, and intelligent recommendations.


Ready to take control of your web analytics? Try Statable free for 30 days — no credit card required, full feature access, GDPR-compliant by default. Start your free trial or view a live demo.