Data Quality Fundamentals
Bad data leads to bad decisions and lost money. Companies lose 15-25% of revenue to data quality issues. Wrong addresses, duplicate records, stale prices, mismatched accounting.
What Data Quality Means
Data quality is fitness for purpose. Data is quality if it accurately represents reality and supports the decisions you need to make.
Definition
Data quality is relative, not absolute. Data perfect for one task may fail another.
Why It Matters
Financial:
- Marketing campaigns fail on bad customer data
- Error correction costs add up
- Regulatory fines
- Lost revenue from wrong forecasts
Operational:
- Slower processes
- Longer decision cycles
- Duplicated work
- Lost trust between teams
Strategic:
- Wrong strategies
- Missed opportunities
- Lost competitive edge
- Reputation damage
Four Quality Dimensions
1. Freshness
How current is the data when you use it?
What Affects Freshness
- Update frequency
- Lag between event and data reflection
- Match to business requirements
- Impact of stale data on decisions
Measuring Freshness
graph TD
A[Event Occurs] -->|Collection Lag| B[Data Collected]
B -->|Processing Lag| C[Data Processed]
C -->|Loading Lag| D[Data Available]
D -->|Usage Lag| E[Data Used]
style A fill:#f9f,stroke:#333,stroke-width:2px
style E fill:#9f9,stroke:#333,stroke-width:2pxMetrics:
| Metric | Description | Formula | Target |
|---|---|---|---|
| Data Age | Age of data | Current Time - Record Creation Time | < 24 hours |
| Update Frequency | Update rate | Number of Updates / Time Period | Depends on SLA |
| Latency | Data delay | Availability Time - Event Time | < 1 hour |
| Currency Rate | Percent of current data | (Current Records / All Records) × 100% | > 95% |
Freshness Strategies
Streaming:
- Apache Kafka for event collection
- Stream processing
- Microservice architecture
- Event-driven architecture
Use cases:
- Financial transactions
- System monitoring
- Fraud detection
Micro-batching:
- Process every 5-15 minutes
- Balance speed and cost
- Change Data Capture (CDC)
- Incremental loads
Use cases:
- Operational dashboards
- Inventory management
- Marketing campaigns
Periodic:
- Nightly loads
- ETL processes
- Full refreshes
- Scheduled jobs
Use cases:
- Financial reporting
- Analytics reports
- Historical data
2. Completeness
All necessary records present, all required fields populated.
Three Levels
Completeness Levels
Schema:
- All needed tables?
- All required columns?
- All relationships defined?
Record:
- All needed records loaded?
- Missing transactions?
- All time periods covered?
Attribute:
- Required fields filled?
- Percent of NULLs?
- Default values present?
Calculations
Attribute-level:
Record-level:
Example:
| Field | Required | Populated | % Complete |
|---|---|---|---|
| Customer ID | Yes | 10,000 | 100% |
| Yes | 9,850 | 98.5% | |
| Phone | Yes | 8,200 | 82% |
| Middle Name | No | 3,500 | 35% |
| Social Media | No | 2,100 | 21% |
Improving Completeness
Prevention:
- Validate at entry
- Required fields in forms
- Load-time checks
- Reject incomplete records
Correction:
- Enrich from external sources
- Recover missing values
- Use defaults
- Statistical imputation
3. Accuracy
How well does data match reality?
Types
Format compliance:
- Correct date formats
- Valid data types
- Pattern matching
- Valid ranges
Example checks:
Match to reality:
- Correct attribute values
- Current information
- Accurate calculations
- Precise aggregations
Verification:
- Cross-check with reference sources
- Cross-validate between systems
- Expert review
- Statistical outlier analysis
Measuring Accuracy
Categorical metrics:
Accuracy = (Correct Values / All Values) × 100%
Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Numerical metrics:
MAE (Mean Absolute Error) = Σ|Actual - Expected| / n
RMSE (Root Mean Square Error) = √(Σ(Actual - Expected)² / n)
MAPE (Mean Absolute Percentage Error) = (Σ|Error / Actual| / n) × 100%
Sources of Inaccuracy
Common Causes
Human:
- Entry errors
- Misinterpretation
- Typos
System:
- ETL bugs
- Wrong transformations
- Encoding problems
Temporal:
- Outdated reference data
- Changed business rules
- Unsynchronized updates
4. Consistency
No contradictions, within or across systems.
Types
Internal:
- Within a record
- Within a table
- Between related tables
External:
- Between systems
- With business rules
- With regulations
Temporal:
- Across time
- Logical event sequence
- Chronology
Examples of Violations
| Type | Example | Impact |
|---|---|---|
| Format | Date: "01.05.2024" vs "2024-05-01" | Processing errors |
| Semantic | Status: "Active" in CRM, "Closed" in billing | Wrong decisions |
| Referential | Order without existing customer | Integrity violations |
| Temporal | Delivery date before order date | Logical errors |
| Computational | Detail sum ≠ total sum | Financial discrepancies |
Ensuring Consistency
graph LR
A[Source 1] --> D[Reconciliation Rules]
B[Source 2] --> D
C[Source 3] --> D
D --> E[Master Data]
E --> F[Consistent Data]
style D fill:#ffd,stroke:#333,stroke-width:2px
style E fill:#ddf,stroke:#333,stroke-width:2pxStrategies:
Prevent inconsistency:
- Master data management (MDM)
- Standardize formats
- Validate input
- Transactional integrity
- Synchronous replication
Fix inconsistency:
- Regular reconciliation
- Auto-normalization
- Reconciliation processes
- Data cleansing
- Source prioritization
How Dimensions Interact
The four dimensions affect each other:
Mutual Impact Matrix
| Dimension | Freshness | Completeness | Accuracy | Consistency |
|---|---|---|---|---|
| Freshness | . | New data may be incomplete | Fast loading reduces accuracy | Async violates consistency |
| Completeness | Waiting reduces freshness | . | Imputation reduces accuracy | Different completeness across systems |
| Accuracy | Checking takes time | Accurate data is usually more complete | . | Corrections may break consistency |
| Consistency | Sync takes time | Reconciliation may delete data | Unification may reduce accuracy | . |
Practical Implementation
Quality Control Architecture
graph TB
subgraph "Data Sources"
S1[Database]
S2[API]
S3[Files]
S4[Streaming]
end
subgraph "Data Quality Layer"
P[Profiling]
V[Validation]
C[Cleansing]
M[Monitoring]
end
subgraph "Storage"
R[Raw Data]
Q[Quality Metrics]
Cl[Clean Data]
end
subgraph "Consumers"
A[Analytics]
Re[Reports]
ML[Machine Learning]
end
S1 --> P
S2 --> P
S3 --> P
S4 --> P
P --> V
V --> C
C --> M
P --> R
V --> Q
C --> Cl
M --> Q
Cl --> A
Cl --> Re
Cl --> MLAutomation
DQ Automation Framework
Level 1: Basic checks (daily)
- NULL value checks
- Data type validation
- Duplicate checks
- Range controls
Level 2: Business rules (weekly)
- Cross-system checks
- Business logic validation
- Referential integrity
- Trend and anomaly analysis
Level 3: Deep analysis (monthly)
- Statistical distribution
- ML for anomaly detection
- Historical pattern comparison
- Predictive quality analysis
Tools
Open source:
- Great Expectations, validation framework
- Apache Griffin, quality platform
- Deequ, quality library for Spark
- dbt tests, built-in tests in dbt
Commercial:
- Informatica Data Quality
- IBM InfoSphere QualityStage
- Talend Data Quality
- Collibra Data Quality
Cloud:
- AWS Glue DataBrew
- Google Cloud Data Quality
- Azure Purview
- Snowflake Data Quality
Metrics and KPIs
Composite Quality Index
Where w is the importance weight for each dimension.
Example:
| Dimension | Weight | Score | Weighted Score |
|---|---|---|---|
| Freshness | 0.3 | 85% | 25.5 |
| Completeness | 0.2 | 92% | 18.4 |
| Accuracy | 0.4 | 88% | 35.2 |
| Consistency | 0.1 | 95% | 9.5 |
| DQI | 1.0 | . | 88.6% |
Dashboard
Key Dashboard Elements
Operational:
- Current DQI by system
- Critical errors count
- Time since last update
- Failed check percentage
Trends:
- DQI over time
- Top 10 problem tables
- Error type distribution
- Remediation progress
Details:
- Drill-down by dimension
- Specific issues list
- Change history
- Owners and SLAs
Roles and Process
Roles
Data Steward:
- Set quality standards
- Monitor metrics
- Coordinate improvements
- Escalate issues
Data Engineer:
- Implement checks
- Automate processes
- Fix technical issues
- Optimize pipelines
Business Analyst:
- Define business rules
- Validate results
- Prioritize fixes
- Communicate with business
Data Consumer:
- Report issues
- Quality feedback
- Help define requirements
- Validate fixes
Process
graph TD
A[Define Requirements] --> B[Profile Data]
B --> C{Meets Requirements?}
C -->|Yes| D[Monitor]
C -->|No| E[Analyze Issues]
E --> F[Remediation Plan]
F --> G[Implementation]
G --> H[Validation]
H --> C
D --> I[Reporting]
I --> J[Continuous Improvement]
J --> AEconomics
ROI from Quality
Direct benefits:
- 15-20% lower operational costs
- 30% faster decisions
- 40-50% fewer reporting errors
- 20-25% better marketing effectiveness
Indirect benefits:
- Higher data trust
- Better reputation
- Faster digital transformation
- Lower regulatory risk
Cost of Low Quality
Typical costs:
| Category | % of Revenue | Examples |
|---|---|---|
| Operational losses | 5-8% | Duplicated work, manual checks |
| Lost opportunities | 3-5% | Wrong decisions, lost customers |
| Compliance risks | 2-4% | Fines, audits, legal |
| IT costs | 1-3% | Data fixes, system rework |
Best Practices
Implementation Checklist
Phase 1: Assessment (1-2 months)
- Audit current data state
- Identify critical assets
- Measure baseline metrics
- Identify main issues
- Assess business impact
Phase 2: Planning (2-3 weeks)
- Set target metrics
- Develop strategy
- Assign owners
- Set priorities
- Build roadmap
Phase 3: Implementation (3-6 months)
- Deploy automated checks
- Set up monitoring and alerts
- Build remediation processes
- Train the team
- Run pilots
Phase 4: Optimization (ongoing)
- Analyze results
- Adjust processes
- Scale to new areas
- Apply advanced techniques
- Build quality culture
Common Mistakes
Top 5 DQ Mistakes
1. Solving everything at once
- Mistake: Cover all systems and data simultaneously
- Fix: Start with critical data, expand gradually
2. Technology only
- Mistake: Buy expensive tools, skip processes
- Fix: 70% process, 30% technology
3. Ignoring business
- Mistake: IT-driven without business
- Fix: Close collaboration with business users
4. No measurable goals
- Mistake: Improve quality "in general"
- Fix: Specific KPIs with baselines and targets
5. Underestimating change
- Mistake: Quality as one-time project
- Fix: Quality as continuous process and culture
What's Next
Trends
AI/ML automation:
- Auto anomaly detection
- Predictive quality
- Self-healing pipelines
- Intelligent imputation
Real-time quality:
- Streaming checks
- Proactive prevention
- Continuous profiling
- Instant feedback
DataOps:
- Quality as Code
- CI/CD for data
- Automated testing
- Data version control
New Technologies
Blockchain:
- Immutable audit trail
- Decentralized validation
- Smart contracts for quality
- Trusted sources
Federated Learning:
- Quality checks without centralization
- Privacy-preserving validation
- Distributed metrics
- Collaborative improvement
About AI participation in writing articles
This article, like many others on our site, was created, written and proofread by a team of developers. Of course, not without the participation of AI assistants. We don't hide this and believe that modern systems are already quite good at handling simple tasks and, relatively speaking, writing an article about Viewport yourself is quite strange. It won't come out significantly better and will take a lot of time. But providing basic understanding to beginner webmasters is necessary. Of course, after the article is written by assistants - there's always proofreading, and this is where not one or two people participate, and only after that the article is published.
Ready to Improve Your Data Quality?
Try Statable free. Built-in quality control, automated validation, monitoring across all four dimensions, and intelligent recommendations.
Ready to take control of your web analytics? Try Statable free for 30 days — no credit card required, full feature access, GDPR-compliant by default. Start your free trial or view a live demo.