Incremental Data Modeling: Building Data Models That Scale
Incremental data modeling processes only new or changed data rather than full refreshes. Learn the patterns, techniques, and trade-offs for building data models that scale efficiently with data volume.
Incremental data modeling is an approach where data transformations process only new or changed data rather than recomputing everything from scratch. As data volumes grow, full refresh approaches become impractical - taking too long, costing too much, or failing to meet freshness requirements. Incremental modeling enables data pipelines to scale by doing only necessary work.
The concept is simple - don't recompute what hasn't changed. The implementation is nuanced, involving change detection, idempotency, state management, and handling edge cases that full refreshes naturally avoid.
Why Incremental Matters
Consider a daily sales summary model with 100 million historical orders and roughly 100,000 new orders arriving per day:
Full refresh approach:
CREATE TABLE daily_sales AS
SELECT order_date,
SUM(amount) as total_sales,
COUNT(*) as order_count
FROM orders
GROUP BY order_date;
Every day, this processes all 100 million orders - even though only today's orders are new. As data grows, this takes longer and costs more.
Incremental approach:
-- Only process today's orders
INSERT INTO daily_sales
SELECT order_date,
SUM(amount) as total_sales,
COUNT(*) as order_count
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY order_date;
Process 100K new orders instead of 100M total orders - three orders of magnitude improvement.
Core Patterns
Append-Only Incremental
The simplest pattern - new records are added without modifying existing records.
-- Check for new records since last run
INSERT INTO target_table
SELECT *
FROM source_table
WHERE created_at > (SELECT MAX(created_at) FROM target_table);
Works well for:
- Immutable event data (clicks, page views, transactions)
- Fact tables that don't update
- Append-only sources
Limitations:
- Doesn't handle updates to existing records
- Requires reliable timestamp or sequence for change detection
- Late-arriving data may be missed
Merge/Upsert Incremental
Handles both inserts and updates by matching records on keys:
MERGE INTO target_table t
USING (
SELECT * FROM source_table
WHERE updated_at > @last_run_timestamp
) s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET t.name = s.name, t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
INSERT (id, name, amount, updated_at)
VALUES (s.id, s.name, s.amount, s.updated_at);
Works well for:
- Dimension tables with Type 1 updates
- Entities that can be modified
- Sources with reliable updated_at timestamps
Limitations:
- Deletes not handled (requires additional logic)
- Merge performance can be poor for large update volumes
- Requires unique key for matching
Delete and Replace Partitions
For partitioned tables, replace affected partitions entirely:
-- Delete today's partition
DELETE FROM daily_sales WHERE order_date = CURRENT_DATE;
-- Rebuild today's partition from source
INSERT INTO daily_sales
SELECT order_date, SUM(amount), COUNT(*)
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY order_date;
Works well for:
- Time-partitioned fact tables
- When partitions are the natural update boundary
- When change detection within partitions is complex
Benefits:
- Idempotent - running twice produces the same result
- Handles late-arriving data within partition window
- Simpler than row-level change tracking
Incremental Aggregations
Updating aggregates incrementally rather than recomputing:
-- Update running totals incrementally
UPDATE customer_totals ct
SET total_orders = ct.total_orders + delta.new_orders,
total_amount = ct.total_amount + delta.new_amount
FROM (
SELECT customer_id,
COUNT(*) as new_orders,
SUM(amount) as new_amount
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY customer_id
) delta
WHERE ct.customer_id = delta.customer_id;
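Note that the UPDATE above only adjusts customers that already have a row in customer_totals; brand-new customers are silently skipped. One way to handle both cases is an upsert. A minimal sketch using PostgreSQL-style INSERT ... ON CONFLICT, assuming customer_id is the table's primary key:
-- Upsert: insert new customers, add deltas to existing ones
INSERT INTO customer_totals (customer_id, total_orders, total_amount)
SELECT customer_id,
       COUNT(*)    AS new_orders,
       SUM(amount) AS new_amount
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY customer_id
ON CONFLICT (customer_id) DO UPDATE
SET total_orders = customer_totals.total_orders + EXCLUDED.total_orders,
    total_amount = customer_totals.total_amount + EXCLUDED.total_amount;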
Works well for:
- Running totals and cumulative metrics
- High-frequency updates
- When recomputing full aggregates is expensive
Challenges:
- Handling updates and deletes to source data
- Ensuring correctness - bugs in the incremental logic silently accumulate errors over time
- Debugging discrepancies
Change Detection Techniques
Timestamp-Based
Use updated_at or created_at columns:
SELECT * FROM source
WHERE updated_at > @last_successful_run;
Pros: Simple, widely supported
Cons: Requires reliable timestamps; misses changes that don't update the timestamp
Change Data Capture (CDC)
Read database transaction logs to capture all changes:
INSERT order_123 {customer: 'ABC', amount: 100}
UPDATE order_456 {amount: 150} -- was 145
DELETE order_789
Pros: Captures all changes including deletes; provides the exact change record
Cons: Requires CDC infrastructure; more complex to process
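Applying a staged stream of CDC events typically comes down to a MERGE with a delete branch. A minimal sketch, assuming a hypothetical cdc_events staging table with op ('I'/'U'/'D') and lsn (log sequence number) columns, where only the latest event per id should win:
-- Collapse the event stream to the latest event per key, then apply it
MERGE INTO target_table t
USING (
    SELECT id, op, customer, amount
    FROM (
        SELECT e.*,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY lsn DESC) AS rn
        FROM cdc_events e
    ) ranked
    WHERE rn = 1
) s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN
    UPDATE SET t.customer = s.customer, t.amount = s.amount
WHEN NOT MATCHED AND s.op <> 'D' THEN
    INSERT (id, customer, amount) VALUES (s.id, s.customer, s.amount);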
Hash Comparison
Compare hashes of source and target rows:
-- Compare row hashes; COALESCE and a delimiter keep NULLs and adjacent values from colliding
SELECT s.*
FROM source s
LEFT JOIN (
    SELECT id,
           MD5(CONCAT(COALESCE(col1, ''), '|', COALESCE(col2, ''), '|', COALESCE(col3, ''))) as row_hash
    FROM target
) t ON s.id = t.id
WHERE t.id IS NULL
   OR MD5(CONCAT(COALESCE(s.col1, ''), '|', COALESCE(s.col2, ''), '|', COALESCE(s.col3, ''))) != t.row_hash;
Pros: Detects any change; works without timestamps
Cons: Requires a full scan of the target; compute-intensive
Sequence-Based
Use monotonically increasing sequence numbers:
SELECT * FROM source
WHERE sequence_number > @last_processed_sequence;
Pros: Reliable ordering; no timestamp clock issues
Cons: Requires the source to provide sequence numbers
State Management
Incremental models require tracking state - what has been processed, up to when, and whether runs succeeded.
High Water Marks
Track the maximum processed value:
CREATE TABLE pipeline_state (
model_name VARCHAR(100),
last_processed_timestamp TIMESTAMP,
last_processed_sequence BIGINT,
last_run_status VARCHAR(20),
last_run_at TIMESTAMP
);
Each run follows the same cycle (sketched below):
- Read high water mark
- Process records after high water mark
- On success, update high water mark
- On failure, leave high water mark unchanged
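A minimal sketch of that cycle for the daily_sales example. The @hwm and @run_started_at variables and the created_at load timestamp on orders are assumptions standing in for whatever your orchestration tool and source actually provide:
-- 1. Read the high water mark for this model (becomes @hwm)
SELECT last_processed_timestamp
FROM pipeline_state
WHERE model_name = 'daily_sales';

-- 2. Process only records past the high water mark
INSERT INTO daily_sales
SELECT order_date, SUM(amount) AS total_sales, COUNT(*) AS order_count
FROM orders
WHERE created_at > @hwm
GROUP BY order_date;

-- 3. Advance the high water mark only after the load succeeds
UPDATE pipeline_state
SET last_processed_timestamp = @run_started_at,
    last_run_status = 'success',
    last_run_at = CURRENT_TIMESTAMP
WHERE model_name = 'daily_sales';
Wrapping steps 2 and 3 in one transaction (see Transaction Boundaries below) keeps state and data consistent.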
Idempotency
Design models to produce correct results even when run multiple times:
-- Idempotent: delete then insert
DELETE FROM daily_stats WHERE stat_date = @process_date;
INSERT INTO daily_stats SELECT ...;
-- Not idempotent: just insert (duplicates if run twice)
INSERT INTO daily_stats SELECT ...;
Idempotency enables safe reruns after failures without data corruption.
Transaction Boundaries
Ensure state updates and data updates are atomic:
BEGIN TRANSACTION;
INSERT INTO target_table SELECT ... FROM source WHERE updated_at > @hwm;
UPDATE pipeline_state SET last_processed_timestamp = @current_run;
COMMIT;
If either fails, both roll back - preventing state/data inconsistency.
Handling Edge Cases
Late-Arriving Data
Data whose business timestamp falls in a window that has already been processed:
-- Include lookback window for late arrivals
SELECT * FROM source
WHERE updated_at >= @last_run - INTERVAL '3 days';
Trade-off: larger lookback catches more late data but reprocesses more records.
Out-of-Order Processing
When parallel pipelines process data out of sequence:
- Use partition-based processing (each partition independent)
- Or sequence numbers with gap detection (sketched below)
- Or eventual consistency with periodic full reconciliation
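A minimal gap-detection sketch for the sequence-number approach, assuming a hypothetical staged_changes table that carries the source's sequence_number:
-- Flag missing sequence numbers before advancing the high water mark
SELECT prev_sequence_number,
       sequence_number
FROM (
    SELECT sequence_number,
           LAG(sequence_number) OVER (ORDER BY sequence_number) AS prev_sequence_number
    FROM staged_changes
) s
WHERE sequence_number - prev_sequence_number > 1;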
Source Schema Changes
When source columns are added, modified, or removed:
- Version your extraction schemas
- Handle unknown columns gracefully
- Alert on schema changes for human review
Deleted Records
Detecting deletions requires one of the following approaches:
- CDC that captures deletes
- Full scan comparison (expensive)
- Soft delete markers in source
- Periodic reconciliation runs
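For the reconciliation option, a minimal sketch that soft-deletes target rows whose keys have disappeared from a full extract of source keys. The source_keys table and the is_deleted flag are assumptions used for illustration:
-- Soft-delete rows no longer present in the source
UPDATE target_table t
SET is_deleted = TRUE
WHERE t.is_deleted = FALSE
  AND NOT EXISTS (
      SELECT 1
      FROM source_keys s   -- full extract of current source primary keys
      WHERE s.id = t.id
  );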
AI Analytics Implications
Incremental modeling affects AI systems in several ways.
Data Freshness
Incremental processing enables faster updates - hourly or even minute-level freshness. AI systems can answer questions about the current state rather than yesterday's state.
Consistency During Updates
During incremental updates, queries may see partially updated data. Considerations:
- Use atomic partition or table swaps where possible (sketched below)
- Document expected consistency model
- Consider snapshot isolation for critical queries
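A coarse table-level version of the swap idea, in PostgreSQL-style syntax where DDL is transactional; many warehouses expose their own partition- or table-swap commands, so treat this as an illustration of the visibility guarantee rather than a portable recipe:
-- Build the refreshed data off to the side
CREATE TABLE daily_sales_staging AS
SELECT order_date, SUM(amount) AS total_sales, COUNT(*) AS order_count
FROM orders
GROUP BY order_date;

-- Swap it into place atomically so readers never see a half-updated table
BEGIN TRANSACTION;
ALTER TABLE daily_sales RENAME TO daily_sales_old;
ALTER TABLE daily_sales_staging RENAME TO daily_sales;
COMMIT;

DROP TABLE daily_sales_old;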
Historical Accuracy
Incremental models with bugs can accumulate errors over time. Unlike full refresh (which self-corrects), incremental errors persist until detected and fixed.
Mitigations:
- Periodic full refresh reconciliation
- Data quality monitoring
- Automated comparison to source aggregates
Semantic Layer Documentation
Document incremental update patterns in the semantic layer:
model:
name: daily_sales_summary
update_pattern: incremental_by_date
freshness_sla: hourly
lookback_window: 3_days
full_refresh_schedule: weekly
This helps AI systems understand data currency and set appropriate user expectations.
Best Practices
Start Simple
Begin with full refresh. Add incremental only when:
- Full refresh takes too long
- Freshness requirements demand it
- Cost savings justify complexity
Monitor Aggressively
Track:
- Record counts (source vs. target)
- Aggregate comparisons
- Late-arriving data volumes
- Processing times
Catch incremental drift before it becomes significant.
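A minimal drift check for the daily_sales example, recomputing the aggregates from the source and flagging any date where the target disagrees (IS DISTINCT FROM is PostgreSQL-style null-safe comparison):
-- Any rows returned indicate incremental drift for that date
SELECT COALESCE(s.order_date, t.order_date) AS order_date,
       s.total_sales AS source_total,
       t.total_sales AS target_total,
       s.order_count AS source_count,
       t.order_count AS target_count
FROM (
    SELECT order_date, SUM(amount) AS total_sales, COUNT(*) AS order_count
    FROM orders
    GROUP BY order_date
) s
FULL OUTER JOIN daily_sales t ON s.order_date = t.order_date
WHERE s.total_sales IS DISTINCT FROM t.total_sales
   OR s.order_count IS DISTINCT FROM t.order_count;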
Plan for Full Refresh
Always maintain the ability to do full refresh:
- For initial load
- For recovery from bugs
- For periodic reconciliation
- For schema changes
Don't let incremental-only dependencies block recovery.
Document Change Contracts
Document what changes are expected and how they're detected:
- What timestamp column to use
- Expected update patterns
- Late-arrival characteristics
- Delete handling approach
This documentation feeds semantic layer definitions and helps AI understand data behavior.
Incremental data modeling is essential for scalable analytics, but it's not free - it trades complexity for performance. Apply it where the trade-off makes sense, maintain robust monitoring, and always have a full refresh fallback.
Questions
How does incremental modeling differ from incremental loading?
Incremental loading refers to extracting only new or changed source data. Incremental modeling extends this concept to transformations - computing only what's necessary to update the model given the new data. You can have incremental loading with full model rebuilds, but true incremental modeling minimizes transformation work too.