Incremental Data Modeling: Building Data Models That Scale

Incremental data modeling processes only new or changed data rather than full refreshes. Learn the patterns, techniques, and trade-offs for building data models that scale efficiently with data volume.

7 min read

Incremental data modeling is an approach where data transformations process only new or changed data rather than recomputing everything from scratch. As data volumes grow, full refresh approaches become impractical - taking too long, costing too much, or failing to meet freshness requirements. Incremental modeling enables data pipelines to scale by doing only necessary work.

The concept is simple - don't recompute what hasn't changed. The implementation is nuanced, involving change detection, idempotency, state management, and handling edge cases that full refreshes naturally avoid.

Why Incremental Matters

Consider a daily sales summary model. With 100 million historical orders:

Full refresh approach:

CREATE TABLE daily_sales AS
SELECT order_date,
       SUM(amount) as total_sales,
       COUNT(*) as order_count
FROM orders
GROUP BY order_date;

Every day, this processes all 100 million orders - even though only today's orders are new. As data grows, this takes longer and costs more.

Incremental approach:

-- Only process today's orders
INSERT INTO daily_sales
SELECT order_date,
       SUM(amount) as total_sales,
       COUNT(*) as order_count
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY order_date;

Process 100K new orders instead of 100M total orders - three orders of magnitude improvement.

Core Patterns

Append-Only Incremental

The simplest pattern - new records are added without modifying existing records.

-- Check for new records since last run
-- (COALESCE covers the initial load, when target_table is still empty)
INSERT INTO target_table
SELECT *
FROM source_table
WHERE created_at > (SELECT COALESCE(MAX(created_at), TIMESTAMP '1900-01-01')
                    FROM target_table);

Works well for:

  • Immutable event data (clicks, page views, transactions)
  • Fact tables that don't update
  • Append-only sources

Limitations:

  • Doesn't handle updates to existing records
  • Requires reliable timestamp or sequence for change detection
  • Late-arriving data may be missed

Merge/Upsert Incremental

Handles both inserts and updates by matching records on keys:

MERGE INTO target_table t
USING (
  SELECT * FROM source_table
  WHERE updated_at > @last_run_timestamp
) s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.name = s.name, t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, name, amount, updated_at)
  VALUES (s.id, s.name, s.amount, s.updated_at);

Works well for:

  • Dimension tables with Type 1 updates
  • Entities that can be modified
  • Sources with reliable updated_at timestamps

Limitations:

  • Deletes not handled (requires additional logic)
  • Merge performance can be poor for large update volumes
  • Requires unique key for matching

Delete and Replace Partitions

For partitioned tables, replace affected partitions entirely:

-- Delete today's partition
DELETE FROM daily_sales WHERE order_date = CURRENT_DATE;

-- Rebuild today's partition from source
INSERT INTO daily_sales
SELECT order_date, SUM(amount), COUNT(*)
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY order_date;

Works well for:

  • Time-partitioned fact tables
  • When partitions are the natural update boundary
  • When change detection within partitions is complex

Benefits:

  • Idempotent - running twice produces the same result
  • Handles late-arriving data within partition window
  • Simpler than row-level change tracking

Incremental Aggregations

Updating aggregates incrementally rather than recomputing:

-- Update running totals incrementally
UPDATE customer_totals ct
SET total_orders = ct.total_orders + delta.new_orders,
    total_amount = ct.total_amount + delta.new_amount
FROM (
  SELECT customer_id,
         COUNT(*) as new_orders,
         SUM(amount) as new_amount
  FROM orders
  WHERE order_date = CURRENT_DATE
  GROUP BY customer_id
) delta
WHERE ct.customer_id = delta.customer_id;

Works well for:

  • Running totals and cumulative metrics
  • High-frequency updates
  • When recomputing full aggregates is expensive

Challenges:

  • Handling updates and deletes to source data
  • Ensuring correctness when incremental logic has bugs
  • Debugging discrepancies
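
One correctness pitfall worth calling out: the UPDATE above only touches customers that already have a row in customer_totals, so a first-time customer's orders are silently dropped. A merge-style upsert covers both cases - a minimal sketch (MERGE syntax varies slightly by engine):

MERGE INTO customer_totals ct
USING (
  SELECT customer_id,
         COUNT(*) as new_orders,
         SUM(amount) as new_amount
  FROM orders
  WHERE order_date = CURRENT_DATE
  GROUP BY customer_id
) delta
ON ct.customer_id = delta.customer_id
WHEN MATCHED THEN
  UPDATE SET ct.total_orders = ct.total_orders + delta.new_orders,
             ct.total_amount = ct.total_amount + delta.new_amount
WHEN NOT MATCHED THEN
  INSERT (customer_id, total_orders, total_amount)
  VALUES (delta.customer_id, delta.new_orders, delta.new_amount);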

Change Detection Techniques

Timestamp-Based

Use updated_at or created_at columns:

SELECT * FROM source
WHERE updated_at > @last_successful_run;

Pros: Simple, widely supported
Cons: Requires reliable timestamps, misses changes without timestamp updates

Change Data Capture (CDC)

Read database transaction logs to capture all changes:

INSERT order_123 {customer: 'ABC', amount: 100}
UPDATE order_456 {amount: 150}  -- was 145
DELETE order_789

Pros: Captures all changes including deletes, exact change record
Cons: Requires CDC infrastructure, more complex to process
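
Part of that complexity is replaying the change operations against the target. A hedged sketch, assuming the CDC tool lands changes in an order_changes table with operation and change_sequence columns (names are illustrative, and MERGE dialects differ on multiple WHEN MATCHED branches):

-- Apply a batch of CDC changes to the current-state table
-- (assumes the batch is deduplicated to the latest change per order_id)
MERGE INTO orders_current t
USING (
  SELECT * FROM order_changes
  WHERE change_sequence > @last_processed_sequence
) c
ON t.order_id = c.order_id
WHEN MATCHED AND c.operation = 'DELETE' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET t.customer = c.customer, t.amount = c.amount
WHEN NOT MATCHED AND c.operation != 'DELETE' THEN
  INSERT (order_id, customer, amount)
  VALUES (c.order_id, c.customer, c.amount);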

Hash Comparison

Compare hashes of source and target rows:

SELECT s.*
FROM source s
LEFT JOIN (
  SELECT id, MD5(CONCAT(col1, col2, col3)) as row_hash
  FROM target
) t ON s.id = t.id
WHERE MD5(CONCAT(s.col1, s.col2, s.col3)) != t.row_hash
   OR t.id IS NULL;

Pros: Detects any change, works without timestamps
Cons: Requires full scan of target, compute-intensive

Sequence-Based

Use monotonically increasing sequence numbers:

SELECT * FROM source
WHERE sequence_number > @last_processed_sequence;

Pros: Reliable ordering, no timestamp clock issues
Cons: Requires source to provide sequences

State Management

Incremental models require tracking state - what has been processed, up to when, and whether runs succeeded.

High Water Marks

Track the maximum processed value:

CREATE TABLE pipeline_state (
  model_name VARCHAR(100),
  last_processed_timestamp TIMESTAMP,
  last_processed_sequence BIGINT,
  last_run_status VARCHAR(20),
  last_run_at TIMESTAMP
);

Each run follows the same loop (sketched in SQL after the list):

  1. Read high water mark
  2. Process records after high water mark
  3. On success, update high water mark
  4. On failure, leave high water mark unchanged
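
A minimal sketch of steps 1 and 3, assuming the pipeline_state table above and a hypothetical model named my_model (the atomic pairing of data and state updates is covered under Transaction Boundaries below):

-- Step 1: read the current high water mark for this model
SELECT last_processed_timestamp
FROM pipeline_state
WHERE model_name = 'my_model';

-- Step 3: after a successful load, advance the mark and record the run
UPDATE pipeline_state
SET last_processed_timestamp = @current_run,
    last_run_status = 'success',
    last_run_at = CURRENT_TIMESTAMP
WHERE model_name = 'my_model';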

Idempotency

Design models to produce correct results even when run multiple times:

-- Idempotent: delete then insert
DELETE FROM daily_stats WHERE stat_date = @process_date;
INSERT INTO daily_stats SELECT ...;

-- Not idempotent: just insert (duplicates if run twice)
INSERT INTO daily_stats SELECT ...;

Idempotency enables safe reruns after failures without data corruption.

Transaction Boundaries

Ensure state updates and data updates are atomic:

BEGIN TRANSACTION;
  INSERT INTO target_table SELECT ... FROM source WHERE updated_at > @hwm;
  UPDATE pipeline_state SET last_processed_timestamp = @current_run;
COMMIT;

If either fails, both roll back - preventing state/data inconsistency.

Handling Edge Cases

Late-Arriving Data

Data that arrives after its business timestamp:

-- Include lookback window for late arrivals
SELECT * FROM source
WHERE updated_at >= @last_run - INTERVAL '3 days';

Trade-off: a larger lookback catches more late data but reprocesses more records, so the load itself must be idempotent (merge or delete-and-insert) to avoid duplicating the reprocessed rows.

Out-of-Order Processing

When parallel pipelines process data out of sequence:

  • Use partition-based processing (each partition independent)
  • Or sequence numbers with gap detection (see the sketch after this list)
  • Or eventual consistency with periodic full reconciliation
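
A hedged sketch of gap detection, assuming the source exposes a sequence_number that is consecutive once all records have arrived; any row returned means an earlier record is still missing, so processing should pause or alert:

-- Detect missing sequence numbers in the new batch
SELECT sequence_number, prev_seq
FROM (
  SELECT sequence_number,
         COALESCE(
           LAG(sequence_number) OVER (ORDER BY sequence_number),
           @last_processed_sequence
         ) as prev_seq
  FROM source_table
  WHERE sequence_number > @last_processed_sequence
) s
WHERE sequence_number - prev_seq > 1;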

Source Schema Changes

When source columns are added, modified, or removed:

  • Version your extraction schemas
  • Handle unknown columns gracefully
  • Alert on schema changes for human review (see the sketch below)
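
One way to drive such alerts is to compare the live source columns against the last reviewed column list. A sketch using information_schema (widely, though not universally, available) and a hypothetical expected_source_columns reference table maintained by the team:

-- Any row returned is a schema change that needs human review
SELECT COALESCE(live.column_name, expected.column_name) as column_name,
       CASE WHEN expected.column_name IS NULL THEN 'added to source'
            ELSE 'missing from source' END as change_type
FROM (
  SELECT column_name
  FROM information_schema.columns
  WHERE table_name = 'orders'
) live
FULL OUTER JOIN expected_source_columns expected
  ON live.column_name = expected.column_name
WHERE live.column_name IS NULL
   OR expected.column_name IS NULL;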

Deleted Records

Detecting deletions requires:

  • CDC that captures deletes
  • Full scan comparison (expensive)
  • Soft delete markers in source
  • Periodic reconciliation runs (sketched below)
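
For example, a periodic reconciliation can flag target rows whose keys no longer exist in the source. This is a full-scan anti-join, so it usually runs far less often than the incremental load; the is_deleted soft-delete flag here is illustrative:

-- Periodic reconciliation: flag target rows deleted at the source
UPDATE target_table
SET is_deleted = TRUE
WHERE is_deleted = FALSE
  AND NOT EXISTS (
    SELECT 1 FROM source_table s
    WHERE s.id = target_table.id
  );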

AI Analytics Implications

Incremental modeling affects AI systems in several ways.

Data Freshness

Incremental enables faster updates - hourly or even minute-level freshness. AI systems can answer questions about current state rather than yesterday's state.

Consistency During Updates

During incremental updates, queries may see partially updated data. Considerations:

  • Use atomic partition swaps where possible (see the sketch after this list)
  • Document expected consistency model
  • Consider snapshot isolation for critical queries
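
One way to approximate an atomic partition swap without engine-specific partition-exchange commands is to build the new partition in a staging table and swap it in within a single transaction. A sketch (the staging table name is illustrative; warehouses with dedicated partition-exchange or table-swap statements make this simpler):

-- Build the replacement partition off to the side
CREATE TABLE daily_sales_staging AS
SELECT order_date,
       SUM(amount) as total_sales,
       COUNT(*) as order_count
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY order_date;

-- Swap it in within one transaction so readers never see a half-updated day
BEGIN TRANSACTION;
  DELETE FROM daily_sales WHERE order_date = CURRENT_DATE;
  INSERT INTO daily_sales SELECT * FROM daily_sales_staging;
COMMIT;

DROP TABLE daily_sales_staging;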

Historical Accuracy

Incremental models with bugs can accumulate errors over time. Unlike full refresh (which self-corrects), incremental errors persist until detected and fixed.

Mitigations:

  • Periodic full refresh reconciliation
  • Data quality monitoring
  • Automated comparison to source aggregates (sketched below)
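
A sketch of the last mitigation: recompute the aggregate directly from orders and surface any dates where the incrementally maintained daily_sales table has drifted. A FULL OUTER JOIN variant would also catch dates missing from either side.

-- Dates where the incremental model disagrees with the source of truth
SELECT s.order_date,
       s.source_total,
       m.total_sales as model_total
FROM (
  SELECT order_date, SUM(amount) as source_total
  FROM orders
  GROUP BY order_date
) s
JOIN daily_sales m ON m.order_date = s.order_date
WHERE s.source_total != m.total_sales;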

Semantic Layer Documentation

Document incremental update patterns in the semantic layer:

model:
  name: daily_sales_summary
  update_pattern: incremental_by_date
  freshness_sla: hourly
  lookback_window: 3_days
  full_refresh_schedule: weekly

This helps AI systems understand data currency and set appropriate user expectations.

Best Practices

Start Simple

Begin with full refresh. Add incremental only when:

  • Full refresh takes too long
  • Freshness requirements demand it
  • Cost savings justify complexity

Monitor Aggressively

Track:

  • Record counts (source vs. target)
  • Aggregate comparisons
  • Late-arriving data volumes
  • Processing times

Catch incremental drift before it becomes significant.

Plan for Full Refresh

Always maintain the ability to do full refresh:

  • For initial load
  • For recovery from bugs
  • For periodic reconciliation
  • For schema changes

Don't let incremental-only dependencies block recovery.

Document Change Contracts

Document what changes are expected and how they're detected:

  • What timestamp column to use
  • Expected update patterns
  • Late-arrival characteristics
  • Delete handling approach

This documentation feeds semantic layer definitions and helps AI understand data behavior.

Incremental data modeling is essential for scalable analytics, but it's not free - it trades complexity for performance. Apply it where the trade-off makes sense, maintain robust monitoring, and always have a full refresh fallback.

Questions

What's the difference between incremental loading and incremental modeling?

Incremental loading refers to extracting only new or changed source data. Incremental modeling extends this concept to transformations - computing only what's necessary to update the model given the new data. You can have incremental loading with full model rebuilds, but true incremental modeling minimizes transformation work too.
