Incremental Data Modeling: Building Data Models That Scale
Incremental data modeling processes only new or changed data rather than full refreshes. Learn the patterns, techniques, and trade-offs for building data models that scale efficiently with data volume.
Incremental data modeling is an approach where data transformations process only new or changed data rather than recomputing everything from scratch. As data volumes grow, full refresh approaches become impractical - taking too long, costing too much, or failing to meet freshness requirements. Incremental modeling enables data pipelines to scale by doing only necessary work.
The concept is simple - don't recompute what hasn't changed. The implementation is nuanced, involving change detection, idempotency, state management, and handling edge cases that full refreshes naturally avoid.
Why Incremental Matters
Consider a daily sales summary model with 100 million historical orders and roughly 100,000 new orders arriving per day:
Full refresh approach:
CREATE TABLE daily_sales AS
SELECT order_date,
SUM(amount) as total_sales,
COUNT(*) as order_count
FROM orders
GROUP BY order_date;
Every day, this processes all 100 million orders - even though only today's orders are new. As data grows, this takes longer and costs more.
Incremental approach:
-- Only process today's orders
INSERT INTO daily_sales
SELECT order_date,
SUM(amount) as total_sales,
COUNT(*) as order_count
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY order_date;
Process 100K new orders instead of 100M total orders - three orders of magnitude improvement.
Core Patterns
Append-Only Incremental
The simplest pattern - new records are added without modifying existing records.
-- Check for new records since last run
INSERT INTO target_table
SELECT *
FROM source_table
WHERE created_at > (SELECT MAX(created_at) FROM target_table);
Works well for:
- Immutable event data (clicks, page views, transactions)
- Fact tables that don't update
- Append-only sources
Limitations:
- Doesn't handle updates to existing records
- Requires reliable timestamp or sequence for change detection
- Late-arriving data may be missed
Merge/Upsert Incremental
Handles both inserts and updates by matching records on keys:
MERGE INTO target_table t
USING (
SELECT * FROM source_table
WHERE updated_at > @last_run_timestamp
) s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET t.name = s.name, t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
INSERT (id, name, amount, updated_at)
VALUES (s.id, s.name, s.amount, s.updated_at);
Works well for:
- Dimension tables with Type 1 updates
- Entities that can be modified
- Sources with reliable updated_at timestamps
Limitations:
- Deletes not handled (requires additional logic)
- Merge performance can be poor for large update volumes
- Requires unique key for matching
Delete and Replace Partitions
For partitioned tables, replace affected partitions entirely:
-- Delete today's partition
DELETE FROM daily_sales WHERE order_date = CURRENT_DATE;
-- Rebuild today's partition from source
INSERT INTO daily_sales
SELECT order_date, SUM(amount), COUNT(*)
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY order_date;
Works well for:
- Time-partitioned fact tables
- When partitions are the natural update boundary
- When change detection within partitions is complex
Benefits:
- Idempotent - running twice produces the same result
- Handles late-arriving data within partition window
- Simpler than row-level change tracking
Incremental Aggregations
Updating aggregates incrementally rather than recomputing:
-- Update running totals incrementally
UPDATE customer_totals ct
SET total_orders = ct.total_orders + delta.new_orders,
total_amount = ct.total_amount + delta.new_amount
FROM (
SELECT customer_id,
COUNT(*) as new_orders,
SUM(amount) as new_amount
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY customer_id
) delta
WHERE ct.customer_id = delta.customer_id;
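Note that the UPDATE above only adjusts customers that already have a row in customer_totals; brand-new customers are silently skipped. One way to handle both cases is an upsert. A minimal sketch using PostgreSQL-style INSERT ... ON CONFLICT, assuming customer_id is the table's primary key:
-- Upsert: insert new customers, add deltas to existing ones
INSERT INTO customer_totals (customer_id, total_orders, total_amount)
SELECT customer_id,
       COUNT(*)    AS new_orders,
       SUM(amount) AS new_amount
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY customer_id
ON CONFLICT (customer_id) DO UPDATE
SET total_orders = customer_totals.total_orders + EXCLUDED.total_orders,
    total_amount = customer_totals.total_amount + EXCLUDED.total_amount;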
Works well for:
- Running totals and cumulative metrics
- High-frequency updates
- When recomputing full aggregates is expensive
Challenges:
- Handling updates and deletes to source data
- Ensuring correctness - bugs in the incremental logic silently accumulate errors over time
- Debugging discrepancies
Change Detection Techniques
Timestamp-Based
Use updated_at or created_at columns:
SELECT * FROM source
WHERE updated_at > @last_successful_run;
Pros: Simple, widely supported
Cons: Requires reliable timestamps; misses changes that don't update the timestamp
Change Data Capture (CDC)
Read database transaction logs to capture all changes:
INSERT order_123 {customer: 'ABC', amount: 100}
UPDATE order_456 {amount: 150} -- was 145
DELETE order_789
Pros: Captures all changes including deletes; provides the exact change record
Cons: Requires CDC infrastructure; more complex to process
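Applying a staged stream of CDC events typically comes down to a MERGE with a delete branch. A minimal sketch, assuming a hypothetical cdc_events staging table with op ('I'/'U'/'D') and lsn (log sequence number) columns, where only the latest event per id should win:
-- Collapse the event stream to the latest event per key, then apply it
MERGE INTO target_table t
USING (
    SELECT id, op, customer, amount
    FROM (
        SELECT e.*,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY lsn DESC) AS rn
        FROM cdc_events e
    ) ranked
    WHERE rn = 1
) s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN
    UPDATE SET t.customer = s.customer, t.amount = s.amount
WHEN NOT MATCHED AND s.op <> 'D' THEN
    INSERT (id, customer, amount) VALUES (s.id, s.customer, s.amount);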
Hash Comparison
Compare hashes of source and target rows:
-- Compare row hashes; COALESCE and a delimiter keep NULLs and adjacent values from colliding
SELECT s.*
FROM source s
LEFT JOIN (
    SELECT id,
           MD5(CONCAT(COALESCE(col1, ''), '|', COALESCE(col2, ''), '|', COALESCE(col3, ''))) as row_hash
    FROM target
) t ON s.id = t.id
WHERE t.id IS NULL
   OR MD5(CONCAT(COALESCE(s.col1, ''), '|', COALESCE(s.col2, ''), '|', COALESCE(s.col3, ''))) != t.row_hash;
Pros: Detects any change; works without timestamps
Cons: Requires a full scan of the target; compute-intensive
Sequence-Based
Use monotonically increasing sequence numbers:
SELECT * FROM source
WHERE sequence_number > @last_processed_sequence;
Pros: Reliable ordering; no timestamp clock issues
Cons: Requires the source to provide sequence numbers
State Management
Incremental models require tracking state - what has been processed, up to when, and whether runs succeeded.
High Water Marks
Track the maximum processed value:
CREATE TABLE pipeline_state (
model_name VARCHAR(100),
last_processed_timestamp TIMESTAMP,
last_processed_sequence BIGINT,
last_run_status VARCHAR(20),
last_run_at TIMESTAMP
);
Each run follows the same cycle (sketched below):
- Read high water mark
- Process records after high water mark
- On success, update high water mark
- On failure, leave high water mark unchanged
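A minimal sketch of that cycle for the daily_sales example. The @hwm and @run_started_at variables and the created_at load timestamp on orders are assumptions standing in for whatever your orchestration tool and source actually provide:
-- 1. Read the high water mark for this model (becomes @hwm)
SELECT last_processed_timestamp
FROM pipeline_state
WHERE model_name = 'daily_sales';

-- 2. Process only records past the high water mark
INSERT INTO daily_sales
SELECT order_date, SUM(amount) AS total_sales, COUNT(*) AS order_count
FROM orders
WHERE created_at > @hwm
GROUP BY order_date;

-- 3. Advance the high water mark only after the load succeeds
UPDATE pipeline_state
SET last_processed_timestamp = @run_started_at,
    last_run_status = 'success',
    last_run_at = CURRENT_TIMESTAMP
WHERE model_name = 'daily_sales';
Wrapping steps 2 and 3 in one transaction (see Transaction Boundaries below) keeps state and data consistent.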
Idempotency
Design models to produce correct results even when run multiple times:
-- Idempotent: delete then insert
DELETE FROM daily_stats WHERE stat_date = @process_date;
INSERT INTO daily_stats SELECT ...;
-- Not idempotent: just insert (duplicates if run twice)
INSERT INTO daily_stats SELECT ...;
Idempotency enables safe reruns after failures without data corruption.
Transaction Boundaries
Ensure state updates and data updates are atomic:
BEGIN TRANSACTION;
INSERT INTO target_table SELECT ... FROM source WHERE updated_at > @hwm;
UPDATE pipeline_state SET last_processed_timestamp = @current_run;
COMMIT;
If either fails, both roll back - preventing state/data inconsistency.
Handling Edge Cases
Late-Arriving Data
Data whose business timestamp falls in a window that has already been processed:
-- Include lookback window for late arrivals
SELECT * FROM source
WHERE updated_at >= @last_run - INTERVAL '3 days';
Trade-off: larger lookback catches more late data but reprocesses more records.
Out-of-Order Processing
When parallel pipelines process data out of sequence:
- Use partition-based processing (each partition independent)
- Or sequence numbers with gap detection (sketched below)
- Or eventual consistency with periodic full reconciliation
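A minimal gap-detection sketch for the sequence-number approach, assuming a hypothetical staged_changes table that carries the source's sequence_number:
-- Flag missing sequence numbers before advancing the high water mark
SELECT prev_sequence_number,
       sequence_number
FROM (
    SELECT sequence_number,
           LAG(sequence_number) OVER (ORDER BY sequence_number) AS prev_sequence_number
    FROM staged_changes
) s
WHERE sequence_number - prev_sequence_number > 1;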
Source Schema Changes
When source columns are added, modified, or removed:
- Version your extraction schemas
- Handle unknown columns gracefully
- Alert on schema changes for human review
Deleted Records
Detecting deletions requires one of the following approaches:
- CDC that captures deletes
- Full scan comparison (expensive)
- Soft delete markers in source
- Periodic reconciliation runs
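For the reconciliation option, a minimal sketch that soft-deletes target rows whose keys have disappeared from a full extract of source keys. The source_keys table and the is_deleted flag are assumptions used for illustration:
-- Soft-delete rows no longer present in the source
UPDATE target_table t
SET is_deleted = TRUE
WHERE t.is_deleted = FALSE
  AND NOT EXISTS (
      SELECT 1
      FROM source_keys s   -- full extract of current source primary keys
      WHERE s.id = t.id
  );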
AI Analytics Implications
Incremental modeling affects AI systems in several ways.
Data Freshness
Incremental processing enables faster updates - hourly or even minute-level freshness. AI systems can answer questions about the current state rather than yesterday's state.
Consistency During Updates
During incremental updates, queries may see partially updated data. Considerations:
- Use atomic partition or table swaps where possible (sketched below)
- Document expected consistency model
- Consider snapshot isolation for critical queries
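A coarse table-level version of the swap idea, in PostgreSQL-style syntax where DDL is transactional; many warehouses expose their own partition- or table-swap commands, so treat this as an illustration of the visibility guarantee rather than a portable recipe:
-- Build the refreshed data off to the side
CREATE TABLE daily_sales_staging AS
SELECT order_date, SUM(amount) AS total_sales, COUNT(*) AS order_count
FROM orders
GROUP BY order_date;

-- Swap it into place atomically so readers never see a half-updated table
BEGIN TRANSACTION;
ALTER TABLE daily_sales RENAME TO daily_sales_old;
ALTER TABLE daily_sales_staging RENAME TO daily_sales;
COMMIT;

DROP TABLE daily_sales_old;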
Historical Accuracy
Incremental models with bugs can accumulate errors over time. Unlike full refresh (which self-corrects), incremental errors persist until detected and fixed.
Mitigations:
- Periodic full refresh reconciliation
- Data quality monitoring
- Automated comparison to source aggregates
Semantic Layer Documentation
Document incremental update patterns in the semantic layer:
model:
name: daily_sales_summary
update_pattern: incremental_by_date
freshness_sla: hourly
lookback_window: 3_days
full_refresh_schedule: weekly
This helps AI systems understand data currency and set appropriate user expectations.
Best Practices
Start Simple
Begin with full refresh. Add incremental only when:
- Full refresh takes too long
- Freshness requirements demand it
- Cost savings justify complexity
Monitor Aggressively
Track:
- Record counts (source vs. target)
- Aggregate comparisons
- Late-arriving data volumes
- Processing times
Catch incremental drift before it becomes significant.
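A minimal drift check for the daily_sales example, recomputing the aggregates from the source and flagging any date where the target disagrees (IS DISTINCT FROM is PostgreSQL-style null-safe comparison):
-- Any rows returned indicate incremental drift for that date
SELECT COALESCE(s.order_date, t.order_date) AS order_date,
       s.total_sales AS source_total,
       t.total_sales AS target_total,
       s.order_count AS source_count,
       t.order_count AS target_count
FROM (
    SELECT order_date, SUM(amount) AS total_sales, COUNT(*) AS order_count
    FROM orders
    GROUP BY order_date
) s
FULL OUTER JOIN daily_sales t ON s.order_date = t.order_date
WHERE s.total_sales IS DISTINCT FROM t.total_sales
   OR s.order_count IS DISTINCT FROM t.order_count;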
Plan for Full Refresh
Always maintain the ability to do full refresh:
- For initial load
- For recovery from bugs
- For periodic reconciliation
- For schema changes
Don't let incremental-only dependencies block recovery.
Document Change Contracts
Document what changes are expected and how they're detected:
- What timestamp column to use
- Expected update patterns
- Late-arrival characteristics
- Delete handling approach
This documentation feeds semantic layer definitions and helps AI understand data behavior.
Incremental data modeling is essential for scalable analytics, but it's not free - it trades complexity for performance. Apply it where the trade-off makes sense, maintain robust monitoring, and always have a full refresh fallback.
Questions
How does incremental modeling differ from incremental loading?
Incremental loading refers to extracting only new or changed source data. Incremental modeling extends this concept to transformations - computing only what's necessary to update the model given the new data. You can have incremental loading with full model rebuilds, but true incremental modeling minimizes transformation work too.