Data Modeling for Real-Time Analytics: Streaming-First Design Patterns
Real-time analytics requires data models optimized for continuous data streams rather than batch processing. Learn the patterns, trade-offs, and architectural considerations for modeling real-time analytical data.
Real-time analytics processes and presents data as events occur, with latencies measured in seconds rather than hours or days. This requires fundamentally different thinking about data modeling - optimizing for continuous streams rather than batch loads, handling unbounded data, and managing the complexity of systems that never stop.
The demand for real-time analytics continues to grow as businesses expect immediate visibility into operations, customer behavior, and system health. However, real-time adds significant complexity. Understanding when it's needed and how to model for it is essential for successful implementation.
When Real-Time Matters
Not every analytical use case requires real-time data. Be clear about actual requirements.
Strong candidates for real-time
- Operational dashboards: Current system health, active user counts, live transaction volumes
- Fraud detection: Must identify suspicious patterns before transactions complete
- Dynamic pricing: Prices that respond to current demand and inventory
- Live personalization: Recommendations based on in-session behavior
- Alerting: Triggering notifications when thresholds are breached
Often don't need real-time
- Business reporting: Daily, weekly, monthly trends don't benefit from second-level freshness
- Strategic analytics: Long-term patterns are stable - yesterday's data is sufficient
- Historical analysis: By definition, analyzing the past
- Exploratory analysis: Analysts iterating on questions don't need streaming data
Implementing real-time for use cases that don't need it wastes engineering effort and adds operational complexity without corresponding value.
Streaming Data Model Characteristics
Append-Only by Nature
Streaming data arrives continuously and is appended to storage:
-- Events arrive continuously
INSERT INTO events (event_id, user_id, event_type, event_time, properties)
VALUES ('evt_123', 'user_456', 'page_view', NOW(), '{"page": "/products"}');
Models should embrace append-only semantics rather than fighting them. Updates are expensive in streaming contexts - model as new events rather than modifications when possible.
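As a minimal sketch of that principle, a profile change can be recorded as a new event rather than an UPDATE to a user row (the event type and property names here are illustrative):
-- Instead of: UPDATE users SET email = 'new@example.com' WHERE user_id = 'user_456'
INSERT INTO events (event_id, user_id, event_type, event_time, properties)
VALUES ('evt_124', 'user_456', 'email_changed', NOW(), '{"new_email": "new@example.com"}');
The history of changes is preserved, and the "current" value becomes a downstream derivation rather than destructive state.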
Time-Centric Organization
Time is the primary organizing principle:
- Partitioning by time: Daily or hourly partitions for efficient time-range queries
- Time-based retention: Automatic expiration of old data
- Event time vs. processing time: Distinguishing when events occurred vs. when they were processed
CREATE TABLE events (
event_id STRING,
user_id STRING,
event_type STRING,
event_time TIMESTAMP, -- When the event occurred
processing_time TIMESTAMP, -- When the event was processed
properties JSON
)
PARTITIONED BY (DATE(event_time));
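Keeping both timestamps makes lateness measurable. A hedged sketch (timestamp arithmetic syntax varies by engine) that quantifies how far behind event time data actually arrives:
-- How late does data arrive, per day?
SELECT
    DATE(event_time) AS event_date,
    MAX(processing_time - event_time) AS max_lag,
    AVG(processing_time - event_time) AS avg_lag
FROM events
GROUP BY DATE(event_time);
Numbers like these inform the watermark and correction strategies discussed later.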
Windowed Aggregations
Rather than aggregating all data, streaming models aggregate within time windows:
-- 5-minute tumbling window
SELECT
TUMBLE_START(event_time, INTERVAL '5' MINUTE) as window_start,
event_type,
COUNT(*) as event_count
FROM events
GROUP BY
TUMBLE(event_time, INTERVAL '5' MINUTE),
event_type;
Window types include (the sliding and session variants are sketched after this list):
- Tumbling windows: Fixed-size, non-overlapping (0:00-0:05, 0:05-0:10)
- Sliding windows: Fixed-size, overlapping (0:00-0:05, 0:01-0:06)
- Session windows: Dynamic size based on activity gaps
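In the Flink-style SQL used above, the other window types follow the same pattern as TUMBLE; a sketch (HOP takes a slide interval then a window size, SESSION takes an inactivity gap):
-- Sliding: 5-minute windows advancing every 1 minute
SELECT
  HOP_START(event_time, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) as window_start,
  COUNT(*) as event_count
FROM events
GROUP BY HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE);

-- Session: a window closes after 30 minutes of inactivity
SELECT
  SESSION_START(event_time, INTERVAL '30' MINUTE) as session_start,
  user_id,
  COUNT(*) as event_count
FROM events
GROUP BY SESSION(event_time, INTERVAL '30' MINUTE), user_id;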
Modeling Patterns for Real-Time
Raw Event Stream
Store raw events as they arrive for maximum flexibility:
CREATE TABLE raw_events (
event_id STRING PRIMARY KEY,
event_time TIMESTAMP,
event_type STRING,
entity_id STRING,
payload JSON,
source_system STRING,
received_at TIMESTAMP
);
Benefits:
- Complete data preservation
- Enables reprocessing if transformation logic changes
- Supports multiple downstream consumers
This is the foundation layer - other models derive from it.
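A sketch of one such derivation, assuming Postgres-style JSON accessors (the view name and extracted property are illustrative):
-- A cleaned, typed view derived from the raw stream
CREATE VIEW page_views AS
SELECT event_id,
       entity_id AS user_id,
       event_time,
       payload->>'page' AS page
FROM raw_events
WHERE event_type = 'page_view';
Because the raw layer is never mutated, views like this can be redefined and backfilled at any time.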
Aggregated Summaries
Pre-compute common aggregations for fast querying:
CREATE TABLE hourly_event_counts (
hour_start TIMESTAMP,
event_type STRING,
entity_type STRING,
event_count BIGINT,
unique_entities BIGINT,
PRIMARY KEY (hour_start, event_type, entity_type)
);
Updated continuously by streaming processors:
# Streaming aggregation (conceptual, PySpark-style; `stream` is a streaming DataFrame)
from pyspark.sql.functions import window, count, approx_count_distinct

(stream
    .withWatermark("event_time", "10 minutes")  # bound the state kept for late data
    .groupBy(window("event_time", "1 hour"), "event_type", "entity_type")
    .agg(count("*").alias("event_count"),
         # exact distinct counts aren't supported in most streaming engines
         approx_count_distinct("entity_id").alias("unique_entities"))
    .writeStream.toTable("hourly_event_counts"))
Materialized State Tables
Maintain current state derived from event streams:
CREATE TABLE current_session_state (
session_id STRING PRIMARY KEY,
user_id STRING,
session_start TIMESTAMP,
last_activity TIMESTAMP,
page_view_count INT,
current_page STRING,
is_active BOOLEAN
);
Updated on each event:
-- On each page_view event for a session (conceptual: the event.* values
-- are bound from the incoming event)
UPDATE current_session_state
SET last_activity = event.event_time,
    page_view_count = page_view_count + 1,
    current_page = event.page,
    is_active = true
WHERE session_id = event.session_id;
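The UPDATE above assumes the session row already exists; the first event for a session must create it, so state maintenance is really an upsert. A sketch using standard MERGE, where incoming_event stands for a one-row relation carrying the event:
-- Upsert: update an existing session, or create it on the first event
MERGE INTO current_session_state s
USING incoming_event e
ON s.session_id = e.session_id
WHEN MATCHED THEN UPDATE SET
    last_activity = e.event_time,
    page_view_count = s.page_view_count + 1,
    current_page = e.page,
    is_active = true
WHEN NOT MATCHED THEN INSERT
    (session_id, user_id, session_start, last_activity,
     page_view_count, current_page, is_active)
VALUES (e.session_id, e.user_id, e.event_time, e.event_time, 1, e.page, true);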
Lambda Architecture Tables
Tables that combine batch (accurate) and streaming (fast) data:
CREATE TABLE user_activity_metrics (
user_id STRING,
metric_date DATE,
lifetime_events BIGINT, -- From batch processing (accurate)
today_events BIGINT, -- From streaming (fast but approximate)
combined_events BIGINT GENERATED ALWAYS AS (lifetime_events + today_events)
);
Batch process updates lifetime_events nightly; streaming updates today_events continuously.
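A sketch of the nightly handoff, assuming raw_events is the accurate source of record and entity_id holds the user key:
-- Nightly batch: recompute the accurate total, then reset the fast counter
UPDATE user_activity_metrics m
SET lifetime_events = (
        SELECT COUNT(*) FROM raw_events r
        WHERE r.entity_id = m.user_id
          AND r.event_time < CURRENT_DATE),
    today_events = 0,
    metric_date = CURRENT_DATE;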
Handling Real-Time Complexity
Late-Arriving Data
Events don't always arrive in order. A mobile app might batch events when offline, sending them hours later.
Watermark strategy: Define how long to wait for late data before closing windows:
# Allow 10 minutes for late data
stream.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "1 hour"))
.count()
Correction strategy: Process late data separately and publish corrections:
-- Original window result
INSERT INTO hourly_counts (window_start, event_type, event_count)
VALUES ('2024-02-15 10:00', 'page_view', 1523);
-- Correction after late data arrives
INSERT INTO hourly_count_corrections
    (window_start, event_type, original_count, corrected_count, corrected_at)
VALUES ('2024-02-15 10:00', 'page_view', 1523, 1547, NOW());
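Consumers can then read through a view that prefers the correction when one exists (column names follow the inserts above):
-- Serving view: corrected count wins over the original window result
-- (if multiple corrections per window are possible, pick the latest by corrected_at)
CREATE VIEW hourly_counts_corrected AS
SELECT h.window_start,
       h.event_type,
       COALESCE(c.corrected_count, h.event_count) AS event_count
FROM hourly_counts h
LEFT JOIN hourly_count_corrections c
    ON c.window_start = h.window_start
   AND c.event_type = h.event_type;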
Exactly-Once Semantics
Ensuring each event is processed exactly once - not skipped, not duplicated:
- Idempotent writes: Using event_id as the key so duplicate events overwrite rather than add (see the sketch after this list)
- Transactional updates: Atomic commits of processing progress and output
- Deduplication: Explicit tracking of processed event IDs
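A minimal sketch of an idempotent write, using PostgreSQL-style conflict handling on the event key (other engines spell this differently, e.g. MERGE or INSERT IGNORE):
-- Replaying the same event is harmless: the duplicate is silently dropped
INSERT INTO raw_events (event_id, event_time, event_type, entity_id, payload)
VALUES ('evt_123', '2024-02-15 10:03:00', 'page_view', 'user_456', '{"page": "/products"}')
ON CONFLICT (event_id) DO NOTHING;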
Backpressure Handling
When downstream can't keep up with upstream event rates:
- Buffering: Queue events temporarily during spikes
- Sampling: Process a representative subset under load (sketched after this list)
- Degradation: Reduce aggregation granularity when overwhelmed
Model design affects backpressure tolerance - simpler aggregations handle spikes better.
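As one example, deterministic sampling can be expressed as a filter on a hash of the event key. A sketch where HASH is a placeholder for your engine's hash function (hypothetical here):
-- Keep a stable ~10% sample under load; the same event always falls
-- in or out of the sample, so downstream results stay consistent
SELECT *
FROM events
WHERE MOD(ABS(HASH(event_id)), 10) = 0;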
Storage Considerations
Append-Optimized Storage
Choose storage optimized for write-heavy workloads:
- Time-series databases: InfluxDB, TimescaleDB, QuestDB
- Streaming-native stores: Apache Kafka (with ksqlDB), Apache Druid
- Columnar with streaming support: ClickHouse, Apache Pinot
Tiered Storage
Hot-warm-cold architecture for cost efficiency (a DDL sketch follows the list):
- Hot tier: Recent data (hours/days) in fast storage for real-time queries
- Warm tier: Recent history (weeks/months) in columnar storage
- Cold tier: Historical data in object storage for batch access
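Some engines express tiering directly in DDL. A ClickHouse-style sketch, assuming a storage policy named 'tiered' with a 'cold' volume has been configured:
CREATE TABLE events_tiered (
    event_id String,
    event_type String,
    event_time DateTime
) ENGINE = MergeTree
ORDER BY event_time
TTL event_time + INTERVAL 30 DAY TO VOLUME 'cold',  -- demote after 30 days
    event_time + INTERVAL 365 DAY DELETE            -- expire after a year
SETTINGS storage_policy = 'tiered';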
Compaction Strategies
As data ages, compact for efficiency (a roll-up sketch follows the list):
- Merge small time-partitioned files
- Roll up detailed events into aggregates
- Apply late-arriving corrections into base tables
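A roll-up sketch that folds aged-out detail into the hourly aggregate table from earlier (the constant entity_type is illustrative; partition-drop syntax is engine-specific):
-- Roll detail older than 90 days into hourly aggregates
INSERT INTO hourly_event_counts
    (hour_start, event_type, entity_type, event_count, unique_entities)
SELECT DATE_TRUNC('hour', event_time),
       event_type,
       'user',
       COUNT(*),
       COUNT(DISTINCT entity_id)
FROM raw_events
WHERE event_time < CURRENT_DATE - INTERVAL '90' DAY
GROUP BY 1, 2;
-- then drop the detailed partitions that were rolled up (engine-specific DDL)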
AI Analytics Considerations
Freshness vs. Stability Trade-off
AI queries benefit from stable data - results that don't change between query executions. Real-time data is inherently unstable.
Recommendations:
- For AI-driven insights, prefer micro-batch over pure streaming
- Document expected data stability in semantic layer
- Consider serving yesterday's batch data for consistent AI training (see the sketch below)
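A sketch of that last point: AI consumers read from a snapshot frozen at the day boundary, so repeated queries return identical results (table and column names are illustrative):
-- Rebuilt once per day; excludes today's still-changing data
CREATE TABLE ai_stable_user_metrics AS
SELECT entity_id AS user_id,
       COUNT(*) AS total_events,
       MAX(event_time) AS last_event_time
FROM raw_events
WHERE event_time < CURRENT_DATE
GROUP BY entity_id;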
Query Result Caching
Real-time data makes caching tricky - cached results quickly become stale.
Approaches:
- Short TTL caching (seconds to minutes)
- Semantic cache invalidation (new data arriving for a queried time range invalidates the cached result)
- Accept slight staleness for non-critical queries
Semantic Layer for Streaming
Document real-time characteristics:
table:
name: live_session_metrics
update_pattern: streaming
typical_latency: 5_seconds
consistency: eventual
late_data_handling: 10_minute_window
metrics:
- name: active_sessions
description: Currently active user sessions
calculation: COUNT(*) WHERE is_active = true
freshness: real_time
stability: low # Value changes continuously
This helps AI systems set appropriate expectations and choose between real-time and batch sources based on query requirements.
Common Mistakes
Over-engineering for real-time: Building streaming infrastructure for use cases that would be fine with hourly batch. The complexity cost is rarely justified.
Ignoring late data: Assuming events arrive in order. They don't - model accordingly.
Missing backpressure planning: Systems that work under normal load but fail under spikes. Design for peak, not average.
Coupling streaming and serving: Using the same system for stream processing and analytical queries. Separate concerns - streaming processors feed serving layers.
Forgetting batch fallback: Real-time systems fail. Have batch-based recovery paths for when streaming breaks.
Real-time data modeling enables powerful capabilities but requires careful design. Start with clear requirements - actual latency needs, not aspirational goals. Build incrementally, prove value at each stage, and maintain the operational discipline that continuous processing demands.
Questions
What's the difference between real-time and near-real-time?
Real-time typically means sub-second latency - seeing events as they happen. Near-real-time usually means minutes of latency - micro-batch processing at short intervals. The technical approaches differ significantly: true real-time requires streaming architectures, while near-real-time can often use fast batch processing. Choose based on actual business requirements, not aspirational goals.