Activity Schema Modeling: Event-Centric Data Modeling for Modern Analytics
Activity schema is a data modeling approach that organizes data around business activities rather than entities. Learn how activity schemas work, when to use them, and how they support AI-powered analytics.
Activity schema is a data modeling approach that organizes analytical data around activities - the actions and events that entities perform over time. Rather than modeling separate fact tables for different business processes, activity schema consolidates all activities into a unified structure optimized for behavioral analysis.
The approach has gained popularity with the rise of product analytics, customer journey analysis, and behavioral segmentation. When the primary analytical questions focus on "what did users do and in what sequence," activity schema provides a simpler, more flexible foundation than traditional dimensional models.
Core Concept: Activities as the Center
In traditional dimensional modeling, different business processes get separate fact tables:
- Orders fact table for purchases
- Page views fact table for web activity
- Support tickets fact table for service interactions
- Email events fact table for marketing engagement
Each table has its own schema optimized for that process. Cross-process analysis requires joining these tables - often complex and slow.
Activity schema inverts this by creating a single activities table:
CREATE TABLE activities (
activity_id BIGINT PRIMARY KEY,
entity_id VARCHAR(100), -- Customer, user, or account
activity_type VARCHAR(100), -- Order, page_view, ticket_opened, email_clicked
activity_timestamp TIMESTAMP,
activity_date DATE,
-- Common dimensions
channel VARCHAR(50),
device_type VARCHAR(50),
geo_country VARCHAR(50),
-- Flexible attributes
activity_attributes JSONB, -- Activity-specific data
-- Metrics (when applicable)
revenue_amount DECIMAL(12,2),
quantity INT
);
Every activity - regardless of type - goes in this table. Behavioral analysis becomes straightforward:
-- User journey: all activities for a customer in sequence
SELECT activity_type, activity_timestamp
FROM activities
WHERE entity_id = 'customer_123'
ORDER BY activity_timestamp;
-- Conversion funnel: activities leading to purchase
SELECT entity_id,
MAX(CASE WHEN activity_type = 'page_view' THEN activity_timestamp END) as first_view,
MAX(CASE WHEN activity_type = 'add_to_cart' THEN activity_timestamp END) as cart_add,
MAX(CASE WHEN activity_type = 'purchase' THEN activity_timestamp END) as purchase_time
FROM activities
WHERE entity_id IN (SELECT entity_id FROM activities WHERE activity_type = 'purchase')
GROUP BY entity_id;
Schema Design Patterns
The Standard Activity Table
The minimal activity schema includes:
| Column | Description |
|---|---|
| activity_id | Unique identifier for each activity |
| entity_id | The actor performing the activity (user, customer, account) |
| activity_type | Classification of the activity |
| activity_timestamp | When the activity occurred |
| activity_attributes | Flexible storage for type-specific data |
Additional common columns might include:
- Session ID for grouping related activities
- Channel/source information
- Geographic attributes
- Revenue/quantity when applicable
Handling Activity-Specific Attributes
Different activities have different attributes - an order has products and amounts, a page view has URL and duration, an email click has campaign and link. Options for handling this:
JSON columns: Store activity-specific data in a JSON/JSONB column
-- Order activity
{"order_id": "123", "product_ids": ["A", "B"], "total": 99.50}
-- Page view activity
{"url": "/products/widget", "duration_seconds": 45, "referrer": "google"}
Pros: Fully flexible, no schema changes needed Cons: Query performance on JSON, no type safety
Denormalized columns: Include common attributes as columns, NULL when not applicable
CREATE TABLE activities (
...
order_id VARCHAR(50), -- Only for order activities
page_url VARCHAR(500), -- Only for page view activities
email_campaign_id VARCHAR(50), -- Only for email activities
...
);
Pros: Query performance, type safety Cons: Sparse columns, schema changes for new attributes
Linked detail tables: Keep activities lean, link to type-specific tables
CREATE TABLE activity_order_details (
activity_id BIGINT PRIMARY KEY,
order_id VARCHAR(50),
product_count INT,
total_amount DECIMAL(12,2)
);
Pros: Clean schema, type-specific optimization Cons: Requires joins for complete data
In practice, hybrid approaches work well - common high-value attributes as columns, everything else in JSON.
Entity Resolution
Activity schema requires consistent entity identification. When activities come from different systems:
- Web analytics tracks anonymous cookie IDs
- App analytics tracks device IDs
- CRM tracks customer IDs
- Support tracks email addresses
An identity resolution layer maps these to a single entity_id:
CREATE TABLE entity_identities (
entity_id VARCHAR(100),
identity_type VARCHAR(50),
identity_value VARCHAR(200),
confidence_score DECIMAL(3,2),
PRIMARY KEY (identity_type, identity_value)
);
Activities are joined through this mapping to create unified behavioral views.
Query Patterns
Sequential Analysis
Activity schema excels at sequential queries - what happened before/after specific activities:
-- What do users do after viewing pricing page?
WITH pricing_views AS (
SELECT entity_id, activity_timestamp as view_time
FROM activities
WHERE activity_type = 'page_view'
AND activity_attributes->>'url' = '/pricing'
)
SELECT a.activity_type, COUNT(*) as occurrences
FROM activities a
JOIN pricing_views p ON a.entity_id = p.entity_id
WHERE a.activity_timestamp > p.view_time
AND a.activity_timestamp < p.view_time + INTERVAL '1 hour'
GROUP BY a.activity_type
ORDER BY occurrences DESC;
Funnel Analysis
Multi-step conversion funnels are natural in activity schema:
-- Signup funnel completion rates
SELECT
COUNT(DISTINCT CASE WHEN step >= 1 THEN entity_id END) as visited_signup,
COUNT(DISTINCT CASE WHEN step >= 2 THEN entity_id END) as started_form,
COUNT(DISTINCT CASE WHEN step >= 3 THEN entity_id END) as completed_signup
FROM (
SELECT entity_id,
MAX(CASE WHEN activity_type = 'page_view' AND activity_attributes->>'url' = '/signup' THEN 1
WHEN activity_type = 'form_start' THEN 2
WHEN activity_type = 'signup_complete' THEN 3 END) as step
FROM activities
WHERE activity_date >= CURRENT_DATE - 30
GROUP BY entity_id
) funnel;
Cohort Analysis
Grouping entities by shared characteristics and comparing behaviors:
-- Retention by signup cohort
WITH cohorts AS (
SELECT entity_id, DATE_TRUNC('week', MIN(activity_timestamp)) as cohort_week
FROM activities
WHERE activity_type = 'signup_complete'
GROUP BY entity_id
)
SELECT
c.cohort_week,
COUNT(DISTINCT c.entity_id) as cohort_size,
COUNT(DISTINCT CASE WHEN a.activity_timestamp >= c.cohort_week + INTERVAL '7 days'
AND a.activity_timestamp < c.cohort_week + INTERVAL '14 days'
THEN a.entity_id END) as week_2_active
FROM cohorts c
LEFT JOIN activities a ON c.entity_id = a.entity_id
GROUP BY c.cohort_week;
Session Analysis
Grouping activities into sessions for engagement analysis:
-- Session duration and depth
WITH session_bounds AS (
SELECT entity_id,
activity_timestamp,
SUM(CASE WHEN activity_timestamp - LAG(activity_timestamp)
OVER (PARTITION BY entity_id ORDER BY activity_timestamp)
> INTERVAL '30 minutes' THEN 1 ELSE 0 END)
OVER (PARTITION BY entity_id ORDER BY activity_timestamp) as session_num
FROM activities
)
SELECT entity_id, session_num,
MIN(activity_timestamp) as session_start,
MAX(activity_timestamp) as session_end,
COUNT(*) as activities_in_session
FROM session_bounds
GROUP BY entity_id, session_num;
AI Analytics Considerations
Activity schema's flat structure is well-suited for AI analytics, but requires proper semantic definition.
Clear Activity Type Taxonomy
AI systems need to understand what each activity type represents:
activity_types:
- name: page_view
description: User viewed a page on the website or app
entity_type: user
attributes:
- url: The page URL viewed
- duration_seconds: Time spent on page
- name: purchase
description: User completed a purchase transaction
entity_type: user
attributes:
- order_id: Unique order identifier
- revenue: Transaction amount in USD
Without this documentation, AI cannot reliably interpret activity_type values.
Behavioral Metric Definitions
Common behavioral metrics should be explicitly defined:
metrics:
- name: conversion_rate
description: Percentage of users who purchased after viewing product
calculation: |
COUNT(DISTINCT entity_id WHERE activity_type = 'purchase') /
COUNT(DISTINCT entity_id WHERE activity_type = 'product_view')
valid_dimensions: [channel, device_type, cohort_week]
AI systems can use these definitions rather than inferring calculations.
Temporal Relationship Clarity
Behavioral analysis often involves temporal relationships:
- Activities within a session
- Activities within N days of signup
- Activities before/after specific events
The semantic layer should define standard temporal contexts:
temporal_contexts:
- name: first_session
description: Activities in the user's first session
definition: Activities within 30 minutes of first activity
- name: trial_period
description: Activities during the 14-day trial period
definition: Activities within 14 days of signup_complete activity
When to Use Activity Schema
Strong fit:
- Product analytics and user behavior analysis
- Customer journey mapping
- Funnel and conversion analysis
- Behavioral segmentation
- Engagement and retention metrics
Less suitable:
- Detailed transactional reporting
- Financial analysis requiring GAAP compliance
- Inventory and supply chain operations
- Complex multi-entity business processes
Activity schema complements rather than replaces dimensional modeling. Organizations often use both - activity schema for behavioral analytics, dimensional models for operational and financial reporting.
Implementation Considerations
Partitioning: Partition by date for query performance - most behavioral queries are time-bounded.
Indexing: Index on (entity_id, activity_timestamp) for journey queries and (activity_type, activity_timestamp) for funnel queries.
Data volume: Activity tables grow quickly. Plan for data lifecycle management - archiving older activities or aggregating to summaries.
Real-time vs. batch: Decide latency requirements. Many activity schemas are near-real-time with streaming ingestion.
Activity schema provides a powerful foundation for behavioral analytics when the primary questions center on what entities did and how those actions relate to outcomes. Combined with proper semantic layer documentation, it enables both human analysts and AI systems to understand user behavior effectively.
Questions
Traditional dimensional modeling centers on fact tables for specific business processes (orders, shipments, support tickets). Activity schema centers on a single activities table capturing all customer or user actions. Rather than joining across fact tables, you query one table with activity type filters. This simplifies behavioral analysis but requires different query patterns.