What's the best way to model hierarchies in a data warehouse?

It depends on hierarchy characteristics. For fixed-depth hierarchies (Country > State > City), flattened columns work well. For variable-depth or frequently changing hierarchies (org charts, product taxonomies), adjacency lists or path enumeration are more flexible. Consider query patterns too - drill-down queries favor different structures than aggregate-up queries.

How do I handle ragged hierarchies?

Ragged hierarchies have varying depths across branches (some products have 3 category levels, others have 5). Options include: padding shorter branches with repeated values, using NULL for missing levels, or modeling with variable-depth structures like path enumeration. The choice affects query complexity and result presentation.

Should hierarchies be in fact tables or dimension tables?

Hierarchies belong in dimension tables. Fact tables should contain foreign keys to the lowest level of the hierarchy (individual product, specific location, direct employee). The dimension table then provides hierarchical attributes for rolling up or filtering. This maintains proper grain in fact tables.

How do historical hierarchy changes affect analytics?

Hierarchy changes (reorgs, category restructuring) create historical analysis challenges. If you analyze 2023 sales with 2024 org structure, results may not match what was reported in 2023. Consider maintaining historical hierarchy versions (Type 2 dimensions) or clearly documenting that historical analyses use current hierarchy structure.

Hierarchical Data Modeling: Representing Parent-Child Relationships in Analytics

Hierarchical data represents parent-child relationships where entities are organized in tree structures. Common examples include organizational charts (CEO to departments to teams to individuals), geographic regions (country to state to city to zip code), and product categories (electronics to computers to laptops to gaming laptops).

Modeling hierarchies correctly is essential for analytics that roll up or drill down through these structures. Poor hierarchy modeling leads to incorrect aggregations, broken drill-down paths, and AI systems that cannot navigate organizational structures properly.

Hierarchy Characteristics

Depth

Fixed depth: Every branch has the same number of levels

Geographic: Country (1) > State (2) > City (3) > Zip (4)
Time: Year > Quarter > Month > Day

Variable depth: Branches have different numbers of levels

Org chart: CEO may have 3 levels to some roles, 7 to others
Product categories: Some products have 2 category levels, others 6

Balance

Balanced hierarchies: All leaf nodes are at the same level

Calendar hierarchy: Every day is at level 4 (Year > Quarter > Month > Day)

Ragged hierarchies: Leaf nodes appear at different levels

Geography: Some sales go to City level, others only to Country level

Cardinality

Single-parent: Each node has exactly one parent

Standard org chart: Each employee has one manager

Multi-parent: Nodes can have multiple parents

Product attributes: A laptop might be in both "Electronics > Computers" and "Office > Equipment"

Modeling Patterns

Flattened Columns (Fixed Depth)

For fixed-depth hierarchies, denormalize into separate columns:

CREATE TABLE dim_geography (
  geo_key INT PRIMARY KEY,
  zip_code VARCHAR(10),
  city VARCHAR(100),
  state VARCHAR(50),
  country VARCHAR(50),
  region VARCHAR(50)        -- e.g., "North America", "EMEA"
);

Every level is a column. Queries are simple:

-- Aggregate to country level
SELECT country, SUM(sales)
FROM facts f
JOIN dim_geography g ON f.geo_key = g.geo_key
GROUP BY country;

-- Drill down from country to state
SELECT state, SUM(sales)
FROM facts f
JOIN dim_geography g ON f.geo_key = g.geo_key
WHERE country = 'USA'
GROUP BY state;

Pros:

Simple queries - no recursive logic needed
Excellent query performance
Easy for AI systems to understand

Cons:

Schema changes needed for new levels
Doesn't handle variable-depth hierarchies
Redundant storage of parent values

Adjacency List (Variable Depth)

Each row stores a reference to its parent:

CREATE TABLE dim_org (
  employee_key INT PRIMARY KEY,
  employee_name VARCHAR(100),
  title VARCHAR(100),
  parent_key INT REFERENCES dim_org(employee_key),
  hierarchy_level INT       -- Distance from root
);

Data example:

employee_key	employee_name	parent_key	hierarchy_level
1	Sarah (CEO)	NULL	0
2	Mike (VP Sales)	1	1
3	Lisa (Sales Manager)	2	2
4	Tom (Sales Rep)	3	3

Querying requires recursion:

-- All employees under Mike (VP Sales)
WITH RECURSIVE subordinates AS (
  SELECT employee_key, employee_name, parent_key, 1 as depth
  FROM dim_org
  WHERE employee_key = 2

  UNION ALL

  SELECT o.employee_key, o.employee_name, o.parent_key, s.depth + 1
  FROM dim_org o
  JOIN subordinates s ON o.parent_key = s.employee_key
)
SELECT * FROM subordinates;

Pros:

Handles any depth
Simple to update (change parent_key)
Natural representation

Cons:

Recursive queries are complex
Performance degrades with deep hierarchies
Not supported in all analytics tools

Path Enumeration (Materialized Path)

Store the full path from root to each node:

CREATE TABLE dim_category (
  category_key INT PRIMARY KEY,
  category_name VARCHAR(100),
  category_path VARCHAR(500),    -- e.g., '/Electronics/Computers/Laptops'
  path_array TEXT[],             -- e.g., ['Electronics', 'Computers', 'Laptops']
  hierarchy_level INT
);

Queries use path matching:

-- All categories under Electronics
SELECT * FROM dim_category
WHERE category_path LIKE '/Electronics/%';

-- All ancestor categories of Laptops
SELECT * FROM dim_category
WHERE '/Electronics/Computers/Laptops' LIKE category_path || '%';

Pros:

Fast subtree queries (index on path)
No recursion needed
Works in all databases

Cons:

Path updates cascade to all descendants
Path length limits may apply
Separator character must not appear in names

Nested Sets

Each node stores left and right bounds that encompass all descendants:

CREATE TABLE dim_product_category (
  category_key INT PRIMARY KEY,
  category_name VARCHAR(100),
  lft INT,                      -- Left bound
  rgt INT,                      -- Right bound
  hierarchy_level INT
);

For a hierarchy:

Electronics (1, 10)
  ├── Computers (2, 7)
  │   ├── Laptops (3, 4)
  │   └── Desktops (5, 6)
  └── Audio (8, 9)

Queries are elegant:

-- All descendants of Computers (lft=2, rgt=7)
SELECT * FROM dim_product_category
WHERE lft > 2 AND rgt < 7;

-- All ancestors of Laptops (lft=3, rgt=4)
SELECT * FROM dim_product_category
WHERE lft < 3 AND rgt > 4;

-- Count of products by category including subcategories
SELECT c.category_name, COUNT(*)
FROM products p
JOIN dim_product_category pc ON p.category_key = pc.category_key
JOIN dim_product_category c ON pc.lft BETWEEN c.lft AND c.rgt
GROUP BY c.category_key, c.category_name;

Pros:

Very fast read queries
No recursion needed
Efficient subtree operations

Cons:

Updates are expensive (may require renumbering)
Not intuitive to understand
Insert/delete complexity

Bridge Tables

A separate table mapping all ancestor-descendant relationships:

CREATE TABLE org_hierarchy_bridge (
  ancestor_key INT,
  descendant_key INT,
  distance INT,               -- Levels between them
  PRIMARY KEY (ancestor_key, descendant_key)
);

For employee 4 (Tom) under manager 3 under VP 2 under CEO 1:

ancestor_key	descendant_key	distance
1	1	0
1	2	1
1	3	2
1	4	3
2	2	0
2	3	1
2	4	2
3	3	0
3	4	1
4	4	0

Queries use simple joins:

-- All sales under VP Mike (employee_key = 2)
SELECT SUM(f.sales)
FROM facts f
JOIN org_hierarchy_bridge b ON f.employee_key = b.descendant_key
WHERE b.ancestor_key = 2;

Pros:

Fast queries at any aggregation level
No recursion
Flexible for complex hierarchies

Cons:

Storage grows quadratically with hierarchy depth
Maintenance complexity on hierarchy changes
Bridge must be rebuilt or updated when hierarchy changes

Handling Ragged Hierarchies

When hierarchies have varying depths, several approaches work:

Repeat Values

Repeat the lowest known level up the missing levels:

product	category_l3	category_l2	category_l1
Widget A	Widgets	Widgets	Hardware
Gadget B	Gadgets	Electronics	Hardware

"Widgets" appears at L3 and L2 because there's no intermediate level.

NULL Values

Use NULL for missing levels:

product	category_l3	category_l2	category_l1
Widget A	NULL	Widgets	Hardware
Gadget B	Gadgets	Electronics	Hardware

Queries must handle NULLs appropriately.

Skip Levels

Include a level indicator and query conditionally:

SELECT
  CASE level
    WHEN 3 THEN category_l3
    WHEN 2 THEN category_l2
    WHEN 1 THEN category_l1
  END as leaf_category
FROM products;

AI Analytics Considerations

Hierarchical data presents specific challenges for AI systems.

Level Ambiguity

When asked "What are the sales by region?" - does region mean:

Geographic region (North America, EMEA)?
Sales region (as defined by sales org)?
The state/province level?

The semantic layer must define hierarchy levels explicitly:

hierarchy:
  name: geography
  levels:
    - name: region
      description: Continental grouping (North America, EMEA, APAC)
    - name: country
      description: Nation-level
    - name: state
      description: State, province, or equivalent subdivision
    - name: city
      description: Metropolitan area

Aggregate Ambiguity

"Total sales for Electronics" - does this include:

Just products directly in Electronics?
Products in all subcategories (Computers, Audio)?

Document the default behavior:

dimension:
  name: product_category
  hierarchy_aggregation: inclusive  # includes all descendants

Historical Hierarchies

When org structures change, historical analysis becomes complex:

Q1 sales by team - which team structure?
If team A split into teams B and C, how to compare?

Consider maintaining hierarchy snapshots or clearly documenting that analyses use current structure.

Query Optimization

Index Strategies

Flattened columns: Index each level column
Adjacency list: Index parent_key for traversal
Path enumeration: Index path for LIKE queries
Nested sets: Index (lft, rgt) for range queries
Bridge tables: Index on ancestor_key and descendant_key

Materialized Aggregates

Pre-compute aggregates at each hierarchy level:

CREATE TABLE sales_by_category AS
SELECT
  b.ancestor_key as category_key,
  SUM(f.sales) as total_sales
FROM facts f
JOIN category_bridge b ON f.category_key = b.descendant_key
GROUP BY b.ancestor_key;

Refresh as data or hierarchy changes.

Flattening for Analytics

Even if you maintain normalized hierarchies for flexibility, consider flattened views for analytics:

CREATE VIEW dim_org_flat AS
WITH RECURSIVE org_tree AS (
  -- Build hierarchy recursively, then flatten
  ...
)
SELECT
  employee_key,
  employee_name,
  manager_name,
  director_name,
  vp_name,
  ceo_name
FROM org_tree;

AI and BI tools query the simple view; the underlying structure remains flexible.

Hierarchical data modeling requires matching the approach to hierarchy characteristics and query patterns. Fixed-depth hierarchies suit flattened columns. Variable-depth hierarchies need adjacency lists, path enumeration, or bridge tables. Document hierarchy semantics clearly in the semantic layer so AI systems can navigate structures correctly.

Hierarchical Data Modeling: Representing Parent-Child Relationships in Analytics

Hierarchy Characteristics

Depth

Balance

Cardinality

Modeling Patterns

Flattened Columns (Fixed Depth)

Adjacency List (Variable Depth)

Path Enumeration (Materialized Path)

Nested Sets

Bridge Tables

Handling Ragged Hierarchies

Repeat Values

NULL Values

Skip Levels

AI Analytics Considerations

Level Ambiguity

Aggregate Ambiguity

Historical Hierarchies

Query Optimization

Index Strategies

Materialized Aggregates

Flattening for Analytics

Questions

Related