Hierarchical Data Modeling: Representing Parent-Child Relationships in Analytics
Hierarchical data - organizational structures, product categories, geographic regions - requires special modeling techniques. Learn the patterns for modeling, querying, and maintaining hierarchical data for analytics and AI.
Hierarchical data represents parent-child relationships where entities are organized in tree structures. Common examples include organizational charts (CEO to departments to teams to individuals), geographic regions (country to state to city to zip code), and product categories (electronics to computers to laptops to gaming laptops).
Modeling hierarchies correctly is essential for analytics that roll up or drill down through these structures. Poor hierarchy modeling leads to incorrect aggregations, broken drill-down paths, and AI systems that cannot navigate organizational structures properly.
Hierarchy Characteristics
Depth
Fixed depth: Every branch has the same number of levels
- Geographic: Country (1) > State (2) > City (3) > Zip (4)
- Time: Year > Quarter > Month > Day
Variable depth: Branches have different numbers of levels
- Org chart: CEO may have 3 levels to some roles, 7 to others
- Product categories: Some products have 2 category levels, others 6
Balance
Balanced hierarchies: All leaf nodes are at the same level
- Calendar hierarchy: Every day is at level 4 (Year > Quarter > Month > Day)
Ragged hierarchies: Leaf nodes appear at different levels
- Geography: Some sales go to City level, others only to Country level
Cardinality
Single-parent: Each node has exactly one parent
- Standard org chart: Each employee has one manager
Multi-parent: Nodes can have multiple parents
- Product attributes: A laptop might be in both "Electronics > Computers" and "Office > Equipment"
Modeling Patterns
Flattened Columns (Fixed Depth)
For fixed-depth hierarchies, denormalize into separate columns:
CREATE TABLE dim_geography (
geo_key INT PRIMARY KEY,
zip_code VARCHAR(10),
city VARCHAR(100),
state VARCHAR(50),
country VARCHAR(50),
region VARCHAR(50) -- e.g., "North America", "EMEA"
);
Every level is a column. Queries are simple:
-- Aggregate to country level
SELECT country, SUM(sales)
FROM facts f
JOIN dim_geography g ON f.geo_key = g.geo_key
GROUP BY country;
-- Drill down from country to state
SELECT state, SUM(sales)
FROM facts f
JOIN dim_geography g ON f.geo_key = g.geo_key
WHERE country = 'USA'
GROUP BY state;
Pros:
- Simple queries - no recursive logic needed
- Excellent query performance
- Easy for AI systems to understand
Cons:
- Schema changes needed for new levels
- Doesn't handle variable-depth hierarchies
- Redundant storage of parent values
Adjacency List (Variable Depth)
Each row stores a reference to its parent:
CREATE TABLE dim_org (
employee_key INT PRIMARY KEY,
employee_name VARCHAR(100),
title VARCHAR(100),
parent_key INT REFERENCES dim_org(employee_key),
hierarchy_level INT -- Distance from root
);
Data example:
| employee_key | employee_name | parent_key | hierarchy_level |
|---|---|---|---|
| 1 | Sarah (CEO) | NULL | 0 |
| 2 | Mike (VP Sales) | 1 | 1 |
| 3 | Lisa (Sales Manager) | 2 | 2 |
| 4 | Tom (Sales Rep) | 3 | 3 |
Querying requires recursion:
-- All employees under Mike (VP Sales)
WITH RECURSIVE subordinates AS (
SELECT employee_key, employee_name, parent_key, 1 as depth
FROM dim_org
WHERE employee_key = 2
UNION ALL
SELECT o.employee_key, o.employee_name, o.parent_key, s.depth + 1
FROM dim_org o
JOIN subordinates s ON o.parent_key = s.employee_key
)
SELECT * FROM subordinates;
Pros:
- Handles any depth
- Simple to update (change parent_key)
- Natural representation
Cons:
- Recursive queries are complex
- Performance degrades with deep hierarchies
- Not supported in all analytics tools
Path Enumeration (Materialized Path)
Store the full path from root to each node:
CREATE TABLE dim_category (
category_key INT PRIMARY KEY,
category_name VARCHAR(100),
category_path VARCHAR(500), -- e.g., '/Electronics/Computers/Laptops'
path_array TEXT[], -- e.g., ['Electronics', 'Computers', 'Laptops']
hierarchy_level INT
);
Queries use path matching:
-- All categories under Electronics
SELECT * FROM dim_category
WHERE category_path LIKE '/Electronics/%';
-- All ancestor categories of Laptops
SELECT * FROM dim_category
WHERE '/Electronics/Computers/Laptops' LIKE category_path || '%';
Pros:
- Fast subtree queries (index on path)
- No recursion needed
- Works in all databases
Cons:
- Path updates cascade to all descendants
- Path length limits may apply
- Separator character must not appear in names
Nested Sets
Each node stores left and right bounds that encompass all descendants:
CREATE TABLE dim_product_category (
category_key INT PRIMARY KEY,
category_name VARCHAR(100),
lft INT, -- Left bound
rgt INT, -- Right bound
hierarchy_level INT
);
For a hierarchy:
Electronics (1, 10)
├── Computers (2, 7)
│ ├── Laptops (3, 4)
│ └── Desktops (5, 6)
└── Audio (8, 9)
Queries are elegant:
-- All descendants of Computers (lft=2, rgt=7)
SELECT * FROM dim_product_category
WHERE lft > 2 AND rgt < 7;
-- All ancestors of Laptops (lft=3, rgt=4)
SELECT * FROM dim_product_category
WHERE lft < 3 AND rgt > 4;
-- Count of products by category including subcategories
SELECT c.category_name, COUNT(*)
FROM products p
JOIN dim_product_category pc ON p.category_key = pc.category_key
JOIN dim_product_category c ON pc.lft BETWEEN c.lft AND c.rgt
GROUP BY c.category_key, c.category_name;
Pros:
- Very fast read queries
- No recursion needed
- Efficient subtree operations
Cons:
- Updates are expensive (may require renumbering)
- Not intuitive to understand
- Insert/delete complexity
Bridge Tables
A separate table mapping all ancestor-descendant relationships:
CREATE TABLE org_hierarchy_bridge (
ancestor_key INT,
descendant_key INT,
distance INT, -- Levels between them
PRIMARY KEY (ancestor_key, descendant_key)
);
For employee 4 (Tom) under manager 3 under VP 2 under CEO 1:
| ancestor_key | descendant_key | distance |
|---|---|---|
| 1 | 1 | 0 |
| 1 | 2 | 1 |
| 1 | 3 | 2 |
| 1 | 4 | 3 |
| 2 | 2 | 0 |
| 2 | 3 | 1 |
| 2 | 4 | 2 |
| 3 | 3 | 0 |
| 3 | 4 | 1 |
| 4 | 4 | 0 |
Queries use simple joins:
-- All sales under VP Mike (employee_key = 2)
SELECT SUM(f.sales)
FROM facts f
JOIN org_hierarchy_bridge b ON f.employee_key = b.descendant_key
WHERE b.ancestor_key = 2;
Pros:
- Fast queries at any aggregation level
- No recursion
- Flexible for complex hierarchies
Cons:
- Storage grows quadratically with hierarchy depth
- Maintenance complexity on hierarchy changes
- Bridge must be rebuilt or updated when hierarchy changes
Handling Ragged Hierarchies
When hierarchies have varying depths, several approaches work:
Repeat Values
Repeat the lowest known level up the missing levels:
| product | category_l3 | category_l2 | category_l1 |
|---|---|---|---|
| Widget A | Widgets | Widgets | Hardware |
| Gadget B | Gadgets | Electronics | Hardware |
"Widgets" appears at L3 and L2 because there's no intermediate level.
NULL Values
Use NULL for missing levels:
| product | category_l3 | category_l2 | category_l1 |
|---|---|---|---|
| Widget A | NULL | Widgets | Hardware |
| Gadget B | Gadgets | Electronics | Hardware |
Queries must handle NULLs appropriately.
Skip Levels
Include a level indicator and query conditionally:
SELECT
CASE level
WHEN 3 THEN category_l3
WHEN 2 THEN category_l2
WHEN 1 THEN category_l1
END as leaf_category
FROM products;
AI Analytics Considerations
Hierarchical data presents specific challenges for AI systems.
Level Ambiguity
When asked "What are the sales by region?" - does region mean:
- Geographic region (North America, EMEA)?
- Sales region (as defined by sales org)?
- The state/province level?
The semantic layer must define hierarchy levels explicitly:
hierarchy:
name: geography
levels:
- name: region
description: Continental grouping (North America, EMEA, APAC)
- name: country
description: Nation-level
- name: state
description: State, province, or equivalent subdivision
- name: city
description: Metropolitan area
Aggregate Ambiguity
"Total sales for Electronics" - does this include:
- Just products directly in Electronics?
- Products in all subcategories (Computers, Audio)?
Document the default behavior:
dimension:
name: product_category
hierarchy_aggregation: inclusive # includes all descendants
Historical Hierarchies
When org structures change, historical analysis becomes complex:
- Q1 sales by team - which team structure?
- If team A split into teams B and C, how to compare?
Consider maintaining hierarchy snapshots or clearly documenting that analyses use current structure.
Query Optimization
Index Strategies
- Flattened columns: Index each level column
- Adjacency list: Index parent_key for traversal
- Path enumeration: Index path for LIKE queries
- Nested sets: Index (lft, rgt) for range queries
- Bridge tables: Index on ancestor_key and descendant_key
Materialized Aggregates
Pre-compute aggregates at each hierarchy level:
CREATE TABLE sales_by_category AS
SELECT
b.ancestor_key as category_key,
SUM(f.sales) as total_sales
FROM facts f
JOIN category_bridge b ON f.category_key = b.descendant_key
GROUP BY b.ancestor_key;
Refresh as data or hierarchy changes.
Flattening for Analytics
Even if you maintain normalized hierarchies for flexibility, consider flattened views for analytics:
CREATE VIEW dim_org_flat AS
WITH RECURSIVE org_tree AS (
-- Build hierarchy recursively, then flatten
...
)
SELECT
employee_key,
employee_name,
manager_name,
director_name,
vp_name,
ceo_name
FROM org_tree;
AI and BI tools query the simple view; the underlying structure remains flexible.
Hierarchical data modeling requires matching the approach to hierarchy characteristics and query patterns. Fixed-depth hierarchies suit flattened columns. Variable-depth hierarchies need adjacency lists, path enumeration, or bridge tables. Document hierarchy semantics clearly in the semantic layer so AI systems can navigate structures correctly.
Questions
It depends on hierarchy characteristics. For fixed-depth hierarchies (Country > State > City), flattened columns work well. For variable-depth or frequently changing hierarchies (org charts, product taxonomies), adjacency lists or path enumeration are more flexible. Consider query patterns too - drill-down queries favor different structures than aggregate-up queries.