Metadata Extraction from Redshift: Building Semantic Layers from Your Data Warehouse
Learn how to extract metadata from Amazon Redshift for semantic layers, including system catalog queries, schema discovery, and relationship identification for AI-powered analytics.
Metadata extraction from Amazon Redshift is the process of programmatically reading schema structures, table definitions, column information, and relationships from your Redshift data warehouse to build intelligent semantic layers. Redshift provides comprehensive system catalogs and views that expose detailed metadata about your data warehouse structure, enabling automated discovery and semantic model generation.
This extracted metadata forms the foundation for AI-powered analytics, allowing semantic layers to understand your data warehouse and generate accurate queries.
Redshift Metadata Architecture
System Catalog Structure
Redshift provides metadata through multiple interfaces:
PostgreSQL system catalogs (pg_catalog):
- pg_tables, pg_views
- pg_columns
- pg_constraint
- pg_namespace
Redshift system views (svv_):
- svv_tables, svv_columns
- svv_table_info
- svv_external_tables
- svv_all_schemas
Information schema:
- information_schema.tables
- information_schema.columns
- information_schema.table_constraints
Metadata Components
Redshift stores rich metadata:
- Database and schema organization
- Table definitions and storage properties
- Column names, types, and encoding
- Distribution and sort key configuration
- Primary and foreign key constraints
- Table statistics and row counts
- Query history and access patterns
Codd AI Integrations connect to Redshift system catalogs to extract this metadata automatically, building semantic layers that understand your data warehouse structure and optimize query generation.
Metadata Extraction Queries
Schema and Table Discovery
Query available schemas and tables:
-- List all schemas
SELECT schema_name, schema_owner
FROM svv_all_schemas
WHERE schema_name NOT IN ('pg_catalog', 'information_schema');
-- Discover tables with details
SELECT
schema_name,
table_name,
table_type,
table_owner,
remarks
FROM svv_tables
WHERE schema_name = 'analytics';
-- Table statistics and storage
SELECT
schema as schema_name,
"table" as table_name,
diststyle,
sortkey1,
tbl_rows,
size as size_mb
FROM svv_table_info
WHERE schema = 'analytics';
Column Metadata
Extract detailed column information:
-- Column definitions
SELECT
table_schema,
table_name,
column_name,
ordinal_position,
data_type,
character_maximum_length,
numeric_precision,
is_nullable,
column_default
FROM svv_columns
WHERE table_schema = 'analytics'
ORDER BY table_name, ordinal_position;
-- Column encoding and compression
SELECT
schemaname,
tablename,
column,
type,
encoding,
distkey,
sortkey
FROM pg_table_def
WHERE schemaname = 'analytics';
Relationship Discovery
Identify table relationships:
-- Primary keys
SELECT
tc.constraint_name,
tc.table_schema,
tc.table_name,
kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
WHERE tc.constraint_type = 'PRIMARY KEY'
AND tc.table_schema = 'analytics';
-- Foreign keys
SELECT
tc.constraint_name,
tc.table_schema,
tc.table_name,
kcu.column_name,
ccu.table_schema AS foreign_table_schema,
ccu.table_name AS foreign_table_name,
ccu.column_name AS foreign_column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
ON ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
External Table Metadata
Capture Redshift Spectrum tables:
-- External schemas
SELECT
schemaname,
databasename,
esoptions
FROM svv_external_schemas;
-- External tables
SELECT
schemaname,
tablename,
location,
input_format,
output_format
FROM svv_external_tables;
-- External columns
SELECT
schemaname,
tablename,
columnname,
external_type,
columnnum
FROM svv_external_columns
ORDER BY schemaname, tablename, columnnum;
Query Pattern Analysis
Understand data usage:
-- Recent query patterns
SELECT
userid,
query,
querytxt,
starttime,
endtime
FROM stl_query
WHERE starttime > DATEADD(day, -30, GETDATE())
AND querytxt LIKE '%analytics.%'
ORDER BY starttime DESC
LIMIT 1000;
-- Table access frequency
SELECT
table_id,
"table" as table_name,
pct_mem_used,
rows,
unsorted_rows
FROM svv_table_info
ORDER BY rows DESC;
Building Semantic Models
Table to Entity Mapping
Transform Redshift metadata into semantic concepts:
Tables map to entities:
- Distribution style informs join optimization
- Sort keys indicate common query patterns
- Row counts help with cardinality understanding
Columns become attributes:
- VARCHAR becomes text dimension
- INTEGER, DECIMAL become measures
- DATE, TIMESTAMP become time dimensions
- Encoding hints at column usage patterns
Relationship Modeling
Leverage constraint metadata:
Primary keys define unique identifiers:
- Map to entity primary keys
- Enable record-level operations
Foreign keys define relationships:
- Map to semantic layer joins
- Inform navigation between entities
Inferred relationships fill gaps:
- Match column names across tables
- Analyze query join patterns
- Review distribution key relationships
Distribution-Aware Semantics
Optimize semantic queries using distribution metadata:
KEY distribution:
- Join on distribution key for collocated joins
- Semantic layer should prefer these join patterns
ALL distribution:
- Small tables replicated to all nodes
- Efficient for broadcast joins
EVEN distribution:
- Data spread across nodes
- Consider for large fact tables
Implementation Approach
Step 1: Configure Connection
Set up secure access to Redshift:
import psycopg2
conn = psycopg2.connect(
host='cluster.region.redshift.amazonaws.com',
port=5439,
database='analytics',
user='extraction_user',
password='secure_password'
)
Consider using IAM authentication for enhanced security.
Step 2: Extract Schema Metadata
Gather comprehensive schema information:
def extract_tables(conn, schema):
query = """
SELECT table_name, table_type, remarks
FROM svv_tables
WHERE schema_name = %s
"""
cursor = conn.cursor()
cursor.execute(query, (schema,))
return cursor.fetchall()
def extract_columns(conn, schema, table):
query = """
SELECT column_name, data_type, is_nullable,
ordinal_position
FROM svv_columns
WHERE table_schema = %s AND table_name = %s
ORDER BY ordinal_position
"""
cursor = conn.cursor()
cursor.execute(query, (schema, table))
return cursor.fetchall()
Step 3: Capture Relationships
Extract constraint information:
def extract_relationships(conn, schema):
query = """
SELECT
tc.table_name,
kcu.column_name,
ccu.table_name AS referenced_table,
ccu.column_name AS referenced_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
ON ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
AND tc.table_schema = %s
"""
cursor = conn.cursor()
cursor.execute(query, (schema,))
return cursor.fetchall()
Step 4: Generate Semantic Model
Transform metadata into semantic definitions:
entity: Orders
source: analytics.orders
description: "Customer order transactions"
distribution_key: customer_id
sort_key: order_date
dimensions:
- name: order_date
source_column: order_date
type: time
- name: order_status
source_column: status
type: categorical
measures:
- name: revenue
calculation: SUM(order_total)
format: currency
relationships:
- name: customer
target_entity: Customers
join_type: many_to_one
join_columns:
- source: customer_id
target: customer_id
Step 5: Schedule Synchronization
Maintain metadata currency:
- Run extraction during maintenance windows
- Compare against previous extraction
- Update semantic models incrementally
- Alert on schema changes
Common Challenges
Cross-Database Metadata
Redshift allows cross-database queries:
Solutions:
- Extract from each database
- Use datashare metadata for shared data
- Aggregate into unified semantic layer
External Table Integration
Spectrum tables have different metadata patterns:
Solutions:
- Query svv_external_* views
- Map external types to internal types
- Handle location and format information
Performance Considerations
Large clusters with many objects:
Solutions:
- Limit extraction to relevant schemas
- Use incremental extraction
- Cache metadata locally
- Schedule during low-usage periods
The Value of Redshift Metadata Extraction
Automated metadata extraction from Redshift provides:
Comprehensive visibility: Understand all schemas, tables, and relationships.
Performance optimization: Leverage distribution and sort keys.
Relationship discovery: Map foreign keys to semantic joins.
Spectrum integration: Include external data in semantic layer.
Organizations that automate Redshift metadata extraction build semantic layers that generate optimized queries - enabling AI-powered analytics that are both accurate and performant across your data warehouse.
Questions
Redshift provides metadata through PG_CATALOG system tables (PostgreSQL-compatible) and SVV_ views. Key sources include svv_tables, svv_columns, svv_table_info for table metadata, and pg_constraint for relationships. The information_schema views also provide standard SQL metadata access.