Metadata Extraction from Redshift: Building Semantic Layers from Your Data Warehouse

Learn how to extract metadata from Amazon Redshift for semantic layers, including system catalog queries, schema discovery, and relationship identification for AI-powered analytics.

6 min read·

Metadata extraction from Amazon Redshift is the process of programmatically reading schema structures, table definitions, column information, and relationships from your Redshift data warehouse to build intelligent semantic layers. Redshift provides comprehensive system catalogs and views that expose detailed metadata about your data warehouse structure, enabling automated discovery and semantic model generation.

This extracted metadata forms the foundation for AI-powered analytics, allowing semantic layers to understand your data warehouse and generate accurate queries.

Redshift Metadata Architecture

System Catalog Structure

Redshift provides metadata through multiple interfaces:

PostgreSQL system catalogs (pg_catalog):

  • pg_tables, pg_views
  • pg_columns
  • pg_constraint
  • pg_namespace

Redshift system views (svv_):

  • svv_tables, svv_columns
  • svv_table_info
  • svv_external_tables
  • svv_all_schemas

Information schema:

  • information_schema.tables
  • information_schema.columns
  • information_schema.table_constraints

Metadata Components

Redshift stores rich metadata:

  • Database and schema organization
  • Table definitions and storage properties
  • Column names, types, and encoding
  • Distribution and sort key configuration
  • Primary and foreign key constraints
  • Table statistics and row counts
  • Query history and access patterns

Codd AI Integrations connect to Redshift system catalogs to extract this metadata automatically, building semantic layers that understand your data warehouse structure and optimize query generation.

Metadata Extraction Queries

Schema and Table Discovery

Query available schemas and tables:

-- List all schemas
SELECT schema_name, schema_owner
FROM svv_all_schemas
WHERE schema_name NOT IN ('pg_catalog', 'information_schema');

-- Discover tables with details
SELECT
    schema_name,
    table_name,
    table_type,
    table_owner,
    remarks
FROM svv_tables
WHERE schema_name = 'analytics';

-- Table statistics and storage
SELECT
    schema as schema_name,
    "table" as table_name,
    diststyle,
    sortkey1,
    tbl_rows,
    size as size_mb
FROM svv_table_info
WHERE schema = 'analytics';

Column Metadata

Extract detailed column information:

-- Column definitions
SELECT
    table_schema,
    table_name,
    column_name,
    ordinal_position,
    data_type,
    character_maximum_length,
    numeric_precision,
    is_nullable,
    column_default
FROM svv_columns
WHERE table_schema = 'analytics'
ORDER BY table_name, ordinal_position;

-- Column encoding and compression
SELECT
    schemaname,
    tablename,
    column,
    type,
    encoding,
    distkey,
    sortkey
FROM pg_table_def
WHERE schemaname = 'analytics';

Relationship Discovery

Identify table relationships:

-- Primary keys
SELECT
    tc.constraint_name,
    tc.table_schema,
    tc.table_name,
    kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
    ON tc.constraint_name = kcu.constraint_name
WHERE tc.constraint_type = 'PRIMARY KEY'
    AND tc.table_schema = 'analytics';

-- Foreign keys
SELECT
    tc.constraint_name,
    tc.table_schema,
    tc.table_name,
    kcu.column_name,
    ccu.table_schema AS foreign_table_schema,
    ccu.table_name AS foreign_table_name,
    ccu.column_name AS foreign_column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
    ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
    ON ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';

External Table Metadata

Capture Redshift Spectrum tables:

-- External schemas
SELECT
    schemaname,
    databasename,
    esoptions
FROM svv_external_schemas;

-- External tables
SELECT
    schemaname,
    tablename,
    location,
    input_format,
    output_format
FROM svv_external_tables;

-- External columns
SELECT
    schemaname,
    tablename,
    columnname,
    external_type,
    columnnum
FROM svv_external_columns
ORDER BY schemaname, tablename, columnnum;

Query Pattern Analysis

Understand data usage:

-- Recent query patterns
SELECT
    userid,
    query,
    querytxt,
    starttime,
    endtime
FROM stl_query
WHERE starttime > DATEADD(day, -30, GETDATE())
    AND querytxt LIKE '%analytics.%'
ORDER BY starttime DESC
LIMIT 1000;

-- Table access frequency
SELECT
    table_id,
    "table" as table_name,
    pct_mem_used,
    rows,
    unsorted_rows
FROM svv_table_info
ORDER BY rows DESC;

Building Semantic Models

Table to Entity Mapping

Transform Redshift metadata into semantic concepts:

Tables map to entities:

  • Distribution style informs join optimization
  • Sort keys indicate common query patterns
  • Row counts help with cardinality understanding

Columns become attributes:

  • VARCHAR becomes text dimension
  • INTEGER, DECIMAL become measures
  • DATE, TIMESTAMP become time dimensions
  • Encoding hints at column usage patterns

Relationship Modeling

Leverage constraint metadata:

Primary keys define unique identifiers:

  • Map to entity primary keys
  • Enable record-level operations

Foreign keys define relationships:

  • Map to semantic layer joins
  • Inform navigation between entities

Inferred relationships fill gaps:

  • Match column names across tables
  • Analyze query join patterns
  • Review distribution key relationships

Distribution-Aware Semantics

Optimize semantic queries using distribution metadata:

KEY distribution:

  • Join on distribution key for collocated joins
  • Semantic layer should prefer these join patterns

ALL distribution:

  • Small tables replicated to all nodes
  • Efficient for broadcast joins

EVEN distribution:

  • Data spread across nodes
  • Consider for large fact tables

Implementation Approach

Step 1: Configure Connection

Set up secure access to Redshift:

import psycopg2

conn = psycopg2.connect(
    host='cluster.region.redshift.amazonaws.com',
    port=5439,
    database='analytics',
    user='extraction_user',
    password='secure_password'
)

Consider using IAM authentication for enhanced security.

Step 2: Extract Schema Metadata

Gather comprehensive schema information:

def extract_tables(conn, schema):
    query = """
        SELECT table_name, table_type, remarks
        FROM svv_tables
        WHERE schema_name = %s
    """
    cursor = conn.cursor()
    cursor.execute(query, (schema,))
    return cursor.fetchall()

def extract_columns(conn, schema, table):
    query = """
        SELECT column_name, data_type, is_nullable,
               ordinal_position
        FROM svv_columns
        WHERE table_schema = %s AND table_name = %s
        ORDER BY ordinal_position
    """
    cursor = conn.cursor()
    cursor.execute(query, (schema, table))
    return cursor.fetchall()

Step 3: Capture Relationships

Extract constraint information:

def extract_relationships(conn, schema):
    query = """
        SELECT
            tc.table_name,
            kcu.column_name,
            ccu.table_name AS referenced_table,
            ccu.column_name AS referenced_column
        FROM information_schema.table_constraints tc
        JOIN information_schema.key_column_usage kcu
            ON tc.constraint_name = kcu.constraint_name
        JOIN information_schema.constraint_column_usage ccu
            ON ccu.constraint_name = tc.constraint_name
        WHERE tc.constraint_type = 'FOREIGN KEY'
            AND tc.table_schema = %s
    """
    cursor = conn.cursor()
    cursor.execute(query, (schema,))
    return cursor.fetchall()

Step 4: Generate Semantic Model

Transform metadata into semantic definitions:

entity: Orders
  source: analytics.orders
  description: "Customer order transactions"
  distribution_key: customer_id
  sort_key: order_date

  dimensions:
    - name: order_date
      source_column: order_date
      type: time

    - name: order_status
      source_column: status
      type: categorical

  measures:
    - name: revenue
      calculation: SUM(order_total)
      format: currency

  relationships:
    - name: customer
      target_entity: Customers
      join_type: many_to_one
      join_columns:
        - source: customer_id
          target: customer_id

Step 5: Schedule Synchronization

Maintain metadata currency:

  • Run extraction during maintenance windows
  • Compare against previous extraction
  • Update semantic models incrementally
  • Alert on schema changes

Common Challenges

Cross-Database Metadata

Redshift allows cross-database queries:

Solutions:

  • Extract from each database
  • Use datashare metadata for shared data
  • Aggregate into unified semantic layer

External Table Integration

Spectrum tables have different metadata patterns:

Solutions:

  • Query svv_external_* views
  • Map external types to internal types
  • Handle location and format information

Performance Considerations

Large clusters with many objects:

Solutions:

  • Limit extraction to relevant schemas
  • Use incremental extraction
  • Cache metadata locally
  • Schedule during low-usage periods

The Value of Redshift Metadata Extraction

Automated metadata extraction from Redshift provides:

Comprehensive visibility: Understand all schemas, tables, and relationships.

Performance optimization: Leverage distribution and sort keys.

Relationship discovery: Map foreign keys to semantic joins.

Spectrum integration: Include external data in semantic layer.

Organizations that automate Redshift metadata extraction build semantic layers that generate optimized queries - enabling AI-powered analytics that are both accurate and performant across your data warehouse.

Questions

Redshift provides metadata through PG_CATALOG system tables (PostgreSQL-compatible) and SVV_ views. Key sources include svv_tables, svv_columns, svv_table_info for table metadata, and pg_constraint for relationships. The information_schema views also provide standard SQL metadata access.

Related