Metadata Extraction from PostgreSQL: Schema Discovery for Semantic Layers

Learn how to extract metadata from PostgreSQL databases for building semantic layers, including catalog queries, relationship discovery, and leveraging PostgreSQL's rich constraint system.

7 min read·

Metadata extraction from PostgreSQL is the process of programmatically reading schema structures, table definitions, column information, and relationships from your PostgreSQL database to build intelligent semantic layers. PostgreSQL provides one of the richest system catalogs available, with comprehensive information about tables, columns, constraints, and documentation that enables sophisticated semantic model generation.

This metadata forms the foundation for AI-powered analytics, allowing semantic layers to understand your database structure and generate accurate, optimized queries.

PostgreSQL Metadata Architecture

System Catalog Structure

PostgreSQL organizes metadata in pg_catalog:

Core catalog tables:

  • pg_class - tables, views, indexes, sequences
  • pg_attribute - columns
  • pg_type - data types
  • pg_namespace - schemas
  • pg_constraint - constraints
  • pg_description - comments
  • pg_index - index information

Information schema:

  • Standard SQL interface to metadata
  • More portable but less detailed
  • Good for cross-database compatibility

Metadata Richness

PostgreSQL stores extensive metadata:

  • Schema organization
  • Table and view definitions
  • Column types with full precision
  • Primary, foreign, unique, and check constraints
  • Indexes and their configurations
  • Comments at all object levels
  • Partition information
  • Inheritance relationships

Codd AI Integrations connect to PostgreSQL system catalogs to extract this rich metadata, building semantic layers that fully understand your database structure.

Metadata Extraction Queries

Schema and Table Discovery

Discover database structure:

-- List all schemas
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name NOT IN ('pg_catalog', 'information_schema', 'pg_toast');

-- Discover tables and views
SELECT
    table_schema,
    table_name,
    table_type,
    obj_description(
        (quote_ident(table_schema) || '.' || quote_ident(table_name))::regclass
    ) as table_comment
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Table sizes and row estimates
SELECT
    schemaname,
    relname as table_name,
    n_live_tup as row_estimate,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size
FROM pg_stat_user_tables
WHERE schemaname = 'public';

Column Metadata

Extract detailed column information:

-- Column definitions with comments
SELECT
    c.table_schema,
    c.table_name,
    c.column_name,
    c.ordinal_position,
    c.data_type,
    c.character_maximum_length,
    c.numeric_precision,
    c.numeric_scale,
    c.is_nullable,
    c.column_default,
    col_description(
        (c.table_schema || '.' || c.table_name)::regclass,
        c.ordinal_position
    ) as column_comment
FROM information_schema.columns c
WHERE c.table_schema = 'public'
ORDER BY c.table_name, c.ordinal_position;

-- Columns using pg_catalog for more detail
SELECT
    n.nspname as schema_name,
    c.relname as table_name,
    a.attname as column_name,
    a.attnum as position,
    pg_catalog.format_type(a.atttypid, a.atttypmod) as data_type,
    a.attnotnull as not_null,
    pg_get_expr(d.adbin, d.adrelid) as default_value
FROM pg_catalog.pg_attribute a
JOIN pg_catalog.pg_class c ON a.attrelid = c.oid
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
LEFT JOIN pg_catalog.pg_attrdef d ON a.attrelid = d.adrelid AND a.attnum = d.adnum
WHERE n.nspname = 'public'
    AND a.attnum > 0
    AND NOT a.attisdropped
ORDER BY c.relname, a.attnum;

Constraint Discovery

Extract relationship information:

-- Primary keys
SELECT
    tc.table_schema,
    tc.table_name,
    kcu.column_name,
    tc.constraint_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
    ON tc.constraint_name = kcu.constraint_name
    AND tc.table_schema = kcu.table_schema
WHERE tc.constraint_type = 'PRIMARY KEY'
    AND tc.table_schema = 'public';

-- Foreign keys with referenced tables
SELECT
    tc.table_schema,
    tc.table_name,
    kcu.column_name,
    ccu.table_schema AS foreign_table_schema,
    ccu.table_name AS foreign_table_name,
    ccu.column_name AS foreign_column_name,
    tc.constraint_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
    ON tc.constraint_name = kcu.constraint_name
    AND tc.table_schema = kcu.table_schema
JOIN information_schema.constraint_column_usage ccu
    ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
    AND tc.table_schema = 'public';

-- Unique constraints
SELECT
    tc.table_name,
    kcu.column_name,
    tc.constraint_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
    ON tc.constraint_name = kcu.constraint_name
WHERE tc.constraint_type = 'UNIQUE'
    AND tc.table_schema = 'public';

Index Information

Capture index metadata:

-- Index definitions
SELECT
    schemaname,
    tablename,
    indexname,
    indexdef
FROM pg_indexes
WHERE schemaname = 'public';

-- Detailed index columns
SELECT
    i.relname as index_name,
    t.relname as table_name,
    a.attname as column_name,
    ix.indisunique as is_unique,
    ix.indisprimary as is_primary
FROM pg_class t
JOIN pg_index ix ON t.oid = ix.indrelid
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_attribute a ON a.attrelid = t.oid
    AND a.attnum = ANY(ix.indkey)
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE n.nspname = 'public';

Partition Information

Handle partitioned tables:

-- Partitioned tables
SELECT
    n.nspname as schema_name,
    c.relname as table_name,
    pg_get_partkeydef(c.oid) as partition_key
FROM pg_partitioned_table pt
JOIN pg_class c ON pt.partrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid;

-- Partition children
SELECT
    parent.relname as parent_table,
    child.relname as partition_name,
    pg_get_expr(child.relpartbound, child.oid) as partition_bounds
FROM pg_inherits
JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
JOIN pg_class child ON pg_inherits.inhrelid = child.oid
JOIN pg_namespace n ON parent.relnamespace = n.oid
WHERE n.nspname = 'public';

Building Semantic Models

Table to Entity Mapping

Transform PostgreSQL metadata into semantic concepts:

Tables become entities:

  • Primary key defines entity identity
  • Table comment provides description
  • Partition parent represents logical entity

Columns map to attributes:

  • text, varchar become text dimensions
  • integer, bigint, numeric become measures
  • date, timestamp become time dimensions
  • boolean becomes categorical dimension

Relationship Mapping

Leverage PostgreSQL's rich constraint system:

Foreign keys directly map to semantic relationships:

entity: Orders
  relationships:
    - name: customer
      type: many_to_one
      target: Customers
      # Derived from foreign key constraint
      join:
        source_column: customer_id
        target_column: id

Unique constraints indicate dimension tables:

  • Tables with unique keys are often dimensions
  • Natural keys suggest business identifiers

Comment Utilization

Leverage PostgreSQL comments:

entity: Customers
  # From table comment
  description: "Active customer accounts with subscription status"

  attributes:
    - name: email
      source_column: email
      # From column comment
      description: "Primary contact email, validated format"

Implementation Guide

Step 1: Configure Connection

Set up secure database access:

import psycopg2

conn = psycopg2.connect(
    host='localhost',
    port=5432,
    database='analytics',
    user='metadata_reader',
    password='secure_password'
)

Create a dedicated user with minimal permissions:

CREATE USER metadata_reader WITH PASSWORD 'secure_password';
GRANT USAGE ON SCHEMA public TO metadata_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO metadata_reader;

Step 2: Extract Comprehensive Metadata

Build complete schema picture:

def extract_schema_metadata(conn, schema='public'):
    """Extract complete schema metadata."""
    metadata = {
        'tables': extract_tables(conn, schema),
        'columns': extract_columns(conn, schema),
        'primary_keys': extract_primary_keys(conn, schema),
        'foreign_keys': extract_foreign_keys(conn, schema),
        'indexes': extract_indexes(conn, schema)
    }
    return metadata

def extract_tables(conn, schema):
    query = """
        SELECT table_name, table_type,
               obj_description((quote_ident(%s) || '.' ||
                   quote_ident(table_name))::regclass) as comment
        FROM information_schema.tables
        WHERE table_schema = %s
    """
    cursor = conn.cursor()
    cursor.execute(query, (schema, schema))
    return cursor.fetchall()

Step 3: Handle PostgreSQL-Specific Features

Account for PostgreSQL capabilities:

Array types:

# Detect array columns
if data_type.endswith('[]'):
    # Handle as multi-value attribute
    pass

JSON/JSONB columns:

# Identify JSON columns for special handling
if data_type in ('json', 'jsonb'):
    # May need path-based attribute extraction
    pass

Step 4: Generate Semantic Model

Transform to semantic definitions:

entity: Customers
  source: public.customers
  description: "Active customer accounts with subscription status"
  primary_key: id

  dimensions:
    - name: customer_name
      source_column: name
      type: text
      description: "Full legal name"

    - name: signup_date
      source_column: created_at
      type: time
      grain: day

  measures:
    - name: customer_count
      calculation: COUNT(DISTINCT id)

  relationships:
    - name: orders
      target: Orders
      type: one_to_many
      join_column: id
      foreign_column: customer_id

Step 5: Maintain Synchronization

Keep semantic layer current:

  • Monitor pg_stat_activity for DDL changes
  • Compare schema versions periodically
  • Update semantic models on change detection
  • Validate queries after updates

Common Challenges

Schema Proliferation

Many schemas with overlapping tables:

Solutions:

  • Focus on production schemas
  • Document schema purposes
  • Create unified semantic models

Missing Documentation

Limited table and column comments:

Solutions:

  • Encourage comment creation
  • Generate AI-suggested descriptions
  • Layer documentation in semantic model

Complex Data Types

PostgreSQL's rich type system:

Solutions:

  • Map composite types to structured attributes
  • Handle arrays appropriately
  • Document JSON path patterns

The Value of PostgreSQL Metadata Extraction

Automated metadata extraction from PostgreSQL delivers:

Relationship richness: Leverage PostgreSQL's constraint system.

Documentation integration: Use comments for semantic descriptions.

Type intelligence: Map PostgreSQL's rich types accurately.

Complete visibility: Understand all schemas and objects.

Organizations that automate PostgreSQL metadata extraction build semantic layers that fully leverage their database's relational structure - enabling AI-powered analytics that understand and respect data relationships.

Questions

The pg_catalog schema contains essential tables including pg_class (tables/views), pg_attribute (columns), pg_constraint (constraints), pg_namespace (schemas), and pg_description (comments). The information_schema provides a more SQL-standard interface to similar information.

Related