Metadata Extraction from PostgreSQL: Schema Discovery for Semantic Layers
Learn how to extract metadata from PostgreSQL databases for building semantic layers, including catalog queries, relationship discovery, and leveraging PostgreSQL's rich constraint system.
Metadata extraction from PostgreSQL is the process of programmatically reading schema structures, table definitions, column information, and relationships from your PostgreSQL database to build intelligent semantic layers. PostgreSQL provides one of the richest system catalogs available, with comprehensive information about tables, columns, constraints, and documentation that enables sophisticated semantic model generation.
This metadata forms the foundation for AI-powered analytics, allowing semantic layers to understand your database structure and generate accurate, optimized queries.
PostgreSQL Metadata Architecture
System Catalog Structure
PostgreSQL organizes metadata in pg_catalog:
Core catalog tables:
- pg_class - tables, views, indexes, sequences
- pg_attribute - columns
- pg_type - data types
- pg_namespace - schemas
- pg_constraint - constraints
- pg_description - comments
- pg_index - index information
Information schema:
- Standard SQL interface to metadata
- More portable but less detailed
- Good for cross-database compatibility
Metadata Richness
PostgreSQL stores extensive metadata:
- Schema organization
- Table and view definitions
- Column types with full precision
- Primary, foreign, unique, and check constraints
- Indexes and their configurations
- Comments at all object levels
- Partition information
- Inheritance relationships
Codd AI Integrations connect to PostgreSQL system catalogs to extract this rich metadata, building semantic layers that fully understand your database structure.
Metadata Extraction Queries
Schema and Table Discovery
Discover database structure:
-- List all schemas
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name NOT IN ('pg_catalog', 'information_schema', 'pg_toast');
-- Discover tables and views
SELECT
table_schema,
table_name,
table_type,
obj_description(
(quote_ident(table_schema) || '.' || quote_ident(table_name))::regclass
) as table_comment
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;
-- Table sizes and row estimates
SELECT
schemaname,
relname as table_name,
n_live_tup as row_estimate,
pg_size_pretty(pg_total_relation_size(relid)) as total_size
FROM pg_stat_user_tables
WHERE schemaname = 'public';
Column Metadata
Extract detailed column information:
-- Column definitions with comments
SELECT
c.table_schema,
c.table_name,
c.column_name,
c.ordinal_position,
c.data_type,
c.character_maximum_length,
c.numeric_precision,
c.numeric_scale,
c.is_nullable,
c.column_default,
col_description(
(c.table_schema || '.' || c.table_name)::regclass,
c.ordinal_position
) as column_comment
FROM information_schema.columns c
WHERE c.table_schema = 'public'
ORDER BY c.table_name, c.ordinal_position;
-- Columns using pg_catalog for more detail
SELECT
n.nspname as schema_name,
c.relname as table_name,
a.attname as column_name,
a.attnum as position,
pg_catalog.format_type(a.atttypid, a.atttypmod) as data_type,
a.attnotnull as not_null,
pg_get_expr(d.adbin, d.adrelid) as default_value
FROM pg_catalog.pg_attribute a
JOIN pg_catalog.pg_class c ON a.attrelid = c.oid
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
LEFT JOIN pg_catalog.pg_attrdef d ON a.attrelid = d.adrelid AND a.attnum = d.adnum
WHERE n.nspname = 'public'
AND a.attnum > 0
AND NOT a.attisdropped
ORDER BY c.relname, a.attnum;
Constraint Discovery
Extract relationship information:
-- Primary keys
SELECT
tc.table_schema,
tc.table_name,
kcu.column_name,
tc.constraint_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
AND tc.table_schema = kcu.table_schema
WHERE tc.constraint_type = 'PRIMARY KEY'
AND tc.table_schema = 'public';
-- Foreign keys with referenced tables
SELECT
tc.table_schema,
tc.table_name,
kcu.column_name,
ccu.table_schema AS foreign_table_schema,
ccu.table_name AS foreign_table_name,
ccu.column_name AS foreign_column_name,
tc.constraint_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
AND tc.table_schema = kcu.table_schema
JOIN information_schema.constraint_column_usage ccu
ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
AND tc.table_schema = 'public';
-- Unique constraints
SELECT
tc.table_name,
kcu.column_name,
tc.constraint_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
WHERE tc.constraint_type = 'UNIQUE'
AND tc.table_schema = 'public';
Index Information
Capture index metadata:
-- Index definitions
SELECT
schemaname,
tablename,
indexname,
indexdef
FROM pg_indexes
WHERE schemaname = 'public';
-- Detailed index columns
SELECT
i.relname as index_name,
t.relname as table_name,
a.attname as column_name,
ix.indisunique as is_unique,
ix.indisprimary as is_primary
FROM pg_class t
JOIN pg_index ix ON t.oid = ix.indrelid
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_attribute a ON a.attrelid = t.oid
AND a.attnum = ANY(ix.indkey)
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE n.nspname = 'public';
Partition Information
Handle partitioned tables:
-- Partitioned tables
SELECT
n.nspname as schema_name,
c.relname as table_name,
pg_get_partkeydef(c.oid) as partition_key
FROM pg_partitioned_table pt
JOIN pg_class c ON pt.partrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid;
-- Partition children
SELECT
parent.relname as parent_table,
child.relname as partition_name,
pg_get_expr(child.relpartbound, child.oid) as partition_bounds
FROM pg_inherits
JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
JOIN pg_class child ON pg_inherits.inhrelid = child.oid
JOIN pg_namespace n ON parent.relnamespace = n.oid
WHERE n.nspname = 'public';
Building Semantic Models
Table to Entity Mapping
Transform PostgreSQL metadata into semantic concepts:
Tables become entities:
- Primary key defines entity identity
- Table comment provides description
- Partition parent represents logical entity
Columns map to attributes:
- text, varchar become text dimensions
- integer, bigint, numeric become measures
- date, timestamp become time dimensions
- boolean becomes categorical dimension
Relationship Mapping
Leverage PostgreSQL's rich constraint system:
Foreign keys directly map to semantic relationships:
entity: Orders
relationships:
- name: customer
type: many_to_one
target: Customers
# Derived from foreign key constraint
join:
source_column: customer_id
target_column: id
Unique constraints indicate dimension tables:
- Tables with unique keys are often dimensions
- Natural keys suggest business identifiers
Comment Utilization
Leverage PostgreSQL comments:
entity: Customers
# From table comment
description: "Active customer accounts with subscription status"
attributes:
- name: email
source_column: email
# From column comment
description: "Primary contact email, validated format"
Implementation Guide
Step 1: Configure Connection
Set up secure database access:
import psycopg2
conn = psycopg2.connect(
host='localhost',
port=5432,
database='analytics',
user='metadata_reader',
password='secure_password'
)
Create a dedicated user with minimal permissions:
CREATE USER metadata_reader WITH PASSWORD 'secure_password';
GRANT USAGE ON SCHEMA public TO metadata_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO metadata_reader;
Step 2: Extract Comprehensive Metadata
Build complete schema picture:
def extract_schema_metadata(conn, schema='public'):
"""Extract complete schema metadata."""
metadata = {
'tables': extract_tables(conn, schema),
'columns': extract_columns(conn, schema),
'primary_keys': extract_primary_keys(conn, schema),
'foreign_keys': extract_foreign_keys(conn, schema),
'indexes': extract_indexes(conn, schema)
}
return metadata
def extract_tables(conn, schema):
query = """
SELECT table_name, table_type,
obj_description((quote_ident(%s) || '.' ||
quote_ident(table_name))::regclass) as comment
FROM information_schema.tables
WHERE table_schema = %s
"""
cursor = conn.cursor()
cursor.execute(query, (schema, schema))
return cursor.fetchall()
Step 3: Handle PostgreSQL-Specific Features
Account for PostgreSQL capabilities:
Array types:
# Detect array columns
if data_type.endswith('[]'):
# Handle as multi-value attribute
pass
JSON/JSONB columns:
# Identify JSON columns for special handling
if data_type in ('json', 'jsonb'):
# May need path-based attribute extraction
pass
Step 4: Generate Semantic Model
Transform to semantic definitions:
entity: Customers
source: public.customers
description: "Active customer accounts with subscription status"
primary_key: id
dimensions:
- name: customer_name
source_column: name
type: text
description: "Full legal name"
- name: signup_date
source_column: created_at
type: time
grain: day
measures:
- name: customer_count
calculation: COUNT(DISTINCT id)
relationships:
- name: orders
target: Orders
type: one_to_many
join_column: id
foreign_column: customer_id
Step 5: Maintain Synchronization
Keep semantic layer current:
- Monitor pg_stat_activity for DDL changes
- Compare schema versions periodically
- Update semantic models on change detection
- Validate queries after updates
Common Challenges
Schema Proliferation
Many schemas with overlapping tables:
Solutions:
- Focus on production schemas
- Document schema purposes
- Create unified semantic models
Missing Documentation
Limited table and column comments:
Solutions:
- Encourage comment creation
- Generate AI-suggested descriptions
- Layer documentation in semantic model
Complex Data Types
PostgreSQL's rich type system:
Solutions:
- Map composite types to structured attributes
- Handle arrays appropriately
- Document JSON path patterns
The Value of PostgreSQL Metadata Extraction
Automated metadata extraction from PostgreSQL delivers:
Relationship richness: Leverage PostgreSQL's constraint system.
Documentation integration: Use comments for semantic descriptions.
Type intelligence: Map PostgreSQL's rich types accurately.
Complete visibility: Understand all schemas and objects.
Organizations that automate PostgreSQL metadata extraction build semantic layers that fully leverage their database's relational structure - enabling AI-powered analytics that understand and respect data relationships.
Questions
The pg_catalog schema contains essential tables including pg_class (tables/views), pg_attribute (columns), pg_constraint (constraints), pg_namespace (schemas), and pg_description (comments). The information_schema provides a more SQL-standard interface to similar information.