Metadata Extraction from Trino: Federated Query Engine Schema Discovery
Learn how to extract metadata from Trino (formerly PrestoSQL) for building semantic layers, including catalog discovery, cross-source schema mapping, and unified metadata for federated analytics.
Metadata extraction from Trino is the process of programmatically reading schema structures, table definitions, and column information across all connected data sources through Trino's federated query engine. As a distributed SQL query engine, Trino provides unified metadata access across diverse sources - from data lakes to traditional databases - enabling semantic layers that span your entire data ecosystem.
This federated metadata forms the foundation for AI-powered analytics, allowing semantic layers to understand and query data regardless of where it physically resides.
Trino Metadata Architecture
Catalog-Schema-Table Hierarchy
Trino organizes data in a three-level hierarchy:
Trino Cluster
└── Catalog (connector to a data source)
    └── Schema
        └── Tables, Views
Each catalog represents a connection to a data source through a connector.
Available Connectors
Trino connects to diverse sources:
- Hive/Iceberg/Delta Lake (data lakes)
- MySQL, PostgreSQL, SQL Server
- MongoDB, Cassandra, Redis
- Kafka, Kinesis (streaming)
- Google Sheets, REST APIs
Codd AI Integrations connect to Trino to extract unified metadata across all your data sources, building semantic layers that federate intelligently across your data ecosystem.
Metadata Extraction Queries
Catalog Discovery
Discover available data sources:
-- List all catalogs
SELECT catalog_name
FROM system.metadata.catalogs
ORDER BY catalog_name;

-- Equivalent shorthand
SHOW CATALOGS;

-- Schemas in a catalog
SELECT schema_name
FROM hive.information_schema.schemata;

-- All schemas across catalogs (information_schema is per-catalog;
-- the system.jdbc tables federate metadata from every catalog)
SELECT
  table_catalog AS catalog_name,
  table_schem AS schema_name
FROM system.jdbc.schemas
ORDER BY catalog_name, schema_name;
Table Discovery
Find tables across sources:
-- Tables in a specific catalog schema
SELECT
  table_catalog,
  table_schema,
  table_name,
  table_type
FROM hive.information_schema.tables
WHERE table_schema = 'analytics';
-- Tables across all catalogs (information_schema is per-catalog;
-- system.jdbc.tables spans every catalog, with JDBC-style column names)
SELECT
  table_cat,
  table_schem,
  table_name,
  table_type
FROM system.jdbc.tables
WHERE table_schem NOT IN ('information_schema')
ORDER BY table_cat, table_schem, table_name;
-- Table properties (connector-specific)
SHOW CREATE TABLE hive.analytics.customers;
-- Hive-specific partitions
SELECT * FROM hive.analytics."customers$partitions";
Column Metadata
Extract column information:
-- Columns in a table
SELECT
  column_name,
  ordinal_position,
  data_type,
  is_nullable,
  column_default
FROM hive.information_schema.columns
WHERE table_schema = 'analytics'
  AND table_name = 'customers'
ORDER BY ordinal_position;
-- Columns across catalogs (via system.jdbc, which federates metadata
-- from every catalog; type_name holds the Trino type string)
SELECT
  table_cat,
  table_schem,
  table_name,
  column_name,
  type_name,
  is_nullable
FROM system.jdbc.columns
WHERE table_schem NOT IN ('information_schema')
ORDER BY table_cat, table_schem, table_name, ordinal_position;
-- Column comments (when available)
DESCRIBE hive.analytics.customers;
View Definitions
Capture view logic:
-- View definitions (information_schema.views is per-catalog,
-- so query each catalog of interest)
SELECT
  table_catalog,
  table_schema,
  table_name,
  view_definition
FROM hive.information_schema.views
WHERE table_schema = 'analytics';
Connector-Specific Metadata
Extract source-specific information:
-- Hive table statistics
SHOW STATS FOR hive.analytics.customers;
-- Hive table properties
SELECT * FROM hive.analytics."customers$properties";
-- Iceberg table history
SELECT * FROM iceberg.analytics."customers$history";
-- Iceberg snapshots
SELECT * FROM iceberg.analytics."customers$snapshots";
-- Delta Lake history
SELECT * FROM delta.analytics."customers$history";
Building Semantic Models
Cross-Catalog Entity Mapping
Map entities from multiple sources:
# Entity from Hive data lake
entity: Customers
source: hive.analytics.customers
description: "Customer master data from data lake"

# Entity from PostgreSQL operational database
entity: Orders
source: postgres.operations.orders
description: "Order transactions from operational system"

# Cross-catalog relationship
relationships:
  - name: customer_orders
    from_entity: Customers
    to_entity: Orders
    join:
      - source: customer_id
        target: customer_id
Unified Data Types
Normalize types across sources:
| Source Type | Trino Type | Semantic Type |
|---|---|---|
| Hive STRING | VARCHAR | text |
| MySQL INT | INTEGER | measure |
| PostgreSQL TIMESTAMP | TIMESTAMP | time |
| Iceberg DECIMAL | DECIMAL | measure |
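The table above can be encoded directly in the extraction pipeline. A minimal sketch, with an illustrative mapping (the dictionary and function names are ours, not from any library), that folds Trino type strings into semantic categories:

```python
# Illustrative mapping from Trino base type names to semantic categories.
SEMANTIC_TYPES = {
    "varchar": "text",
    "char": "text",
    "integer": "measure",
    "bigint": "measure",
    "double": "measure",
    "decimal": "measure",
    "timestamp": "time",
    "date": "time",
}

def normalize_type(trino_type: str) -> str:
    """Map a Trino type string (e.g. 'varchar(255)', 'decimal(10,2)')
    to a semantic category, defaulting to 'attribute'."""
    base = trino_type.lower().split("(")[0].strip()
    return SEMANTIC_TYPES.get(base, "attribute")
```

Stripping the parenthesized precision first means `decimal(10,2)` and `timestamp(3)` normalize the same way regardless of source-specific parameters.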
Federation-Aware Semantics
Design for cross-source queries:
entity: CustomerAnalytics
description: "Unified customer view across systems"
sources:
  - catalog: hive
    schema: analytics
    table: customer_events
  - catalog: postgres
    schema: crm
    table: customer_profiles
# Semantic layer handles federation complexity
join_strategy: left_outer
primary_source: hive.analytics.customer_events
Implementation Approach
Step 1: Configure Connection
Set up Trino connectivity:
from trino.dbapi import connect
from trino.auth import BasicAuthentication

conn = connect(
    host='trino-coordinator',
    port=443,
    user='metadata_reader',
    catalog='system',
    schema='information_schema',
    http_scheme='https',
    auth=BasicAuthentication('metadata_reader', 'password'),
)
Step 2: Enumerate Catalogs
Discover all connected sources:
def get_catalogs(conn):
    """List all available catalogs."""
    cursor = conn.cursor()
    cursor.execute("SELECT catalog_name FROM system.metadata.catalogs")
    return [row[0] for row in cursor.fetchall()]

def get_schemas(conn, catalog):
    """List schemas in a catalog."""
    cursor = conn.cursor()
    cursor.execute(f"""
        SELECT schema_name
        FROM {catalog}.information_schema.schemata
        WHERE schema_name NOT IN ('information_schema')
    """)
    return [row[0] for row in cursor.fetchall()]
Step 3: Extract Schema Metadata
Gather comprehensive metadata:
def extract_tables(conn, catalog, schema):
    """Extract table metadata from a catalog schema.

    The catalog is an identifier and must be interpolated; values are
    passed as query parameters to avoid SQL injection.
    """
    cursor = conn.cursor()
    cursor.execute(
        f"""
        SELECT table_name, table_type
        FROM {catalog}.information_schema.tables
        WHERE table_schema = ?
        """,
        (schema,),
    )
    return cursor.fetchall()

def extract_columns(conn, catalog, schema, table):
    """Extract column metadata."""
    cursor = conn.cursor()
    cursor.execute(
        f"""
        SELECT column_name, data_type, is_nullable, ordinal_position
        FROM {catalog}.information_schema.columns
        WHERE table_schema = ? AND table_name = ?
        ORDER BY ordinal_position
        """,
        (schema, table),
    )
    return cursor.fetchall()
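The per-level helpers above can be stitched into a single pass. A hedged sketch (the function name is ours) that folds flat (catalog, schema, table, column, type) rows into one nested dictionary, the shape a semantic-model generator can consume:

```python
def build_metadata_tree(rows):
    """Fold flat (catalog, schema, table, column, data_type) rows into
    {catalog: {schema: {table: {column: data_type}}}}."""
    tree = {}
    for catalog, schema, table, column, data_type in rows:
        # setdefault creates each intermediate level exactly once
        tree.setdefault(catalog, {}) \
            .setdefault(schema, {}) \
            .setdefault(table, {})[column] = data_type
    return tree
```

Feeding it the rows returned by the column-extraction queries yields one tree spanning every connected source.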
Step 4: Handle Connector Variations
Account for different connectors:
def get_connector_type(conn, catalog_name):
    """Determine connector type for specialized extraction.

    system.metadata.catalogs exposes connector_name directly, which is
    more reliable than parsing catalog definitions.
    """
    cursor = conn.cursor()
    cursor.execute(
        "SELECT connector_name FROM system.metadata.catalogs WHERE catalog_name = ?",
        (catalog_name,),
    )
    row = cursor.fetchone()
    return row[0] if row else 'generic'

def extract_partitions(conn, catalog, schema, table, connector_type):
    """Extract partition info if available."""
    if connector_type == 'hive':
        cursor = conn.cursor()
        cursor.execute(f"""
            SELECT * FROM {catalog}.{schema}."{table}$partitions"
        """)
        return cursor.fetchall()
    return None
Step 5: Generate Unified Semantic Model
Create cross-source semantic definitions:
semantic_model:
  name: enterprise_analytics
  catalogs:
    - name: hive
      type: data_lake
      schemas: [analytics, staging]
    - name: postgres
      type: operational
      schemas: [public, crm]
  entities:
    - name: Customers
      source: hive.analytics.customers
      primary_key: customer_id
      dimensions:
        - name: customer_name
          source_column: name
          type: text
        - name: signup_date
          source_column: created_at
          type: time
      measures:
        - name: customer_count
          calculation: COUNT(DISTINCT customer_id)
    - name: Transactions
      source: postgres.operations.transactions
      primary_key: transaction_id
      relationships:
        - name: customer
          target: Customers
          join:
            source: customer_id
            target: customer_id
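A definition like this can be drafted automatically from the extracted metadata. A minimal sketch, assuming the convention that a `<singular>_id` column marks the primary key (the function name and output shape are illustrative, not a standard format):

```python
def to_entity(catalog, schema, table, columns):
    """Draft a semantic-layer entity dict from extracted column metadata.

    columns: list of (name, data_type) tuples. A column matching the
    table's singular name plus '_id' is guessed as the primary key.
    """
    entity = {
        "name": table.title(),
        "source": f"{catalog}.{schema}.{table}",
        "columns": [{"name": n, "type": t} for n, t in columns],
    }
    pk_guess = table.rstrip("s") + "_id"   # customers -> customer_id
    if any(n == pk_guess for n, _ in columns):
        entity["primary_key"] = pk_guess
    return entity
```

Drafts like this still need human review: the naming heuristic is a starting point, not an authority.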
Common Challenges
Inconsistent Metadata Richness
Different connectors expose varying metadata:
Solutions:
- Implement connector-specific extraction
- Supplement with manual documentation
- Use semantic layer as metadata authority
- Normalize across sources
Missing Relationships
Data lake sources lack referential integrity:
Solutions:
- Infer from naming conventions
- Analyze query patterns
- Define relationships in semantic layer
- Document business relationships
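The first of those solutions can be sketched in a few lines. A hedged heuristic (our own, not part of Trino) that guesses foreign-key links from the common `<singular>_id` naming convention:

```python
def infer_relationships(tables):
    """Guess foreign-key links from naming conventions.

    tables: {table_name: set_of_column_names}. A column 'customer_id'
    in 'orders' is assumed to reference table 'customers' if that table
    exists and carries the same column.
    """
    links = []
    for table, columns in tables.items():
        for col in columns:
            if not col.endswith("_id"):
                continue
            target = col[:-3] + "s"          # customer_id -> customers
            if target != table and target in tables and col in tables[target]:
                links.append((table, col, target))
    return links
```

Inferred links should be confirmed against business documentation before they become semantic-layer relationships; naming conventions are suggestive, not authoritative.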
Cross-Catalog Performance
Federated queries can be slow:
Solutions:
- Understand query pushdown capabilities
- Design semantic models for efficient federation
- Cache frequently joined data
- Use materialized views where possible
Schema Evolution
Sources evolve independently:
Solutions:
- Schedule regular metadata extraction
- Compare schemas over time
- Alert on breaking changes
- Version semantic models
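Comparing schemas over time reduces to diffing two snapshots. A minimal sketch (the function name and report shape are ours):

```python
def diff_schema(old, new):
    """Compare two column snapshots ({column: data_type}) and report
    added, removed, and type-changed columns."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {"added": added, "removed": removed, "changed": changed}
```

Removed and type-changed columns are the breaking cases worth alerting on; additions are usually safe.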
The Value of Trino Metadata Extraction
Automated metadata extraction from Trino delivers:
Unified visibility: See all data sources through one interface.
Federation intelligence: Build semantic layers that span sources.
Source transparency: Hide complexity while preserving context.
Modern architecture: Enable lakehouse and multi-source analytics.
Organizations that automate Trino metadata extraction build semantic layers that truly unify their data ecosystem - enabling AI-powered analytics that seamlessly query across data lakes, databases, and streaming sources.
Questions
How does Trino expose metadata across different data sources?
Trino federates across multiple data sources through connectors. Each connector exposes its source's schema through Trino's metadata interface. The information_schema provides unified access, but metadata richness varies by connector - some provide detailed statistics while others offer only basic schema information.