Metadata Extraction from Trino: Federated Query Engine Schema Discovery

Learn how to extract metadata from Trino (formerly PrestoSQL) for building semantic layers, including catalog discovery, cross-source schema mapping, and unified metadata for federated analytics.

Metadata extraction from Trino is the process of programmatically reading schema structures, table definitions, and column information across all connected data sources through Trino's federated query engine. As a distributed SQL query engine, Trino provides unified metadata access across diverse sources - from data lakes to traditional databases - enabling semantic layers that span your entire data ecosystem.

This federated metadata forms the foundation for AI-powered analytics, allowing semantic layers to understand and query data regardless of where it physically resides.

Trino Metadata Architecture

Catalog-Schema-Table Hierarchy

Trino organizes data in a three-level hierarchy:

Trino Cluster
└── Catalog (connector to data source)
    └── Schema
        └── Tables, Views

Each catalog represents a connection to a data source through a connector.
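
For example, a fully qualified table reference names all three levels. The catalog, schema, and table names below are illustrative:

```sql
-- Query a table through its full catalog.schema.table path
SELECT customer_id, name
FROM hive.analytics.customers
LIMIT 10;

-- Or set a default catalog and schema for the session
USE hive.analytics;
SELECT customer_id, name FROM customers LIMIT 10;
```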

Available Connectors

Trino connects to diverse sources:

  • Hive/Iceberg/Delta Lake (data lakes)
  • MySQL, PostgreSQL, SQL Server
  • MongoDB, Cassandra, Redis
  • Kafka, Kinesis (streaming)
  • Google Sheets, REST APIs
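
Each catalog is defined by a properties file on the coordinator that names its connector. A minimal PostgreSQL catalog might look like this (host, database, and credentials are placeholders):

```properties
# etc/catalog/postgres.properties
connector.name=postgresql
connection-url=jdbc:postgresql://db-host:5432/operations
connection-user=trino_reader
connection-password=secret
```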

Codd AI Integrations connect to Trino to extract unified metadata across all your data sources, building semantic layers that federate intelligently across your data ecosystem.

Metadata Extraction Queries

Catalog Discovery

Discover available data sources:

-- List all catalogs
SELECT catalog_name
FROM system.metadata.catalogs
ORDER BY catalog_name;

-- Equivalent SHOW command
SHOW CATALOGS;

-- Schemas in a catalog
SELECT schema_name
FROM hive.information_schema.schemata;

-- All schemas across catalogs
SELECT
    table_catalog as catalog_name,
    table_schema as schema_name
FROM system.information_schema.schemata
ORDER BY catalog_name, schema_name;

Table Discovery

Find tables across sources:

-- Tables in a specific catalog schema
SELECT
    table_catalog,
    table_schema,
    table_name,
    table_type
FROM hive.information_schema.tables
WHERE table_schema = 'analytics';

-- Tables across all catalogs
SELECT
    table_catalog,
    table_schema,
    table_name,
    table_type
FROM system.information_schema.tables
WHERE table_schema NOT IN ('information_schema')
ORDER BY table_catalog, table_schema, table_name;

-- Table properties (connector-specific)
SHOW CREATE TABLE hive.analytics.customers;

-- Hive-specific partitions
SELECT * FROM hive.analytics."customers$partitions";

Column Metadata

Extract column information:

-- Columns in a table
SELECT
    column_name,
    ordinal_position,
    data_type,
    is_nullable,
    column_default,
    extra_info
FROM hive.information_schema.columns
WHERE table_schema = 'analytics'
    AND table_name = 'customers'
ORDER BY ordinal_position;

-- Columns across catalogs
SELECT
    table_catalog,
    table_schema,
    table_name,
    column_name,
    data_type,
    is_nullable
FROM system.information_schema.columns
WHERE table_schema NOT IN ('information_schema')
ORDER BY table_catalog, table_schema, table_name, ordinal_position;

-- Column comments (when available)
DESCRIBE hive.analytics.customers;

View Definitions

Capture view logic:

-- View definitions
SELECT
    table_catalog,
    table_schema,
    table_name,
    view_definition
FROM system.information_schema.views
WHERE table_schema = 'analytics';

Connector-Specific Metadata

Extract source-specific information:

-- Hive table statistics
SHOW STATS FOR hive.analytics.customers;

-- Hive table properties
SELECT * FROM hive.analytics."customers$properties";

-- Iceberg table history
SELECT * FROM iceberg.analytics."customers$history";

-- Iceberg snapshots
SELECT * FROM iceberg.analytics."customers$snapshots";

-- Delta Lake history
SELECT * FROM delta.analytics."customers$history";

Building Semantic Models

Cross-Catalog Entity Mapping

Map entities from multiple sources:

# Entity from Hive data lake
entity: Customers
  source: hive.analytics.customers
  description: "Customer master data from data lake"

# Entity from PostgreSQL operational database
entity: Orders
  source: postgres.operations.orders
  description: "Order transactions from operational system"

# Cross-catalog relationship
relationships:
  - name: customer_orders
    from_entity: Customers
    to_entity: Orders
    join:
      - source: customer_id
        target: customer_id
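
At query time, a cross-catalog relationship like this resolves to a single federated SQL statement. Following the mapping above, the join Trino would execute looks like:

```sql
-- Join data-lake customers to operational orders in one query
SELECT
    c.customer_id,
    c.name,
    COUNT(o.order_id) AS order_count
FROM hive.analytics.customers c
LEFT JOIN postgres.operations.orders o
    ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;
```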

Unified Data Types

Normalize types across sources:

Source Type            Trino Type   Semantic Type
Hive STRING            VARCHAR      text
MySQL INT              INTEGER      measure
PostgreSQL TIMESTAMP   TIMESTAMP    time
Iceberg DECIMAL        DECIMAL      measure
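
A type-normalization step like this can be sketched in Python. The mapping rules below are illustrative, not exhaustive:

```python
# Map Trino types to coarse semantic types for the model layer.
# Prefix matching handles parameterized types such as DECIMAL(10,2).
SEMANTIC_TYPES = {
    'varchar': 'text',
    'char': 'text',
    'integer': 'measure',
    'bigint': 'measure',
    'double': 'measure',
    'decimal': 'measure',
    'date': 'time',
    'timestamp': 'time',
}

def normalize_type(trino_type: str) -> str:
    """Return a semantic type for a Trino data type string."""
    base = trino_type.lower()
    for prefix, semantic in SEMANTIC_TYPES.items():
        if base.startswith(prefix):
            return semantic
    return 'unknown'
```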

Federation-Aware Semantics

Design for cross-source queries:

entity: CustomerAnalytics
  description: "Unified customer view across systems"

  sources:
    - catalog: hive
      schema: analytics
      table: customer_events

    - catalog: postgres
      schema: crm
      table: customer_profiles

  # Semantic layer handles federation complexity
  join_strategy: left_outer
  primary_source: hive.analytics.customer_events

Implementation Approach

Step 1: Configure Connection

Set up Trino connectivity:

from trino.dbapi import connect
from trino.auth import BasicAuthentication

conn = connect(
    host='trino-coordinator',
    port=443,
    user='metadata_reader',
    catalog='system',
    schema='information_schema',
    http_scheme='https',
    auth=BasicAuthentication('user', 'password')
)

Step 2: Enumerate Catalogs

Discover all connected sources:

def get_catalogs(conn):
    """List all available catalogs."""
    cursor = conn.cursor()
    cursor.execute("SELECT catalog_name FROM system.metadata.catalogs")
    return [row[0] for row in cursor.fetchall()]

def get_schemas(conn, catalog):
    """List schemas in a catalog."""
    cursor = conn.cursor()
    cursor.execute(f"""
        SELECT schema_name
        FROM {catalog}.information_schema.schemata
        WHERE schema_name NOT IN ('information_schema')
    """)
    return [row[0] for row in cursor.fetchall()]

Step 3: Extract Schema Metadata

Gather comprehensive metadata:

def extract_tables(conn, catalog, schema):
    """Extract table metadata from a catalog schema."""
    cursor = conn.cursor()
    cursor.execute(f"""
        SELECT table_name, table_type
        FROM {catalog}.information_schema.tables
        WHERE table_schema = '{schema}'
    """)
    return cursor.fetchall()

def extract_columns(conn, catalog, schema, table):
    """Extract column metadata."""
    cursor = conn.cursor()
    cursor.execute(f"""
        SELECT column_name, data_type, is_nullable,
               ordinal_position
        FROM {catalog}.information_schema.columns
        WHERE table_schema = '{schema}'
            AND table_name = '{table}'
        ORDER BY ordinal_position
    """)
    return cursor.fetchall()
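
The row tuples returned above can then be folded into a metadata document with a small pure helper. The output shape here is an assumption of this sketch, not a fixed format:

```python
def build_table_doc(catalog, schema, table, column_rows):
    """Assemble column rows of (name, data_type, is_nullable,
    ordinal_position) into a table metadata document."""
    return {
        'qualified_name': f'{catalog}.{schema}.{table}',
        'columns': [
            {
                'name': name,
                'data_type': data_type,
                'nullable': is_nullable == 'YES',
                'position': position,
            }
            # Sort by ordinal_position so column order is stable
            for name, data_type, is_nullable, position
            in sorted(column_rows, key=lambda row: row[3])
        ],
    }
```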

Step 4: Handle Connector Variations

Account for different connectors:

def get_connector_type(catalog_name, conn):
    """Determine connector type for specialized extraction.

    Note: SHOW CREATE CATALOG requires dynamic catalog
    management to be enabled on the cluster.
    """
    cursor = conn.cursor()
    cursor.execute(f"SHOW CREATE CATALOG {catalog_name}")
    definition = cursor.fetchone()[0]

    if 'hive' in definition.lower():
        return 'hive'
    elif 'iceberg' in definition.lower():
        return 'iceberg'
    elif 'postgresql' in definition.lower():
        return 'postgresql'
    # Additional connector detection
    return 'generic'

def extract_partitions(conn, catalog, schema, table, connector_type):
    """Extract partition info if available."""
    if connector_type == 'hive':
        cursor = conn.cursor()
        cursor.execute(f"""
            SELECT * FROM {catalog}.{schema}."{table}$partitions"
        """)
        return cursor.fetchall()
    return None

Step 5: Generate Unified Semantic Model

Create cross-source semantic definitions:

semantic_model:
  name: enterprise_analytics

  catalogs:
    - name: hive
      type: data_lake
      schemas: [analytics, staging]

    - name: postgres
      type: operational
      schemas: [public, crm]

  entities:
    - name: Customers
      source: hive.analytics.customers
      primary_key: customer_id

      dimensions:
        - name: customer_name
          source_column: name
          type: text

        - name: signup_date
          source_column: created_at
          type: time

      measures:
        - name: customer_count
          calculation: COUNT(DISTINCT customer_id)

    - name: Transactions
      source: postgres.operations.transactions
      primary_key: transaction_id

      relationships:
        - name: customer
          target: Customers
          join:
            source: customer_id
            target: customer_id

Common Challenges

Inconsistent Metadata Richness

Different connectors expose varying metadata:

Solutions:

  • Implement connector-specific extraction
  • Supplement with manual documentation
  • Use semantic layer as metadata authority
  • Normalize across sources

Missing Relationships

Data lake sources lack referential integrity:

Solutions:

  • Infer from naming conventions
  • Analyze query patterns
  • Define relationships in semantic layer
  • Document business relationships

Cross-Catalog Performance

Federated queries can be slow:

Solutions:

  • Understand query pushdown capabilities
  • Design semantic models for efficient federation
  • Cache frequently joined data
  • Use materialized views where possible

Schema Evolution

Sources evolve independently:

Solutions:

  • Schedule regular metadata extraction
  • Compare schemas over time
  • Alert on breaking changes
  • Version semantic models
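
Comparing two extraction snapshots can be as simple as a set difference over column signatures. The snapshot format here, a hypothetical {column_name: data_type} mapping, is an assumption of this sketch:

```python
def diff_schemas(old, new):
    """Compare two {column_name: data_type} snapshots of a table.

    Returns added, removed, and type-changed columns so a
    scheduled job can alert on breaking changes.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(
        name for name in set(old) & set(new)
        if old[name] != new[name]
    )
    return {'added': added, 'removed': removed, 'changed': changed}
```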

The Value of Trino Metadata Extraction

Automated metadata extraction from Trino delivers:

Unified visibility: See all data sources through one interface.

Federation intelligence: Build semantic layers that span sources.

Source transparency: Hide complexity while preserving context.

Modern architecture: Enable lakehouse and multi-source analytics.

Organizations that automate Trino metadata extraction build semantic layers that truly unify their data ecosystem - enabling AI-powered analytics that seamlessly query across data lakes, databases, and streaming sources.

Questions

How does Trino expose metadata from its connected sources?

Trino federates across multiple data sources through connectors. Each connector exposes its source's schema through Trino's metadata interface. The information_schema provides unified access, but metadata richness varies by connector - some provide detailed statistics while others offer only basic schema information.
