Metadata Extraction from Dremio: Data Lakehouse Schema Discovery for Semantic Layers

Learn how to extract metadata from Dremio for building semantic layers, including catalog discovery, data reflections, and unified metadata for self-service analytics.

Metadata extraction from Dremio is the process of programmatically reading schema structures, dataset definitions, and catalog information from Dremio's data lakehouse platform to build intelligent semantic layers. Dremio provides a unified catalog over diverse data sources - from object storage to databases - enabling semantic layers that leverage virtualized, accelerated data access.

This metadata forms the foundation for AI-powered analytics, allowing semantic layers to understand Dremio's curated data landscape and generate optimized queries.

Dremio Metadata Architecture

Catalog Organization

Dremio organizes data in a hierarchical catalog:

Dremio Catalog
├── Sources (connections to data)
│   ├── Data lakes (S3, ADLS, HDFS)
│   ├── Databases (PostgreSQL, MySQL)
│   └── File systems
├── Spaces (organizational folders)
│   └── Virtual Datasets (views)
└── Home folders (user workspaces)

Dataset Types

Dremio manages different dataset types:

  • Physical datasets: Direct access to source data
  • Virtual datasets: Views with transformations applied
  • Promoted datasets: Files promoted to tables
  • Reflections: Accelerated materializations

Codd AI Integrations connect to Dremio's catalog to extract comprehensive metadata, building semantic layers that leverage Dremio's data virtualization capabilities.

Metadata Extraction Methods

INFORMATION_SCHEMA Queries

Standard SQL metadata access:

-- List all catalogs (sources and spaces)
SELECT CATALOG_NAME, CATALOG_DESCRIPTION
FROM INFORMATION_SCHEMA.CATALOGS;

-- Discover schemas
SELECT
    CATALOG_NAME,
    SCHEMA_NAME,
    SCHEMA_OWNER
FROM INFORMATION_SCHEMA.SCHEMATA
ORDER BY CATALOG_NAME, SCHEMA_NAME;

-- List all tables and views
SELECT
    TABLE_CATALOG,
    TABLE_SCHEMA,
    TABLE_NAME,
    TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA NOT LIKE 'sys%'
ORDER BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME;

-- Column metadata
SELECT
    TABLE_CATALOG,
    TABLE_SCHEMA,
    TABLE_NAME,
    COLUMN_NAME,
    ORDINAL_POSITION,
    DATA_TYPE,
    IS_NULLABLE,
    CHARACTER_MAXIMUM_LENGTH,
    NUMERIC_PRECISION,
    NUMERIC_SCALE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'analytics'
ORDER BY TABLE_NAME, ORDINAL_POSITION;

System Tables

Access Dremio-specific metadata:

-- Dataset details including type
SELECT
    dataset_name,
    dataset_type,
    created,
    owner
FROM sys.datasets
WHERE dataset_type IN ('VIRTUAL_DATASET', 'PHYSICAL_DATASET');

-- Reflection information
SELECT
    reflection_id,
    reflection_name,
    dataset_name,
    type,
    status
FROM sys.reflections;

-- Job history for usage patterns
SELECT
    job_id,
    query_type,
    dataset,
    start_time,
    finish_time,
    row_count
FROM sys.jobs
WHERE start_time > CURRENT_TIMESTAMP - INTERVAL '30' DAY;

-- Source connections
SELECT
    name as source_name,
    type as source_type,
    config
FROM sys.sources;

REST API Access

Programmatic catalog access:

import requests

base_url = "https://dremio-instance/api/v3"
headers = {"Authorization": f"Bearer {token}"}

# List catalog entries
response = requests.get(
    f"{base_url}/catalog",
    headers=headers
)
catalog = response.json()

# Get dataset details
dataset_id = "dataset-uuid"
response = requests.get(
    f"{base_url}/catalog/{dataset_id}",
    headers=headers
)
dataset_info = response.json()

# Dataset schema: the catalog entity response includes a "fields" list
fields = dataset_info.get("fields", [])
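The catalog endpoint returns only one level at a time, so a full inventory means walking container entries recursively. A minimal sketch, with the traversal separated from HTTP so it can run offline: `fetch(entity_id)` is assumed to return the parsed JSON for one entity (the root listing when passed `None`), mirroring the requests calls above; the `"data"`/`"children"`/`"type"` keys follow the v3 catalog response shape.

```python
def walk_catalog(fetch, entity=None):
    """Yield every dataset entry reachable from `entity` (root if None).

    `fetch(entity_id)` must return parsed catalog JSON; the root listing
    uses a "data" key, container entities use "children".
    """
    if entity is None:
        entity = fetch(None)
    children = entity.get("data") or entity.get("children") or []
    for child in children:
        if child.get("type") == "DATASET":
            yield child
        elif child.get("type") == "CONTAINER":
            # Descend into sources, spaces, and folders.
            yield from walk_catalog(fetch, fetch(child["id"]))


# A fetch implementation backed by the REST calls shown earlier:
# def fetch(entity_id):
#     url = f"{base_url}/catalog" + (f"/{entity_id}" if entity_id else "")
#     return requests.get(url, headers=headers).json()
```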

Building Semantic Models

Dataset to Entity Mapping

Transform Dremio metadata into semantic concepts:

Physical datasets become source entities:

  • Direct mapping to underlying data
  • Respect source schema structure
  • Include source type context

Virtual datasets become curated entities:

  • Transformations are pre-applied
  • May represent business-ready views
  • Include definition for lineage

Column mapping:

  • VARCHAR, TEXT become text dimensions
  • INTEGER, BIGINT, DECIMAL become candidate measures (keep ID columns as dimensions)
  • DATE, TIMESTAMP become time dimensions
  • BOOLEAN becomes categorical dimension
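The mapping rules above can be sketched as a small classifier. The type names follow INFORMATION_SCHEMA conventions; the ID-suffix heuristic and the role labels are illustrative, not a fixed format.

```python
def semantic_role(data_type, column_name=""):
    """Map a Dremio column type to a semantic-layer role (sketch)."""
    t = data_type.upper()
    if t in ("VARCHAR", "TEXT", "CHAR"):
        return "text_dimension"
    if t in ("DATE", "TIMESTAMP", "TIME"):
        return "time_dimension"
    if t == "BOOLEAN":
        return "categorical_dimension"
    if t in ("INTEGER", "BIGINT", "DECIMAL", "DOUBLE", "FLOAT"):
        # Numeric columns are only candidate measures; identifiers
        # (customer_id, order_id, ...) should stay dimensions.
        if column_name.lower().endswith("_id"):
            return "identifier_dimension"
        return "measure"
    return "unclassified"
```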

Understanding Virtual Dataset Lineage

Extract virtual dataset definitions:

-- Get view definitions
SELECT
    TABLE_SCHEMA,
    TABLE_NAME,
    VIEW_DEFINITION
FROM INFORMATION_SCHEMA.VIEWS
WHERE TABLE_SCHEMA = 'analytics';
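A first-pass lineage sketch can pull table references straight out of the VIEW_DEFINITION text. This regex approach is deliberately naive: it only catches simple FROM/JOIN clauses and misses subqueries, quoted identifiers, and CTE names it should exclude; a production implementation should use a real SQL parser.

```python
import re

def extract_source_tables(view_definition):
    """Naively list table references in a virtual dataset's SQL."""
    pattern = r'\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)'
    refs = re.findall(pattern, view_definition, flags=re.IGNORECASE)
    # Preserve first-seen order, drop duplicates.
    return list(dict.fromkeys(refs))
```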

Map lineage for semantic understanding:

entity: CustomerMetrics
  source: analytics.customer_metrics_vds
  type: virtual_dataset
  description: "Aggregated customer metrics from virtual dataset"

  lineage:
    # From virtual dataset definition
    source_tables:
      - source.customers
      - source.transactions

  dimensions:
    - name: customer_segment
      source_column: segment

  measures:
    - name: total_spend
      calculation: SUM(lifetime_value)

Reflection-Aware Semantics

Understand acceleration for optimization:

entity: SalesTransactions
  source: analytics.sales_fact
  description: "Sales transaction fact table"

  # Note reflections for performance context
  accelerations:
    - type: aggregation
      dimensions: [date, region, product]
      measures: [sum_revenue, count_orders]

  # Semantic layer can hint at optimized dimensions
  recommended_dimensions:
    - date
    - region
    - product
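Dremio decides reflection matching itself at query time, but a semantic layer can run the same coverage check to hint users toward accelerated dimension combinations. A minimal sketch, assuming a reflection record shaped like the aggregation example above:

```python
def reflection_covers(reflection, dimensions, measures):
    """True if an aggregation reflection covers the requested query shape.

    `reflection` is a dict with "dimensions" and "measures" lists, as in
    the acceleration example above; coverage means the requested fields
    are subsets of what the reflection materializes.
    """
    return (set(dimensions) <= set(reflection["dimensions"])
            and set(measures) <= set(reflection["measures"]))


# Example:
reflection = {"dimensions": ["date", "region", "product"],
              "measures": ["sum_revenue", "count_orders"]}
# reflection_covers(reflection, ["date", "region"], ["sum_revenue"])  -> True
```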

Implementation Approach

Step 1: Configure Connection

Set up Dremio connectivity:

from pyodbc import connect

# ODBC connection
conn = connect(
    "Driver={Dremio ODBC Driver};"
    "HOST=dremio-coordinator;"
    "PORT=31010;"
    "UID=metadata_reader;"
    "PWD=secure_password;"
    "SSL=1;"
)

# Or using Arrow Flight (via pyarrow)
from pyarrow import flight

client = flight.FlightClient("grpc+tls://dremio-coordinator:32010")
token_pair = client.authenticate_basic_token(
    'metadata_reader', 'secure_password'
)
options = flight.FlightCallOptions(headers=[token_pair])

Step 2: Enumerate Catalog

Discover available datasets:

def get_spaces(conn):
    """List all spaces and sources."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT DISTINCT CATALOG_NAME
        FROM INFORMATION_SCHEMA.SCHEMATA
        ORDER BY CATALOG_NAME
    """)
    return [row[0] for row in cursor.fetchall()]

def get_datasets(conn, catalog, schema):
    """List datasets in a schema."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT TABLE_NAME, TABLE_TYPE
        FROM INFORMATION_SCHEMA.TABLES
        WHERE TABLE_CATALOG = ?
            AND TABLE_SCHEMA = ?
    """, (catalog, schema))
    return cursor.fetchall()

Step 3: Extract Schema Metadata

Gather comprehensive metadata:

def extract_columns(conn, catalog, schema, table):
    """Extract column metadata."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT
            COLUMN_NAME,
            DATA_TYPE,
            IS_NULLABLE,
            ORDINAL_POSITION
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_CATALOG = ?
            AND TABLE_SCHEMA = ?
            AND TABLE_NAME = ?
        ORDER BY ORDINAL_POSITION
    """, (catalog, schema, table))
    return cursor.fetchall()

def extract_virtual_dataset_definition(conn, schema, table):
    """Get virtual dataset SQL definition."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT VIEW_DEFINITION
        FROM INFORMATION_SCHEMA.VIEWS
        WHERE TABLE_SCHEMA = ?
            AND TABLE_NAME = ?
    """, (schema, table))
    result = cursor.fetchone()
    return result[0] if result else None

Step 4: Capture Dremio-Specific Metadata

Extract platform-specific information:

def get_reflections(conn, dataset_path):
    """Get reflection information for a dataset."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT
            reflection_id,
            reflection_name,
            type,
            status
        FROM sys.reflections
        WHERE dataset_name = ?
    """, (dataset_path,))
    return cursor.fetchall()

def get_dataset_type(conn, dataset_path):
    """Determine if dataset is physical or virtual."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT dataset_type
        FROM sys.datasets
        WHERE dataset_name = ?
    """, (dataset_path,))
    result = cursor.fetchone()
    return result[0] if result else None
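Steps 2 through 4 can be tied together into a single model-assembly pass. In this sketch the assembly is a pure function over rows shaped like the cursors above return, so it runs without a live Dremio connection; the output structure is illustrative, not a fixed format.

```python
def assemble_model(catalog, schema, datasets, columns_by_table):
    """Build a simple metadata model from extracted rows.

    `datasets` is a list of (TABLE_NAME, TABLE_TYPE) rows as returned by
    get_datasets(); `columns_by_table` maps table name to rows shaped
    like extract_columns() output.
    """
    entities = []
    for table_name, table_type in datasets:
        entities.append({
            "name": table_name,
            "path": f"{catalog}.{schema}.{table_name}",
            "kind": "virtual" if table_type == "VIEW" else "physical",
            "columns": [
                {"name": name, "type": dtype, "nullable": nullable == "YES"}
                for name, dtype, nullable, _pos
                in columns_by_table.get(table_name, [])
            ],
        })
    return {"catalog": catalog, "schema": schema, "entities": entities}
```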

Step 5: Generate Semantic Model

Transform to semantic definitions:

semantic_model:
  name: dremio_analytics

  sources:
    - name: s3_datalake
      type: physical
      path: "s3.analytics"

    - name: curated_views
      type: virtual
      path: "analytics_space.curated"

  entities:
    - name: Customers
      source: analytics_space.curated.customers_vds
      type: virtual_dataset
      description: "Curated customer view"

      dimensions:
        - name: customer_name
          source_column: name
          type: text

        - name: region
          source_column: region_code
          type: categorical

        - name: signup_date
          source_column: created_at
          type: time
          grain: day

      measures:
        - name: customer_count
          calculation: COUNT(DISTINCT customer_id)

    - name: Transactions
      source: s3_datalake.transactions
      type: physical_dataset
      description: "Raw transaction data from data lake"

      relationships:
        - name: customer
          target: Customers
          join:
            source: customer_id
            target: customer_id

Common Challenges

Virtual Dataset Complexity

Complex virtual datasets with many transformations:

Solutions:

  • Extract view definitions for lineage
  • Document transformation logic
  • Use virtual datasets as semantic sources
  • Simplify complex chains

Source Heterogeneity

Diverse source types with different capabilities:

Solutions:

  • Normalize metadata across sources
  • Document source-specific limitations
  • Use Dremio as abstraction layer
  • Focus on promoted/virtual datasets

Reflection Transparency

Reflections accelerate queries transparently, so their effect is invisible to query authors:

Solutions:

  • Extract reflection metadata for context
  • Align semantic dimensions with reflections
  • Document acceleration patterns
  • Monitor query performance

Catalog Scale

Large catalogs with many datasets:

Solutions:

  • Focus on curated spaces
  • Filter by dataset type
  • Prioritize by usage
  • Implement incremental extraction
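Incremental extraction can be as simple as diffing two catalog snapshots so only new or changed datasets are re-processed. A sketch, assuming a snapshot is a mapping from dataset path to its column list (which the extraction functions above can produce):

```python
def diff_snapshots(previous, current):
    """Return dataset paths that are new, changed, or removed.

    `previous` and `current` map dataset path -> column list; only
    entries in "added" or "changed" need re-extraction.
    """
    added = [p for p in current if p not in previous]
    removed = [p for p in previous if p not in current]
    changed = [p for p in current
               if p in previous and previous[p] != current[p]]
    return {"added": added, "changed": changed, "removed": removed}
```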

The Value of Dremio Metadata Extraction

Automated metadata extraction from Dremio delivers:

Unified visibility: See all data sources through Dremio's catalog.

Curation awareness: Leverage virtual datasets for clean semantic models.

Acceleration context: Understand reflections for optimization.

Self-service enablement: Build semantic layers on user-friendly data.

Organizations that automate Dremio metadata extraction build semantic layers that leverage data virtualization - enabling AI-powered analytics that access curated, accelerated data regardless of where it physically resides.

Questions

What interfaces does Dremio offer for metadata extraction?

Dremio provides metadata through INFORMATION_SCHEMA views for standard schema access, sys tables for internal metadata including reflections and jobs, and the REST API for programmatic access. Virtual datasets, physical datasets, and spaces are all accessible through these interfaces.
