Metadata Extraction from Dremio: Data Lakehouse Schema Discovery for Semantic Layers
Learn how to extract metadata from Dremio for building semantic layers, including catalog discovery, data reflections, and unified metadata for self-service analytics.
Metadata extraction from Dremio is the process of programmatically reading schema structures, dataset definitions, and catalog information from Dremio's data lakehouse platform to build intelligent semantic layers. Dremio provides a unified catalog over diverse data sources - from object storage to databases - enabling semantic layers that leverage virtualized, accelerated data access.
This metadata forms the foundation for AI-powered analytics, allowing semantic layers to understand Dremio's curated data landscape and generate optimized queries.
Dremio Metadata Architecture
Catalog Organization
Dremio organizes data in a hierarchical catalog:
Dremio Catalog
├── Sources (connections to data)
│ ├── Data lakes (S3, ADLS, HDFS)
│ ├── Databases (PostgreSQL, MySQL)
│ └── File systems
├── Spaces (organizational folders)
│ └── Virtual Datasets (views)
└── Home folders (user workspaces)
Dataset Types
Dremio manages different dataset types:
Physical datasets: Direct access to source data
Virtual datasets: Views with transformations applied
Promoted datasets: Files promoted to tables
Reflections: Accelerated materializations
Codd AI Integrations connect to Dremio's catalog to extract comprehensive metadata, building semantic layers that leverage Dremio's data virtualization capabilities.
Metadata Extraction Methods
INFORMATION_SCHEMA Queries
Standard SQL metadata access:
-- List all catalogs (sources and spaces)
SELECT CATALOG_NAME, CATALOG_DESCRIPTION
FROM INFORMATION_SCHEMA.CATALOGS;
-- Discover schemas
SELECT
CATALOG_NAME,
SCHEMA_NAME,
SCHEMA_OWNER
FROM INFORMATION_SCHEMA.SCHEMATA
ORDER BY CATALOG_NAME, SCHEMA_NAME;
-- List all tables and views
SELECT
TABLE_CATALOG,
TABLE_SCHEMA,
TABLE_NAME,
TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA NOT LIKE 'sys%'
ORDER BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME;
-- Column metadata
SELECT
TABLE_CATALOG,
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME,
ORDINAL_POSITION,
DATA_TYPE,
IS_NULLABLE,
CHARACTER_MAXIMUM_LENGTH,
NUMERIC_PRECISION,
NUMERIC_SCALE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'analytics'
ORDER BY TABLE_NAME, ORDINAL_POSITION;
System Tables
Access Dremio-specific metadata:
-- Dataset details including type
SELECT
dataset_name,
dataset_type,
created,
owner
FROM sys.datasets
WHERE dataset_type IN ('VIRTUAL_DATASET', 'PHYSICAL_DATASET');
-- Reflection information
SELECT
reflection_id,
reflection_name,
dataset_name,
type,
status
FROM sys.reflections;
-- Job history for usage patterns
SELECT
job_id,
query_type,
dataset,
start_time,
finish_time,
row_count
FROM sys.jobs
WHERE start_time > CURRENT_TIMESTAMP - INTERVAL '30' DAY;
-- Source connections
SELECT
name as source_name,
type as source_type,
config
FROM sys.sources;
REST API Access
Programmatic catalog access:
import requests

# Authenticate with a personal access token (placeholder value)
token = "YOUR_PERSONAL_ACCESS_TOKEN"
base_url = "https://dremio-instance/api/v3"
headers = {"Authorization": f"Bearer {token}"}
# List catalog entries
response = requests.get(
f"{base_url}/catalog",
headers=headers
)
catalog = response.json()
# Get dataset details
dataset_id = "dataset-uuid"
response = requests.get(
f"{base_url}/catalog/{dataset_id}",
headers=headers
)
dataset_info = response.json()
# The dataset entity already carries its column schema in the
# "fields" attribute, so no separate schema request is needed
schema = dataset_info.get("fields", [])
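For a full inventory, the catalog tree can be walked recursively. The sketch below assumes the v3 response shape (top-level entries under "data", container children under "children", each entry carrying "id", "path", and "type"); the `fetch` callable is injected so it can wrap `requests` in production or a stub in tests:

```python
def walk_catalog(fetch, entry_id=None, path=()):
    """Recursively walk the Dremio catalog, yielding (path, entry) for
    every dataset. `fetch(entry_id)` must return the parsed JSON for
    GET /catalog (entry_id=None) or GET /catalog/{id}."""
    payload = fetch(entry_id)
    # Root listing uses "data"; container entities use "children"
    for child in payload.get("children", payload.get("data", [])):
        child_path = tuple(child.get("path", path))
        if child.get("type") == "DATASET":
            yield child_path, child
        elif child.get("type") == "CONTAINER":
            yield from walk_catalog(fetch, child["id"], child_path)
```

In production, `fetch` would be a small wrapper such as `lambda i: requests.get(f"{base_url}/catalog" + (f"/{i}" if i else ""), headers=headers).json()`.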
Building Semantic Models
Dataset to Entity Mapping
Transform Dremio metadata into semantic concepts:
Physical datasets become source entities:
- Direct mapping to underlying data
- Respect source schema structure
- Include source type context
Virtual datasets become curated entities:
- Transformations are pre-applied
- May represent business-ready views
- Include definition for lineage
Column mapping:
- VARCHAR, TEXT become text dimensions
- INTEGER, BIGINT, DECIMAL become measures
- DATE, TIMESTAMP become time dimensions
- BOOLEAN becomes categorical dimension
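The column-mapping rules above can be expressed as a small classifier. This sketch extends the listed types with a few common ones (CHAR, FLOAT, DOUBLE, TIME), which is an assumption about the source schemas rather than a Dremio requirement:

```python
TEXT_TYPES = {"VARCHAR", "TEXT", "CHAR"}
NUMERIC_TYPES = {"INTEGER", "BIGINT", "DECIMAL", "FLOAT", "DOUBLE"}
TIME_TYPES = {"DATE", "TIMESTAMP", "TIME"}

def semantic_role(data_type):
    """Map a DATA_TYPE string from INFORMATION_SCHEMA.COLUMNS
    to a semantic-layer role."""
    t = data_type.upper()
    if t in TEXT_TYPES:
        return "text_dimension"
    if t in NUMERIC_TYPES:
        return "measure"
    if t in TIME_TYPES:
        return "time_dimension"
    if t == "BOOLEAN":
        return "categorical_dimension"
    return "attribute"  # fallback for unmapped types
```

Numeric ID columns will land in "measure" under these rules, so the output still needs human curation.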
Understanding Virtual Dataset Lineage
Extract virtual dataset definitions:
-- Get view definitions
SELECT
TABLE_SCHEMA,
TABLE_NAME,
VIEW_DEFINITION
FROM INFORMATION_SCHEMA.VIEWS
WHERE TABLE_SCHEMA = 'analytics';
Map lineage for semantic understanding:
entity: CustomerMetrics
source: analytics.customer_metrics_vds
type: virtual_dataset
description: "Aggregated customer metrics from virtual dataset"
lineage:
  # From virtual dataset definition
  source_tables:
    - source.customers
    - source.transactions
dimensions:
  - name: customer_segment
    source_column: segment
measures:
  - name: total_spend
    calculation: SUM(lifetime_value)
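Populating the `source_tables` list means pulling table references out of the VIEW_DEFINITION text. A real pipeline should use a SQL parser; the regex sketch below is deliberately naive and only handles simple FROM/JOIN clauses:

```python
import re

# Matches the identifier following FROM or JOIN; does not handle
# subqueries, CTEs, or quoted identifiers with embedded dots
TABLE_REF = re.compile(r'\b(?:FROM|JOIN)\s+([\w"\.]+)', re.IGNORECASE)

def source_tables(view_definition):
    """Extract distinct source-table references from a view definition,
    preserving first-seen order."""
    seen, out = set(), []
    for ref in TABLE_REF.findall(view_definition):
        name = ref.strip('"')
        if name.lower() not in seen:
            seen.add(name.lower())
            out.append(name)
    return out
```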
Reflection-Aware Semantics
Understand acceleration for optimization:
entity: SalesTransactions
source: analytics.sales_fact
description: "Sales transaction fact table"
# Note reflections for performance context
accelerations:
  - type: aggregation
    dimensions: [date, region, product]
    measures: [sum_revenue, count_orders]
# Semantic layer can hint at optimized dimensions
recommended_dimensions:
  - date
  - region
  - product
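Deriving `recommended_dimensions` from reflection metadata can be sketched as a set intersection. The reflection dicts below follow the shape shown above, which is an assumption about how you store extracted metadata, not the exact sys.reflections layout:

```python
def recommended_dimensions(entity_dimensions, reflections):
    """Return the entity dimensions covered by an aggregation
    reflection, so queries grouping on them can be served from the
    materialization rather than the raw data."""
    covered = set()
    for r in reflections:
        if r.get("type") == "aggregation":
            covered.update(r.get("dimensions", []))
    # Preserve the entity's own dimension order
    return [d for d in entity_dimensions if d in covered]
```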
Implementation Approach
Step 1: Configure Connection
Set up Dremio connectivity:
from pyodbc import connect

# ODBC connection
conn = connect(
    "Driver={Dremio ODBC Driver};"
    "HOST=dremio-coordinator;"
    "PORT=31010;"
    "UID=metadata_reader;"
    "PWD=secure_password;"
    "SSL=1;"
)

# Or using Arrow Flight via pyarrow (a sketch; the exact auth flow
# can vary by Dremio and pyarrow version)
from pyarrow import flight

client = flight.FlightClient("grpc+tls://dremio-coordinator:32010")
token_pair = client.authenticate_basic_token("metadata_reader", "secure_password")
options = flight.FlightCallOptions(headers=[token_pair])
Step 2: Enumerate Catalog
Discover available datasets:
def get_spaces(conn):
    """List all spaces and sources."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT DISTINCT CATALOG_NAME
        FROM INFORMATION_SCHEMA.SCHEMATA
        ORDER BY CATALOG_NAME
    """)
    return [row[0] for row in cursor.fetchall()]

def get_datasets(conn, catalog, schema):
    """List datasets in a schema."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT TABLE_NAME, TABLE_TYPE
        FROM INFORMATION_SCHEMA.TABLES
        WHERE TABLE_CATALOG = ?
          AND TABLE_SCHEMA = ?
    """, (catalog, schema))
    return cursor.fetchall()
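The two helpers above can be replaced by a single pass over INFORMATION_SCHEMA.TABLES that groups every dataset by its catalog and schema. This sketch assumes any DB-API-style connection whose cursor can run the query:

```python
def enumerate_catalog(conn):
    """Collect dataset descriptors for the whole catalog, keyed by
    (catalog, schema)."""
    cursor = conn.cursor()
    cursor.execute(
        "SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE "
        "FROM INFORMATION_SCHEMA.TABLES "
        "ORDER BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME"
    )
    datasets = {}
    for catalog, schema, table, table_type in cursor.fetchall():
        datasets.setdefault((catalog, schema), []).append(
            {"name": table, "type": table_type}
        )
    return datasets
```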
Step 3: Extract Schema Metadata
Gather comprehensive metadata:
def extract_columns(conn, catalog, schema, table):
    """Extract column metadata."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT
            COLUMN_NAME,
            DATA_TYPE,
            IS_NULLABLE,
            ORDINAL_POSITION
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_CATALOG = ?
          AND TABLE_SCHEMA = ?
          AND TABLE_NAME = ?
        ORDER BY ORDINAL_POSITION
    """, (catalog, schema, table))
    return cursor.fetchall()

def extract_virtual_dataset_definition(conn, schema, table):
    """Get virtual dataset SQL definition."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT VIEW_DEFINITION
        FROM INFORMATION_SCHEMA.VIEWS
        WHERE TABLE_SCHEMA = ?
          AND TABLE_NAME = ?
    """, (schema, table))
    result = cursor.fetchone()
    return result[0] if result else None
Step 4: Capture Dremio-Specific Metadata
Extract platform-specific information:
def get_reflections(conn, dataset_path):
    """Get reflection information for a dataset."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT
            reflection_id,
            reflection_name,
            type,
            status
        FROM sys.reflections
        WHERE dataset_name = ?
    """, (dataset_path,))
    return cursor.fetchall()

def get_dataset_type(conn, dataset_path):
    """Determine if dataset is physical or virtual."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT dataset_type
        FROM sys.datasets
        WHERE dataset_name = ?
    """, (dataset_path,))
    result = cursor.fetchone()
    return result[0] if result else None
Step 5: Generate Semantic Model
Transform to semantic definitions:
semantic_model:
  name: dremio_analytics
  sources:
    - name: s3_datalake
      type: physical
      path: "s3.analytics"
    - name: curated_views
      type: virtual
      path: "analytics_space.curated"
  entities:
    - name: Customers
      source: analytics_space.curated.customers_vds
      type: virtual_dataset
      description: "Curated customer view"
      dimensions:
        - name: customer_name
          source_column: name
          type: text
        - name: region
          source_column: region_code
          type: categorical
        - name: signup_date
          source_column: created_at
          type: time
          grain: day
      measures:
        - name: customer_count
          calculation: COUNT(DISTINCT customer_id)
    - name: Transactions
      source: s3_datalake.transactions
      type: physical_dataset
      description: "Raw transaction data from data lake"
      relationships:
        - name: customer
          target: Customers
          join:
            source: customer_id
            target: customer_id
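Assembling entities like those above from the extracted column metadata can be sketched as a pure transformation. The dimension/measure split follows the type-mapping rules earlier in this article; the function and field names are illustrative, not a fixed schema:

```python
def build_entity(name, source, dataset_type, columns):
    """Build a semantic-entity dict from (COLUMN_NAME, DATA_TYPE)
    tuples as returned by the column-extraction query."""
    entity = {"name": name, "source": source, "type": dataset_type,
              "dimensions": [], "measures": []}
    for col, dtype in columns:
        t = dtype.upper()
        if t in {"INTEGER", "BIGINT", "DECIMAL"}:
            # Numeric columns default to summed measures; ID columns
            # will need manual reclassification
            entity["measures"].append(
                {"name": f"sum_{col}", "calculation": f"SUM({col})"})
        elif t in {"DATE", "TIMESTAMP"}:
            entity["dimensions"].append({"name": col, "type": "time"})
        else:
            entity["dimensions"].append({"name": col, "type": "text"})
    return entity
```

Serializing the resulting dict with a YAML library yields definitions in the shape shown above.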
Common Challenges
Virtual Dataset Complexity
Virtual datasets can chain many layers of transformations, obscuring lineage:
Solutions:
- Extract view definitions for lineage
- Document transformation logic
- Use virtual datasets as semantic sources
- Simplify complex chains
Source Heterogeneity
Diverse source types with different capabilities:
Solutions:
- Normalize metadata across sources
- Document source-specific limitations
- Use Dremio as abstraction layer
- Focus on promoted/virtual datasets
Reflection Transparency
Reflections accelerate queries transparently, so their effect is invisible in query text:
Solutions:
- Extract reflection metadata for context
- Align semantic dimensions with reflections
- Document acceleration patterns
- Monitor query performance
Catalog Scale
Large catalogs with many datasets:
Solutions:
- Focus on curated spaces
- Filter by dataset type
- Prioritize by usage
- Implement incremental extraction
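Prioritizing by usage can build directly on the sys.jobs query shown earlier: count how often each dataset appears in recent job history and extract the busiest ones first. A minimal sketch, assuming you have fetched the `dataset` column into a list:

```python
from collections import Counter

def prioritize_datasets(job_dataset_values, top_n=10):
    """Rank dataset names by how often they appear in recent job
    history, most-queried first."""
    counts = Counter(job_dataset_values)
    return [name for name, _ in counts.most_common(top_n)]
```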
The Value of Dremio Metadata Extraction
Automated metadata extraction from Dremio delivers:
Unified visibility: See all data sources through Dremio's catalog.
Curation awareness: Leverage virtual datasets for clean semantic models.
Acceleration context: Understand reflections for optimization.
Self-service enablement: Build semantic layers on user-friendly data.
Organizations that automate Dremio metadata extraction build semantic layers that leverage data virtualization - enabling AI-powered analytics that access curated, accelerated data regardless of where it physically resides.
Questions
What interfaces does Dremio provide for metadata extraction?
Dremio provides metadata through INFORMATION_SCHEMA views for standard schema access, sys tables for internal metadata including reflections and jobs, and the REST API for programmatic access. Virtual datasets, physical datasets, and spaces are all accessible through these interfaces.