Metadata Extraction from Azure SQL: Cloud Database Schema Discovery for Semantic Layers

Learn how to extract metadata from Azure SQL Database and Azure Synapse Analytics for building semantic layers, including system catalog queries and cloud-specific considerations.

7 min read·

Metadata extraction from Azure SQL is the process of programmatically reading schema structures, table definitions, column information, and relationships from Azure SQL Database or Azure Synapse Analytics to build intelligent semantic layers. Azure SQL provides standard SQL Server system catalogs alongside cloud-specific views, enabling comprehensive metadata discovery for modern cloud analytics.

This metadata forms the foundation for AI-powered analytics, allowing semantic layers to understand your cloud database structure and generate optimized T-SQL queries.

Azure SQL Metadata Architecture

Azure SQL Database

Azure SQL Database uses SQL Server's system catalog:

System views (sys schema):

  • sys.tables, sys.views
  • sys.columns, sys.types
  • sys.foreign_keys, sys.foreign_key_columns
  • sys.indexes, sys.index_columns
  • sys.extended_properties

INFORMATION_SCHEMA:

  • Standard SQL interface
  • Cross-platform compatibility
  • Simplified metadata access

Azure Synapse Analytics

Synapse adds data warehouse-specific views:

  • sys.pdw_table_distribution_properties
  • sys.pdw_column_distribution_properties
  • sys.external_tables, sys.external_data_sources
  • sys.workload_management_workload_groups

Codd AI Integrations connect to Azure SQL and Synapse to extract this metadata automatically, building semantic layers optimized for your cloud analytics environment.

Metadata Extraction Queries

Schema and Table Discovery

Discover database structure:

-- List all schemas
SELECT schema_id, name as schema_name
FROM sys.schemas
WHERE name NOT IN ('sys', 'INFORMATION_SCHEMA', 'guest');

-- Discover tables with row counts
SELECT
    s.name as schema_name,
    t.name as table_name,
    t.create_date,
    t.modify_date,
    p.rows as row_count
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.partitions p ON t.object_id = p.object_id
WHERE p.index_id IN (0, 1)  -- Heap or clustered index
    AND s.name = 'dbo'
ORDER BY t.name;

-- Views in schema
SELECT
    s.name as schema_name,
    v.name as view_name,
    v.create_date
FROM sys.views v
JOIN sys.schemas s ON v.schema_id = s.schema_id
WHERE s.name = 'dbo';

-- Extended properties (descriptions)
SELECT
    OBJECT_NAME(ep.major_id) as object_name,
    ep.name as property_name,
    ep.value as property_value
FROM sys.extended_properties ep
WHERE ep.class = 1  -- Object or column
    AND ep.name = 'MS_Description';

Column Metadata

Extract detailed column information:

-- Column definitions
SELECT
    s.name as schema_name,
    t.name as table_name,
    c.name as column_name,
    c.column_id,
    ty.name as data_type,
    c.max_length,
    c.precision,
    c.scale,
    c.is_nullable,
    c.is_identity,
    c.is_computed,
    ep.value as column_description
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.types ty ON c.user_type_id = ty.user_type_id
LEFT JOIN sys.extended_properties ep
    ON ep.major_id = c.object_id
    AND ep.minor_id = c.column_id
    AND ep.name = 'MS_Description'
WHERE s.name = 'dbo'
ORDER BY t.name, c.column_id;

-- Computed column expressions
SELECT
    OBJECT_NAME(object_id) as table_name,
    name as column_name,
    definition as expression
FROM sys.computed_columns
WHERE OBJECT_SCHEMA_NAME(object_id) = 'dbo';

Constraint Discovery

Extract relationship information:

-- Primary keys
SELECT
    s.name as schema_name,
    t.name as table_name,
    kc.name as constraint_name,
    c.name as column_name,
    ic.key_ordinal
FROM sys.key_constraints kc
JOIN sys.tables t ON kc.parent_object_id = t.object_id
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.index_columns ic ON kc.parent_object_id = ic.object_id
    AND kc.unique_index_id = ic.index_id
JOIN sys.columns c ON ic.object_id = c.object_id
    AND ic.column_id = c.column_id
WHERE kc.type = 'PK'
    AND s.name = 'dbo'
ORDER BY t.name, ic.key_ordinal;

-- Foreign keys with references
SELECT
    s.name as schema_name,
    t.name as table_name,
    fk.name as constraint_name,
    COL_NAME(fkc.parent_object_id, fkc.parent_column_id) as column_name,
    OBJECT_NAME(fk.referenced_object_id) as referenced_table,
    COL_NAME(fkc.referenced_object_id, fkc.referenced_column_id) as referenced_column
FROM sys.foreign_keys fk
JOIN sys.tables t ON fk.parent_object_id = t.object_id
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.foreign_key_columns fkc ON fk.object_id = fkc.constraint_object_id
WHERE s.name = 'dbo'
ORDER BY t.name, fk.name;

-- Unique constraints
SELECT
    s.name as schema_name,
    t.name as table_name,
    kc.name as constraint_name,
    STRING_AGG(c.name, ', ') WITHIN GROUP (ORDER BY ic.key_ordinal) as columns
FROM sys.key_constraints kc
JOIN sys.tables t ON kc.parent_object_id = t.object_id
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.index_columns ic ON kc.parent_object_id = ic.object_id
    AND kc.unique_index_id = ic.index_id
JOIN sys.columns c ON ic.object_id = c.object_id
    AND ic.column_id = c.column_id
WHERE kc.type = 'UQ'
    AND s.name = 'dbo'
GROUP BY s.name, t.name, kc.name;

Azure Synapse-Specific Metadata

Extract Synapse distribution information:

-- Distribution type per table
SELECT
    s.name as schema_name,
    t.name as table_name,
    tdp.distribution_policy_desc as distribution_type
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.pdw_table_distribution_properties tdp
    ON t.object_id = tdp.object_id
WHERE s.name = 'dbo';

-- Distribution column
SELECT
    s.name as schema_name,
    t.name as table_name,
    c.name as distribution_column
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.pdw_column_distribution_properties cdp
    ON t.object_id = cdp.object_id
JOIN sys.columns c
    ON cdp.object_id = c.object_id
    AND cdp.column_id = c.column_id
WHERE cdp.distribution_ordinal = 1
    AND s.name = 'dbo';

-- External tables (PolyBase)
SELECT
    s.name as schema_name,
    et.name as table_name,
    eds.name as data_source_name,
    et.location
FROM sys.external_tables et
JOIN sys.schemas s ON et.schema_id = s.schema_id
JOIN sys.external_data_sources eds
    ON et.data_source_id = eds.data_source_id;

Building Semantic Models

Table to Entity Mapping

Transform Azure SQL metadata into semantic concepts:

Tables become entities:

  • Primary key defines entity identity
  • Extended properties provide descriptions
  • Distribution strategy informs query optimization

Columns map to attributes:

  • nvarchar, varchar become text dimensions
  • int, bigint, decimal become measures
  • date, datetime2 become time dimensions
  • bit becomes boolean dimension

Relationship Mapping

Leverage SQL Server's constraint system:

entity: Orders
  source: dbo.Orders
  description: "Customer order transactions"

  relationships:
    - name: customer
      type: many_to_one
      target: Customers
      join:
        source_column: CustomerID
        target_column: CustomerID

Synapse Optimization

Use distribution metadata for query optimization:

entity: SalesTransactions
  source: dbo.FactSales
  distribution: HASH(CustomerID)

  # Semantic layer can optimize joins
  # with tables distributed on same key

Implementation Approach

Step 1: Configure Authentication

Set up Azure SQL connectivity:

import pyodbc

# Using SQL authentication
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=server.database.windows.net;"
    "Database=analytics;"
    "Uid=metadata_reader;"
    "Pwd=secure_password;"
    "Encrypt=yes;"
    "TrustServerCertificate=no;"
)

# Using Azure AD service principal
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=server.database.windows.net;"
    "Database=analytics;"
    "Authentication=ActiveDirectoryServicePrincipal;"
    f"UID={client_id};"
    f"PWD={client_secret};"
)

conn = pyodbc.connect(conn_str)

Step 2: Create Extraction User

Configure appropriate permissions:

-- Create extraction user
CREATE USER metadata_reader FROM LOGIN metadata_reader;

-- Grant read permissions
GRANT VIEW DEFINITION ON SCHEMA::dbo TO metadata_reader;
GRANT SELECT ON SCHEMA::dbo TO metadata_reader;

-- For extended properties
GRANT VIEW ANY DEFINITION TO metadata_reader;

Step 3: Extract Comprehensive Metadata

Build complete schema picture:

def extract_tables(conn, schema='dbo'):
    """Extract table metadata with descriptions."""
    query = """
        SELECT
            t.name as table_name,
            p.rows as row_count,
            ep.value as description
        FROM sys.tables t
        JOIN sys.schemas s ON t.schema_id = s.schema_id
        JOIN sys.partitions p ON t.object_id = p.object_id
        LEFT JOIN sys.extended_properties ep
            ON ep.major_id = t.object_id
            AND ep.minor_id = 0
            AND ep.name = 'MS_Description'
        WHERE s.name = ?
            AND p.index_id IN (0, 1)
    """
    cursor = conn.cursor()
    cursor.execute(query, schema)
    return cursor.fetchall()

Step 4: Handle Azure-Specific Features

Account for cloud capabilities:

Temporal tables:

SELECT
    t.name as table_name,
    t.temporal_type_desc,
    ht.name as history_table
FROM sys.tables t
LEFT JOIN sys.tables ht
    ON t.history_table_id = ht.object_id
WHERE t.temporal_type <> 0;

Column encryption:

SELECT
    c.name as column_name,
    c.encryption_type_desc
FROM sys.columns c
WHERE c.encryption_type IS NOT NULL;

Step 5: Generate Semantic Model

Transform to semantic definitions:

entity: Customers
  source: dbo.Customers
  description: "Customer master data with contact information"
  primary_key: CustomerID

  dimensions:
    - name: customer_name
      source_column: CustomerName
      type: text

    - name: region
      source_column: Region
      type: categorical

    - name: signup_date
      source_column: CreatedDate
      type: time
      grain: day

  measures:
    - name: customer_count
      calculation: COUNT(DISTINCT CustomerID)

  relationships:
    - name: orders
      target: Orders
      type: one_to_many
      join_column: CustomerID

Common Challenges

Multi-Database Environments

Azure SQL databases are isolated:

Solutions:

  • Extract from each database separately
  • Aggregate metadata centrally
  • Use elastic database queries for cross-db views
  • Document database relationships

Synapse Serverless Pools

Dynamic schema with external data:

Solutions:

  • Query sys.external_tables for registered objects
  • Handle schema inference for parquet files
  • Document data source configurations

Performance Tier Limitations

Smaller tiers have resource constraints:

Solutions:

  • Schedule extraction during low-usage periods
  • Limit query complexity
  • Cache metadata locally
  • Use incremental extraction

The Value of Azure SQL Metadata Extraction

Automated metadata extraction from Azure SQL delivers:

Cloud-native visibility: Understand your Azure database structure.

Power BI integration: Align semantic layer with Microsoft ecosystem.

Synapse optimization: Leverage distribution metadata for performance.

Unified governance: Connect with Azure Purview and Microsoft governance tools.

Organizations that automate Azure SQL metadata extraction build semantic layers that integrate seamlessly with the Microsoft ecosystem - enabling AI-powered analytics across Azure cloud environments.

Questions

Azure SQL Database uses standard SQL Server system views including sys.tables, sys.columns, sys.foreign_keys, and INFORMATION_SCHEMA views. These provide comprehensive metadata about tables, columns, relationships, and constraints identical to on-premises SQL Server.

Related