Metadata Extraction from BigQuery: Schema Intelligence for Semantic Layers

Learn how to extract metadata from Google BigQuery for building semantic layers, including INFORMATION_SCHEMA queries, dataset discovery, and relationship identification for AI-powered analytics.


Metadata extraction from BigQuery is the process of programmatically reading dataset structures, table definitions, column information, and query patterns from Google BigQuery to build intelligent semantic layers. BigQuery's INFORMATION_SCHEMA provides comprehensive access to metadata, enabling automated discovery of your data warehouse structure and the relationships within it.

This extracted metadata forms the foundation for AI-powered analytics, allowing semantic layers to translate business questions into accurate SQL queries against your BigQuery data.

BigQuery Metadata Structure

Project and Dataset Organization

BigQuery organizes data hierarchically:

Organization
└── Project
    └── Dataset
        └── Tables, Views, Routines

Metadata extraction must traverse this hierarchy to build a complete picture.

Available Metadata

BigQuery exposes extensive metadata:

  • Project and dataset information
  • Table schemas and column definitions
  • Partitioning and clustering configuration
  • View definitions and materialized views
  • Routine (function and procedure) definitions
  • Table statistics and storage information
  • Query history and access patterns

Codd AI Integrations connect to BigQuery to extract this metadata automatically, building semantic layers that understand your entire data warehouse structure.

INFORMATION_SCHEMA Queries

Dataset-Level Metadata

Query metadata within a specific dataset:

-- List all tables in a dataset
SELECT table_name, table_type, creation_time
FROM `project.dataset.INFORMATION_SCHEMA.TABLES`;

-- Row counts and storage sizes (legacy __TABLES__ view;
-- the standard TABLES view does not expose these)
SELECT table_id, row_count, size_bytes
FROM `project.dataset.__TABLES__`;

-- Get column details
SELECT table_name, column_name, ordinal_position,
       is_nullable, data_type, column_default
FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
ORDER BY table_name, ordinal_position;

-- Partitioning information
SELECT table_name, partition_id, total_rows,
       total_logical_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`;

-- Table options including descriptions
SELECT table_name, option_name, option_value
FROM `project.dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE option_name = 'description';

Dataset-level queries provide detailed object information.
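In an extraction script, these statements are typically generated per dataset and submitted through a client library. A minimal sketch of the query-building step (the project and dataset names are placeholders, and the `client.query` call shown in the comment assumes the `google-cloud-bigquery` package):

```python
def columns_query(project: str, dataset: str) -> str:
    """Build the INFORMATION_SCHEMA.COLUMNS query for one dataset.

    Pure string construction, so the generated SQL can be logged or
    inspected before it is sent to BigQuery.
    """
    return (
        "SELECT table_name, column_name, ordinal_position, "
        "is_nullable, data_type "
        f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
        "ORDER BY table_name, ordinal_position"
    )

# With the google-cloud-bigquery client, the query would run as:
#   rows = client.query(columns_query("my-project", "analytics")).result()
```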

Regional Metadata Views

Query metadata across datasets in a region:

-- All tables in a region
SELECT table_catalog, table_schema, table_name,
       table_type, creation_time
FROM `project.region-us.INFORMATION_SCHEMA.TABLES`;

-- Column descriptions across datasets
SELECT table_schema, table_name, column_name, description
FROM `project.region-us.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE description IS NOT NULL;

-- View definitions
SELECT table_name, view_definition
FROM `project.region-us.INFORMATION_SCHEMA.VIEWS`;

Regional views enable project-wide metadata discovery.

Query Pattern Analysis

Understand how data is used:

-- Recent query patterns
SELECT
  user_email,
  query,
  referenced_tables,
  total_bytes_processed
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE';

Query history reveals actual usage patterns and common joins.
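One way to surface those common joins is to count how often pairs of tables appear in the same job's `referenced_tables` array. A sketch (the self-join on the unnested array is the key idea; thresholds and time window are illustrative):

```sql
-- Tables frequently queried together: candidate relationships
SELECT
  t1.table_id AS table_a,
  t2.table_id AS table_b,
  COUNT(*) AS co_occurrences
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`,
  UNNEST(referenced_tables) AS t1,
  UNNEST(referenced_tables) AS t2
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
  AND t1.table_id < t2.table_id  -- each pair counted once
GROUP BY table_a, table_b
ORDER BY co_occurrences DESC
LIMIT 20;
```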

Extraction Strategies

Full Discovery

Initial extraction captures complete structure:

Discovery workflow:

  1. Enumerate all accessible datasets
  2. Query INFORMATION_SCHEMA for each dataset
  3. Extract table and column metadata
  4. Capture partitioning and clustering details
  5. Retrieve descriptions and labels
  6. Analyze query patterns for relationships

Output: Comprehensive catalog of BigQuery objects.

Incremental Updates

Ongoing extraction focuses on changes:

-- Find recently modified tables (__TABLES__ exposes
-- last_modified_time as epoch milliseconds)
SELECT table_id, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `project.dataset.__TABLES__`
WHERE TIMESTAMP_MILLIS(last_modified_time) > TIMESTAMP(@last_extraction);

Change detection approaches:

  • Compare table modification times
  • Monitor dataset activity
  • Track schema version changes
  • Use Cloud Logging for DDL events
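The comparison step itself can stay independent of BigQuery: snapshot a mapping of table name to last-modified timestamp on each run and diff consecutive snapshots. A minimal sketch:

```python
def detect_changes(previous: dict, current: dict) -> dict:
    """Diff two snapshots mapping table name -> last_modified timestamp.

    Returns the tables that were added, dropped, or modified since the
    previous extraction run.
    """
    prev_keys, curr_keys = set(previous), set(current)
    return {
        "added": sorted(curr_keys - prev_keys),
        "dropped": sorted(prev_keys - curr_keys),
        "modified": sorted(
            t for t in prev_keys & curr_keys if current[t] > previous[t]
        ),
    }
```

Only the tables in `added` and `modified` need re-extraction, which keeps incremental runs cheap even across large projects.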

Cross-Project Extraction

Enterprise environments span multiple projects:

Approach:

  • Use organization-level service account
  • Iterate through accessible projects
  • Query regional INFORMATION_SCHEMA per project
  • Aggregate metadata centrally

Considerations:

  • IAM permissions across projects
  • Network connectivity
  • Extraction scheduling
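The fan-out can be planned as data before any query runs: enumerate every project/region pair and generate the regional INFORMATION_SCHEMA query for each. A sketch (project and region names are placeholders; in practice the project list might come from the client's `list_projects()` call):

```python
def regional_tables_query(project: str, region: str) -> str:
    """Regional INFORMATION_SCHEMA query covering every dataset
    in one project/region pair."""
    return (
        "SELECT table_catalog, table_schema, table_name, table_type "
        f"FROM `{project}.region-{region}.INFORMATION_SCHEMA.TABLES`"
    )

def plan_extraction(projects, regions):
    """Yield (project, region, sql) for every combination, so a
    central job can fan out the queries and aggregate the results."""
    for project in projects:
        for region in regions:
            yield project, region, regional_tables_query(project, region)
```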

Building Semantic Models

Table to Entity Mapping

Transform BigQuery metadata into semantic concepts:

Tables and views become semantic entities:

  • Fact tables become measurable entities
  • Dimension tables become descriptive entities
  • Views may represent pre-calculated metrics

Columns map to attributes:

  • STRING and DATE columns become dimensions
  • INT64, FLOAT64, NUMERIC become measures
  • TIMESTAMP columns become time dimensions
  • ARRAY and STRUCT require special handling
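A first-pass classifier following these heuristics might look like the sketch below. It is deliberately naive: identifier-like INT64 columns (`customer_id`, say) would be reclassified as dimensions in a later pass using naming conventions.

```python
MEASURE_TYPES = {"INT64", "FLOAT64", "NUMERIC", "BIGNUMERIC"}

def classify_column(data_type: str, mode: str = "NULLABLE") -> str:
    """Map a BigQuery column type to a semantic attribute kind."""
    if mode == "REPEATED" or data_type in {"ARRAY", "STRUCT", "RECORD"}:
        return "complex"          # nested/repeated: needs special handling
    if data_type in {"TIMESTAMP", "DATETIME"}:
        return "time_dimension"
    if data_type in MEASURE_TYPES:
        return "measure"
    return "dimension"            # STRING, DATE, BOOL, ...
```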

Relationship Identification

BigQuery lacks enforced foreign keys, requiring inference:

Naming convention analysis:

  • Columns ending in _id often reference other tables
  • Matching column names suggest relationships
  • Prefixes may indicate table relationships

Query pattern analysis:

  • JOIN clauses in query history reveal relationships
  • Frequently joined tables are likely related
  • Subquery patterns show data dependencies

Documentation review:

  • Table and column descriptions may indicate relationships
  • Labels can encode relationship information
  • External documentation supplements metadata
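The naming-convention approach can be sketched as a simple heuristic over extracted schemas. Candidates it produces should be confirmed against query history or documentation before they enter the semantic model:

```python
def infer_relationships(schemas: dict) -> list:
    """Guess foreign-key-style links from column naming conventions.

    schemas maps table name -> list of column names. A column named
    `customer_id` is treated as a candidate reference to a table
    named `customer` or `customers`. Heuristic only.
    """
    candidates = []
    for table, columns in schemas.items():
        for column in columns:
            if not column.endswith("_id"):
                continue
            base = column[:-3]  # "customer_id" -> "customer"
            for target in (base, base + "s"):
                if target in schemas and target != table:
                    candidates.append((table, column, target))
    return candidates
```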

Handling BigQuery-Specific Features

Address unique BigQuery capabilities:

Partitioning:

  • Understand partition columns for time-based queries
  • Map to time dimensions in semantic layer
  • Optimize semantic queries using partition pruning

Nested and repeated fields:

  • ARRAY fields require unnesting in queries
  • STRUCT fields provide hierarchical attributes
  • Semantic layer must handle complex types
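For example, a semantic layer answering a question about a repeated field must generate an UNNEST. A sketch, where the `orders` table and its `line_items` ARRAY of STRUCTs are hypothetical:

```sql
-- Each element of the line_items array becomes its own row
SELECT o.order_id, item.sku, item.quantity
FROM `project.dataset.orders` AS o,
  UNNEST(o.line_items) AS item;
```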

Clustering:

  • Leverage clustering for query performance
  • Align semantic model queries with cluster keys

Implementation Guide

Step 1: Configure Authentication

Set up secure access:

# Service account with required roles
# - BigQuery Data Viewer (for table data)
# - BigQuery Metadata Viewer (for metadata)
# - BigQuery Job User (for queries)

from google.cloud import bigquery
client = bigquery.Client()

Step 2: Enumerate Datasets

Discover available datasets:

# List all datasets in a project
datasets = list(client.list_datasets())

for dataset in datasets:
    print(f"Dataset: {dataset.dataset_id}")
    tables = list(client.list_tables(dataset.reference))
    for table in tables:
        print(f"  Table: {table.table_id}")

Step 3: Extract Table Metadata

Get detailed table information:

# Get table schema and metadata
table = client.get_table("project.dataset.table_name")

schema_info = {
    "table_name": table.table_id,
    "description": table.description,
    "num_rows": table.num_rows,
    "columns": [
        {
            "name": field.name,
            "type": field.field_type,
            "mode": field.mode,
            "description": field.description
        }
        for field in table.schema
    ],
    "partitioning": table.time_partitioning,
    "clustering": table.clustering_fields
}

Step 4: Transform to Semantic Model

Map extracted metadata to semantic definitions:

entity: Orders
  source: project.analytics.orders
  description: "Customer order transactions"

  dimensions:
    - name: order_date
      source_column: order_timestamp
      type: time
      grain: day

    - name: customer_segment
      source_column: customer_segment
      type: categorical

  measures:
    - name: total_revenue
      calculation: SUM(order_total)
      format: currency

    - name: order_count
      calculation: COUNT(DISTINCT order_id)

Step 5: Schedule Synchronization

Maintain metadata currency:

  • Daily extraction for stable schemas
  • Triggered extraction for schema changes
  • Validation after each extraction
  • Alerting on significant changes

Common Challenges

Cross-Region Metadata

BigQuery datasets are regional:

Solutions:

  • Extract from each region separately
  • Aggregate metadata centrally
  • Handle region-specific query routing

Complex Nested Schemas

Deeply nested STRUCT and ARRAY types:

Solutions:

  • Flatten to accessible columns
  • Create semantic layer functions for nested access
  • Document nested structure handling

Missing Descriptions

Many BigQuery objects lack documentation:

Solutions:

  • Implement description requirements
  • Generate AI-suggested descriptions
  • Layer documentation in semantic model
  • Encourage source documentation

The Value of BigQuery Metadata Extraction

Automated metadata extraction from BigQuery delivers:

Complete visibility: Understand all datasets, tables, and columns.

Relationship discovery: Infer connections between tables.

Usage intelligence: Learn from query patterns.

Continuous synchronization: Keep semantic layers current.

Organizations that automate BigQuery metadata extraction build semantic layers that accurately reflect their data warehouse, enabling AI-powered analytics that deliver consistent, trustworthy insights across the enterprise.

Questions

What metadata does BigQuery's INFORMATION_SCHEMA provide?

BigQuery INFORMATION_SCHEMA provides comprehensive metadata including table schemas, column definitions, partitioning information, clustering details, table options, views, routines, and access controls. Regional INFORMATION_SCHEMA views provide project-wide visibility, while dataset-level views show detailed object information.
