Metadata Extraction from BigQuery: Schema Intelligence for Semantic Layers

Learn how to extract metadata from Google BigQuery for building semantic layers, including INFORMATION_SCHEMA queries, dataset discovery, and relationship identification for AI-powered analytics.


Metadata extraction from BigQuery is the process of programmatically reading dataset structures, table definitions, column information, and query patterns from Google BigQuery to build intelligent semantic layers. BigQuery's INFORMATION_SCHEMA provides comprehensive access to metadata, enabling automated discovery of your data warehouse structure and the relationships within it.

This extracted metadata forms the foundation for AI-powered analytics, allowing semantic layers to translate business questions into accurate SQL queries against your BigQuery data.

BigQuery Metadata Structure

Project and Dataset Organization

BigQuery organizes data hierarchically:

Organization
└── Project
    └── Dataset
        └── Tables, Views, Routines

Metadata extraction must traverse this hierarchy to build a complete picture.

Available Metadata

BigQuery exposes extensive metadata:

  • Project and dataset information
  • Table schemas and column definitions
  • Partitioning and clustering configuration
  • View definitions and materialized views
  • Routine (function and procedure) definitions
  • Table statistics and storage information
  • Query history and access patterns

Codd AI Integrations connect to BigQuery to extract this metadata automatically, building semantic layers that understand your entire data warehouse structure.

INFORMATION_SCHEMA Queries

Dataset-Level Metadata

Query metadata within a specific dataset:

-- List all tables in a dataset
SELECT table_name, table_type, creation_time
FROM `project.dataset.INFORMATION_SCHEMA.TABLES`;

-- Row counts and storage sizes (legacy __TABLES__ view;
-- the standard TABLES view does not expose these)
SELECT table_id, row_count, size_bytes
FROM `project.dataset.__TABLES__`;

-- Get column details
SELECT table_name, column_name, ordinal_position,
       is_nullable, data_type, column_default
FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
ORDER BY table_name, ordinal_position;

-- Partitioning information
SELECT table_name, partition_id, total_rows,
       total_logical_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`;

-- Table options including descriptions
SELECT table_name, option_name, option_value
FROM `project.dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE option_name = 'description';

Dataset-level queries provide detailed object information.
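In an extraction script, these statements are typically generated per dataset and submitted through a client library. A minimal sketch of the query-building step (the project and dataset names are placeholders, and the `client.query` call shown in the comment assumes the `google-cloud-bigquery` package):

```python
def columns_query(project: str, dataset: str) -> str:
    """Build the INFORMATION_SCHEMA.COLUMNS query for one dataset.

    Pure string construction, so the generated SQL can be logged or
    inspected before it is sent to BigQuery.
    """
    return (
        "SELECT table_name, column_name, ordinal_position, "
        "is_nullable, data_type "
        f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
        "ORDER BY table_name, ordinal_position"
    )

# With the google-cloud-bigquery client, the query would run as:
#   rows = client.query(columns_query("my-project", "analytics")).result()
```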

Regional Metadata Views

Query metadata across datasets in a region:

-- All tables in a region
SELECT table_catalog, table_schema, table_name,
       table_type, creation_time
FROM `project.region-us.INFORMATION_SCHEMA.TABLES`;

-- Column descriptions across datasets
SELECT table_schema, table_name, column_name, description
FROM `project.region-us.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE description IS NOT NULL;

-- View definitions
SELECT table_name, view_definition
FROM `project.region-us.INFORMATION_SCHEMA.VIEWS`;

Regional views enable project-wide metadata discovery.

Query Pattern Analysis

Understand how data is used:

-- Recent query patterns
SELECT
  user_email,
  query,
  referenced_tables,
  total_bytes_processed
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE';

Query history reveals actual usage patterns and common joins.
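One way to surface those common joins is to count how often pairs of tables appear in the same job's `referenced_tables` array. A sketch (the self-join on the unnested array is the key idea; thresholds and time window are illustrative):

```sql
-- Tables frequently queried together: candidate relationships
SELECT
  t1.table_id AS table_a,
  t2.table_id AS table_b,
  COUNT(*) AS co_occurrences
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`,
  UNNEST(referenced_tables) AS t1,
  UNNEST(referenced_tables) AS t2
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
  AND t1.table_id < t2.table_id  -- each pair counted once
GROUP BY table_a, table_b
ORDER BY co_occurrences DESC
LIMIT 20;
```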

Extraction Strategies

Full Discovery

Initial extraction captures complete structure:

Discovery workflow:

  1. Enumerate all accessible datasets
  2. Query INFORMATION_SCHEMA for each dataset
  3. Extract table and column metadata
  4. Capture partitioning and clustering details
  5. Retrieve descriptions and labels
  6. Analyze query patterns for relationships

Output: Comprehensive catalog of BigQuery objects.

Incremental Updates

Ongoing extraction focuses on changes:

-- Find recently modified tables (__TABLES__ exposes
-- last_modified_time as epoch milliseconds)
SELECT table_id, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `project.dataset.__TABLES__`
WHERE TIMESTAMP_MILLIS(last_modified_time) > TIMESTAMP(@last_extraction);

Change detection approaches:

  • Compare table modification times
  • Monitor dataset activity
  • Track schema version changes
  • Use Cloud Logging for DDL events
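The comparison step itself can stay independent of BigQuery: snapshot a mapping of table name to last-modified timestamp on each run and diff consecutive snapshots. A minimal sketch:

```python
def detect_changes(previous: dict, current: dict) -> dict:
    """Diff two snapshots mapping table name -> last_modified timestamp.

    Returns the tables that were added, dropped, or modified since the
    previous extraction run.
    """
    prev_keys, curr_keys = set(previous), set(current)
    return {
        "added": sorted(curr_keys - prev_keys),
        "dropped": sorted(prev_keys - curr_keys),
        "modified": sorted(
            t for t in prev_keys & curr_keys if current[t] > previous[t]
        ),
    }
```

Only the tables in `added` and `modified` need re-extraction, which keeps incremental runs cheap even across large projects.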

Cross-Project Extraction

Enterprise environments span multiple projects:

Approach:

  • Use organization-level service account
  • Iterate through accessible projects
  • Query regional INFORMATION_SCHEMA per project
  • Aggregate metadata centrally

Considerations:

  • IAM permissions across projects
  • Network connectivity
  • Extraction scheduling
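The fan-out can be planned as data before any query runs: enumerate every project/region pair and generate the regional INFORMATION_SCHEMA query for each. A sketch (project and region names are placeholders; in practice the project list might come from the client's `list_projects()` call):

```python
def regional_tables_query(project: str, region: str) -> str:
    """Regional INFORMATION_SCHEMA query covering every dataset
    in one project/region pair."""
    return (
        "SELECT table_catalog, table_schema, table_name, table_type "
        f"FROM `{project}.region-{region}.INFORMATION_SCHEMA.TABLES`"
    )

def plan_extraction(projects, regions):
    """Yield (project, region, sql) for every combination, so a
    central job can fan out the queries and aggregate the results."""
    for project in projects:
        for region in regions:
            yield project, region, regional_tables_query(project, region)
```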

Building Semantic Models

Table to Entity Mapping

Transform BigQuery metadata into semantic concepts:

Tables and views become semantic entities:

  • Fact tables become measurable entities
  • Dimension tables become descriptive entities
  • Views may represent pre-calculated metrics

Columns map to attributes:

  • STRING and DATE columns become dimensions
  • INT64, FLOAT64, NUMERIC become measures
  • TIMESTAMP columns become time dimensions
  • ARRAY and STRUCT require special handling
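A first-pass classifier following these heuristics might look like the sketch below. It is deliberately naive: identifier-like INT64 columns (`customer_id`, say) would be reclassified as dimensions in a later pass using naming conventions.

```python
MEASURE_TYPES = {"INT64", "FLOAT64", "NUMERIC", "BIGNUMERIC"}

def classify_column(data_type: str, mode: str = "NULLABLE") -> str:
    """Map a BigQuery column type to a semantic attribute kind."""
    if mode == "REPEATED" or data_type in {"ARRAY", "STRUCT", "RECORD"}:
        return "complex"          # nested/repeated: needs special handling
    if data_type in {"TIMESTAMP", "DATETIME"}:
        return "time_dimension"
    if data_type in MEASURE_TYPES:
        return "measure"
    return "dimension"            # STRING, DATE, BOOL, ...
```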

Relationship Identification

BigQuery lacks enforced foreign keys, requiring inference:

Naming convention analysis:

  • Columns ending in _id often reference other tables
  • Matching column names suggest relationships
  • Prefixes may indicate table relationships

Query pattern analysis:

  • JOIN clauses in query history reveal relationships
  • Frequently joined tables are likely related
  • Subquery patterns show data dependencies

Documentation review:

  • Table and column descriptions may indicate relationships
  • Labels can encode relationship information
  • External documentation supplements metadata
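The naming-convention approach can be sketched as a simple heuristic over extracted schemas. Candidates it produces should be confirmed against query history or documentation before they enter the semantic model:

```python
def infer_relationships(schemas: dict) -> list:
    """Guess foreign-key-style links from column naming conventions.

    schemas maps table name -> list of column names. A column named
    `customer_id` is treated as a candidate reference to a table
    named `customer` or `customers`. Heuristic only.
    """
    candidates = []
    for table, columns in schemas.items():
        for column in columns:
            if not column.endswith("_id"):
                continue
            base = column[:-3]  # "customer_id" -> "customer"
            for target in (base, base + "s"):
                if target in schemas and target != table:
                    candidates.append((table, column, target))
    return candidates
```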

Handling BigQuery-Specific Features

Address unique BigQuery capabilities:

Partitioning:

  • Understand partition columns for time-based queries
  • Map to time dimensions in semantic layer
  • Optimize semantic queries using partition pruning

Nested and repeated fields:

  • ARRAY fields require unnesting in queries
  • STRUCT fields provide hierarchical attributes
  • Semantic layer must handle complex types
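For example, a semantic layer answering a question about a repeated field must generate an UNNEST. A sketch, where the `orders` table and its `line_items` ARRAY of STRUCTs are hypothetical:

```sql
-- Each element of the line_items array becomes its own row
SELECT o.order_id, item.sku, item.quantity
FROM `project.dataset.orders` AS o,
  UNNEST(o.line_items) AS item;
```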

Clustering:

  • Leverage clustering for query performance
  • Align semantic model queries with cluster keys

Implementation Guide

Step 1: Configure Authentication

Set up secure access:

# Service account with required roles
# - BigQuery Data Viewer (for table data)
# - BigQuery Metadata Viewer (for metadata)
# - BigQuery Job User (for queries)

from google.cloud import bigquery
client = bigquery.Client()

Step 2: Enumerate Datasets

Discover available datasets:

# List all datasets in a project
datasets = list(client.list_datasets())

for dataset in datasets:
    print(f"Dataset: {dataset.dataset_id}")
    tables = list(client.list_tables(dataset.reference))
    for table in tables:
        print(f"  Table: {table.table_id}")

Step 3: Extract Table Metadata

Get detailed table information:

# Get table schema and metadata
table = client.get_table("project.dataset.table_name")

schema_info = {
    "table_name": table.table_id,
    "description": table.description,
    "num_rows": table.num_rows,
    "columns": [
        {
            "name": field.name,
            "type": field.field_type,
            "mode": field.mode,
            "description": field.description
        }
        for field in table.schema
    ],
    "partitioning": table.time_partitioning,
    "clustering": table.clustering_fields
}

Step 4: Transform to Semantic Model

Map extracted metadata to semantic definitions:

entity: Orders
  source: project.analytics.orders
  description: "Customer order transactions"

  dimensions:
    - name: order_date
      source_column: order_timestamp
      type: time
      grain: day

    - name: customer_segment
      source_column: customer_segment
      type: categorical

  measures:
    - name: total_revenue
      calculation: SUM(order_total)
      format: currency

    - name: order_count
      calculation: COUNT(DISTINCT order_id)

Step 5: Schedule Synchronization

Maintain metadata currency:

  • Daily extraction for stable schemas
  • Triggered extraction for schema changes
  • Validation after each extraction
  • Alerting on significant changes

Common Challenges

Cross-Region Metadata

BigQuery datasets are regional:

Solutions:

  • Extract from each region separately
  • Aggregate metadata centrally
  • Handle region-specific query routing

Complex Nested Schemas

Deeply nested STRUCT and ARRAY types:

Solutions:

  • Flatten to accessible columns
  • Create semantic layer functions for nested access
  • Document nested structure handling

Missing Descriptions

Many BigQuery objects lack documentation:

Solutions:

  • Implement description requirements
  • Generate AI-suggested descriptions
  • Layer documentation in semantic model
  • Encourage source documentation

The Value of BigQuery Metadata Extraction

Automated metadata extraction from BigQuery delivers:

Complete visibility: Understand all datasets, tables, and columns.

Relationship discovery: Infer connections between tables.

Usage intelligence: Learn from query patterns.

Continuous synchronization: Keep semantic layers current.

Organizations that automate BigQuery metadata extraction build semantic layers that accurately reflect their data warehouse, enabling AI-powered analytics that deliver consistent, trustworthy insights across the enterprise.

Questions

What metadata does BigQuery's INFORMATION_SCHEMA provide?

BigQuery INFORMATION_SCHEMA provides comprehensive metadata including table schemas, column definitions, partitioning information, clustering details, table options, views, routines, and access controls. Regional INFORMATION_SCHEMA views provide project-wide visibility, while dataset-level views show detailed object information.
