Metadata Extraction from BigQuery: Schema Intelligence for Semantic Layers
Learn how to extract metadata from Google BigQuery for building semantic layers, including INFORMATION_SCHEMA queries, dataset discovery, and relationship identification for AI-powered analytics.
Metadata extraction from BigQuery is the process of programmatically reading dataset structures, table definitions, column information, and query patterns from Google BigQuery to build intelligent semantic layers. BigQuery's INFORMATION_SCHEMA provides comprehensive access to metadata, enabling automated discovery of your data warehouse structure and the relationships within it.
This extracted metadata forms the foundation for AI-powered analytics, allowing semantic layers to translate business questions into accurate SQL queries against your BigQuery data.
BigQuery Metadata Structure
Project and Dataset Organization
BigQuery organizes data hierarchically:
Organization
└── Project
└── Dataset
└── Tables, Views, Routines
Metadata extraction must traverse this hierarchy to build a complete picture.
Available Metadata
BigQuery exposes extensive metadata:
- Project and dataset information
- Table schemas and column definitions
- Partitioning and clustering configuration
- View definitions and materialized views
- Routine (function and procedure) definitions
- Table statistics and storage information
- Query history and access patterns
Codd AI Integrations connect to BigQuery to extract this metadata automatically, building semantic layers that understand your entire data warehouse structure.
INFORMATION_SCHEMA Queries
Dataset-Level Metadata
Query metadata within a specific dataset:
-- List all tables in a dataset
SELECT table_name, table_type, creation_time
FROM `project.dataset.INFORMATION_SCHEMA.TABLES`;
-- Row counts and storage (TABLES does not expose these;
-- use the legacy __TABLES__ view or region-level TABLE_STORAGE)
SELECT table_id, row_count, size_bytes
FROM `project.dataset.__TABLES__`;
-- Get column details
SELECT table_name, column_name, ordinal_position,
is_nullable, data_type, column_default
FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
ORDER BY table_name, ordinal_position;
-- Partitioning information
SELECT table_name, partition_id, total_rows,
total_logical_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`;
-- Table options including descriptions
SELECT table_name, option_name, option_value
FROM `project.dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE option_name = 'description';
Dataset-level queries provide detailed object information.
Regional Metadata Views
Query metadata across datasets in a region:
-- All tables in a region
SELECT table_catalog, table_schema, table_name,
table_type, creation_time
FROM `project.region-us.INFORMATION_SCHEMA.TABLES`;
-- Column descriptions across datasets
SELECT table_schema, table_name, column_name, description
FROM `project.region-us.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE description IS NOT NULL;
-- View definitions
SELECT table_name, view_definition
FROM `project.region-us.INFORMATION_SCHEMA.VIEWS`;
Regional views enable project-wide metadata discovery.
Query Pattern Analysis
Understand how data is used:
-- Recent query patterns
SELECT
user_email,
query,
referenced_tables,
total_bytes_processed
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND job_type = 'QUERY'
AND state = 'DONE';
Query history reveals actual usage patterns and common joins.
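The referenced_tables column returned by the query above can be mined directly for join candidates: tables that repeatedly appear in the same job are likely related. A minimal sketch in Python, assuming the job rows have already been fetched into dicts (the sample data here is hypothetical):

```python
from collections import Counter
from itertools import combinations

def frequent_table_pairs(jobs, min_count=2):
    """Count how often pairs of tables are referenced in the same query
    job; frequently co-referenced tables are join candidates."""
    pair_counts = Counter()
    for job in jobs:
        tables = sorted({t["table_id"] for t in job.get("referenced_tables", [])})
        for pair in combinations(tables, 2):
            pair_counts[pair] += 1
    return [(pair, n) for pair, n in pair_counts.most_common() if n >= min_count]

# Rows shaped like JOBS_BY_PROJECT results (hypothetical data):
jobs = [
    {"referenced_tables": [{"table_id": "orders"}, {"table_id": "customers"}]},
    {"referenced_tables": [{"table_id": "orders"}, {"table_id": "customers"}]},
    {"referenced_tables": [{"table_id": "orders"}, {"table_id": "products"}]},
]
print(frequent_table_pairs(jobs))  # [(('customers', 'orders'), 2)]
```

The same counter can feed the relationship-inference step later: pairs above a threshold become candidate joins in the semantic model.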
Extraction Strategies
Full Discovery
Initial extraction captures complete structure:
Discovery workflow:
- Enumerate all accessible datasets
- Query INFORMATION_SCHEMA for each dataset
- Extract table and column metadata
- Capture partitioning and clustering details
- Retrieve descriptions and labels
- Analyze query patterns for relationships
Output: Comprehensive catalog of BigQuery objects.
Incremental Updates
Ongoing extraction focuses on changes:
-- Find recently modified tables (INFORMATION_SCHEMA.TABLES has no
-- last_modified_time column; __TABLES__ stores it as epoch milliseconds)
SELECT table_id, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `project.dataset.__TABLES__`
WHERE TIMESTAMP_MILLIS(last_modified_time) > @last_extraction;
Change detection approaches:
- Compare table modification times
- Monitor dataset activity
- Track schema version changes
- Use Cloud Logging for DDL events
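The comparison-based approaches above reduce to a snapshot diff between extraction runs. A sketch, assuming each run stores a mapping of table name to last-modified timestamp:

```python
def detect_schema_changes(previous, current):
    """Diff two {table_name: last_modified} snapshots from successive
    extraction runs, classifying tables as added, dropped, or modified."""
    prev_keys, curr_keys = set(previous), set(current)
    return {
        "added": sorted(curr_keys - prev_keys),
        "dropped": sorted(prev_keys - curr_keys),
        "modified": sorted(
            t for t in prev_keys & curr_keys if current[t] > previous[t]
        ),
    }

# Hypothetical snapshots from two consecutive runs:
previous = {"orders": "2024-05-01T00:00:00", "legacy": "2024-01-01T00:00:00"}
current = {"orders": "2024-05-02T08:30:00", "customers": "2024-05-02T09:00:00"}
print(detect_schema_changes(previous, current))
# {'added': ['customers'], 'dropped': ['legacy'], 'modified': ['orders']}
```

Only tables flagged here need re-extraction, keeping incremental runs cheap.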
Cross-Project Extraction
Enterprise environments span multiple projects:
Approach:
- Use organization-level service account
- Iterate through accessible projects
- Query regional INFORMATION_SCHEMA per project
- Aggregate metadata centrally
Considerations:
- IAM permissions across projects
- Network connectivity
- Extraction scheduling
Building Semantic Models
Table to Entity Mapping
Transform BigQuery metadata into semantic concepts:
Tables and views become semantic entities:
- Fact tables become measurable entities
- Dimension tables become descriptive entities
- Views may represent pre-calculated metrics
Columns map to attributes:
- STRING and DATE columns become dimensions
- INT64, FLOAT64, NUMERIC become measures
- TIMESTAMP columns become time dimensions
- ARRAY and STRUCT require special handling
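The type-mapping heuristics above can be expressed as a small classifier. This is a sketch: numeric key columns are treated as dimensions rather than measures, and generated roles still need human review:

```python
def classify_column(name, field_type):
    """Map a BigQuery column to a semantic-layer role using simple heuristics."""
    t = field_type.upper()
    if t in ("TIMESTAMP", "DATETIME"):
        return "time_dimension"
    if t in ("INT64", "FLOAT64", "NUMERIC", "BIGNUMERIC"):
        # Numeric identifiers are keys, not quantities to aggregate
        return "dimension" if name.endswith("_id") else "measure"
    if t in ("ARRAY", "STRUCT", "RECORD"):
        return "complex"  # needs special handling (UNNEST, field paths)
    return "dimension"  # STRING, DATE, BOOL, etc.

print(classify_column("order_timestamp", "TIMESTAMP"))  # time_dimension
print(classify_column("order_total", "NUMERIC"))        # measure
print(classify_column("customer_id", "INT64"))          # dimension
```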
Relationship Identification
BigQuery does not enforce foreign keys (declared constraints are informational only), so relationships must be inferred:
Naming convention analysis:
- Columns ending in _id often reference other tables
- Matching column names suggest relationships
- Prefixes may indicate table relationships
Query pattern analysis:
- JOIN clauses in query history reveal relationships
- Frequently joined tables are likely related
- Subquery patterns show data dependencies
Documentation review:
- Table and column descriptions may indicate relationships
- Labels can encode relationship information
- External documentation supplements metadata
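The naming-convention heuristic can be sketched as a single pass over extracted schemas (table and column names here are illustrative):

```python
def infer_relationships(schemas):
    """Guess foreign-key relationships from `_id` naming conventions:
    orders.customer_id likely references a customers table."""
    table_names = set(schemas)
    found = []
    for table, columns in schemas.items():
        for col in columns:
            if not col.endswith("_id"):
                continue
            base = col[: -len("_id")]
            # Try singular and naive-plural forms of the referenced table
            for candidate in (base, base + "s"):
                if candidate in table_names and candidate != table:
                    found.append((table, col, candidate))
    return found

schemas = {
    "orders": ["order_id", "customer_id", "order_total"],
    "customers": ["customer_id", "customer_segment"],
}
print(infer_relationships(schemas))  # [('orders', 'customer_id', 'customers')]
```

Candidates found this way should be cross-checked against the query-history joins before being promoted to relationships in the semantic model.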
Handling BigQuery-Specific Features
Address unique BigQuery capabilities:
Partitioning:
- Understand partition columns for time-based queries
- Map to time dimensions in semantic layer
- Optimize semantic queries using partition pruning
Nested and repeated fields:
- ARRAY fields require unnesting in queries
- STRUCT fields provide hierarchical attributes
- Semantic layer must handle complex types
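One way to make nested fields tractable is to flatten STRUCTs into dotted field paths, mirroring what COLUMN_FIELD_PATHS reports. A sketch over plain dicts (the BigQuery client's SchemaField objects expose the same name/type/fields shape):

```python
def flatten_fields(fields, prefix=""):
    """Recursively flatten nested STRUCT/RECORD fields into
    (dotted_path, type) pairs; leaf fields keep their declared type."""
    paths = []
    for f in fields:
        path = prefix + f["name"]
        if f.get("fields"):  # STRUCT: descend into child fields
            paths.extend(flatten_fields(f["fields"], path + "."))
        else:
            paths.append((path, f["type"]))
    return paths

schema = [
    {"name": "order_id", "type": "STRING"},
    {"name": "customer", "type": "RECORD", "fields": [
        {"name": "id", "type": "INT64"},
        {"name": "address", "type": "RECORD", "fields": [
            {"name": "city", "type": "STRING"},
        ]},
    ]},
]
print(flatten_fields(schema))
# [('order_id', 'STRING'), ('customer.id', 'INT64'), ('customer.address.city', 'STRING')]
```

Each flattened path can then be classified like an ordinary column; repeated (ARRAY) fields additionally need UNNEST in generated queries.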
Clustering:
- Leverage clustering for query performance
- Align semantic model queries with cluster keys
Implementation Guide
Step 1: Configure Authentication
Set up secure access:
# Service account with required roles
# - BigQuery Data Viewer (for table data)
# - BigQuery Metadata Viewer (for metadata)
# - BigQuery Job User (for queries)
from google.cloud import bigquery
client = bigquery.Client()
Step 2: Enumerate Datasets
Discover available datasets:
# List all datasets in a project
datasets = list(client.list_datasets())
for dataset in datasets:
    print(f"Dataset: {dataset.dataset_id}")
    tables = list(client.list_tables(dataset.reference))
    for table in tables:
        print(f"  Table: {table.table_id}")
Step 3: Extract Table Metadata
Get detailed table information:
# Get table schema and metadata
table = client.get_table("project.dataset.table_name")
schema_info = {
    "table_name": table.table_id,
    "description": table.description,
    "num_rows": table.num_rows,
    "columns": [
        {
            "name": field.name,
            "type": field.field_type,
            "mode": field.mode,
            "description": field.description,
        }
        for field in table.schema
    ],
    "partitioning": table.time_partitioning,
    "clustering": table.clustering_fields,
}
Step 4: Transform to Semantic Model
Map extracted metadata to semantic definitions:
entity: Orders
source: project.analytics.orders
description: "Customer order transactions"
dimensions:
  - name: order_date
    source_column: order_timestamp
    type: time
    grain: day
  - name: customer_segment
    source_column: customer_segment
    type: categorical
measures:
  - name: total_revenue
    calculation: SUM(order_total)
    format: currency
  - name: order_count
    calculation: COUNT(DISTINCT order_id)
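A first draft of this mapping can be generated automatically from the schema_info structure extracted in Step 3. A sketch, using the simple column-role heuristics from earlier (generated definitions should always be reviewed by hand):

```python
def to_semantic_entity(schema_info):
    """Turn extracted table metadata into a draft semantic entity:
    numeric non-key columns become SUM measures, the rest dimensions."""
    dimensions, measures = [], []
    for col in schema_info["columns"]:
        t = col["type"].upper()
        if t in ("INT64", "FLOAT64", "NUMERIC") and not col["name"].endswith("_id"):
            measures.append({
                "name": f"total_{col['name']}",
                "calculation": f"SUM({col['name']})",
            })
        else:
            kind = "time" if t in ("TIMESTAMP", "DATETIME") else "categorical"
            dimensions.append({"name": col["name"], "type": kind})
    return {
        "entity": schema_info["table_name"],
        "description": schema_info.get("description"),
        "dimensions": dimensions,
        "measures": measures,
    }

# Hypothetical schema_info in the shape produced by Step 3:
schema_info = {
    "table_name": "orders",
    "description": "Customer order transactions",
    "columns": [
        {"name": "order_id", "type": "STRING"},
        {"name": "order_timestamp", "type": "TIMESTAMP"},
        {"name": "order_total", "type": "NUMERIC"},
    ],
}
entity = to_semantic_entity(schema_info)
print(entity["measures"])
# [{'name': 'total_order_total', 'calculation': 'SUM(order_total)'}]
```

The resulting dict serializes directly to the YAML shown above.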
Step 5: Schedule Synchronization
Maintain metadata currency:
- Daily extraction for stable schemas
- Triggered extraction for schema changes
- Validation after each extraction
- Alerting on significant changes
Common Challenges
Cross-Region Metadata
BigQuery datasets are regional:
Solutions:
- Extract from each region separately
- Aggregate metadata centrally
- Handle region-specific query routing
Complex Nested Schemas
Deeply nested STRUCT and ARRAY types:
Solutions:
- Flatten to accessible columns
- Create semantic layer functions for nested access
- Document nested structure handling
Missing Descriptions
Many BigQuery objects lack documentation:
Solutions:
- Implement description requirements
- Generate AI-suggested descriptions
- Layer documentation in semantic model
- Encourage source documentation
The Value of BigQuery Metadata Extraction
Automated metadata extraction from BigQuery delivers:
Complete visibility: Understand all datasets, tables, and columns.
Relationship discovery: Infer connections between tables.
Usage intelligence: Learn from query patterns.
Continuous synchronization: Keep semantic layers current.
Organizations that automate BigQuery metadata extraction build semantic layers that accurately reflect their data warehouse, enabling AI-powered analytics that deliver consistent, trustworthy insights across the enterprise.
Questions
What metadata does BigQuery's INFORMATION_SCHEMA provide?
BigQuery INFORMATION_SCHEMA provides comprehensive metadata including table schemas, column definitions, partitioning information, clustering details, table options, views, routines, and access controls. Regional INFORMATION_SCHEMA views provide project-wide visibility, while dataset-level views show detailed object information.