Overview: Production Natural Language to SQL Architecture
Natural language to SQL (NL2SQL), also known as text-to-SQL, is a critical capability for modern data platforms. This guide presents a production-ready workflow that separates two challenging problems: retrieving the right schema context from your data warehouse, and generating validated SQL queries that answer natural language questions accurately.
This NL2SQL architecture leverages hybrid retrieval (combining BM25 lexical search with vector embeddings), constrained decoding for SQL generation using large language models (LLMs), and multi-layer validation to ensure query safety and correctness. The approach works with any SQL dialect including Snowflake, BigQuery, PostgreSQL, and Redshift.
Key Components of Production NL2SQL Systems
- Hybrid Schema Retrieval: BM25 + dense embeddings + cross-encoder reranking
- Plan-First Generation: Break complex questions into structured query plans
- Constrained Decoding: Generate only syntactically valid SQL tokens
- Multi-Layer Validation: Parsing, linting, safety checks, and repair loops
- Safe Execution: Read-only access, timeouts, row limits, partition filtering
- Provenance Tracking: Return SQL, tables used, and clarifications for auditability
The Demo Dataset: Snowflake Schema for NL2SQL Examples
Throughout this natural language to SQL tutorial, we use a realistic Snowflake-style schema representing e-commerce operations. This dataset demonstrates common NL2SQL challenges including multi-table joins, time-based filtering, customer segmentation, and support ticket analysis.
Schema Overview
- sales.orders: Core order data with order_id, customer_id, order_ts, marketing_channel, region, total_amount
- sales.order_items: Line-level detail with order_id, sku, quantity, price
- crm.customers: Customer attributes including customer_id, first_order_ts, is_first_time_buyer, status
- support.tickets: Support history with ticket_id, customer_id, created_ts, category, resolved_ts
The running example question for this guide:
"Among first-time buyers in Q3 2025, how many churned within 30 days after their second order? Break down the results by marketing channel and region, and show common support ticket categories."
This question requires:
- Multi-table joins across crm.customers, sales.orders, and support.tickets
- Window functions to identify the second order per customer
- Time-based filtering and date arithmetic
- Aggregation by multiple dimensions
- Array operations for support categories
Stage 1: Hybrid Retrieval for Schema Discovery in NL2SQL
Hybrid retrieval is the foundation of accurate NL2SQL systems. By combining BM25 lexical search with dense vector embeddings and a cross-encoder reranker, you can retrieve the most relevant table schemas for any natural language question.
Building the Schema Index for Text-to-SQL
Create a searchable index containing:
- Table names and descriptions
- Column names with data types
- Column-level comments and business definitions
- Lineage information and foreign key relationships
- Sample queries and common use cases
Hybrid Search Process
- BM25 Lexical Search: Find exact keyword matches in table/column names and descriptions
- Dense Vector Search: Use embeddings (e.g., OpenAI text-embedding-3, Cohere embed-v3) to find semantically similar schemas
- Score Fusion: Normalize and combine BM25 and vector scores using reciprocal rank fusion or weighted averaging
- Cross-Encoder Reranking: Apply a cross-encoder model to rerank top candidates based on question-schema relevance
For the example question about first-time buyer churn, hybrid retrieval typically returns crm.customers, sales.orders, and support.tickets with high confidence scores. The system then fetches complete schemas (DDL, constraints, indexes) only for these high-scoring tables, keeping the LLM context compact and relevant.
- Normalize scores: BM25 and cosine similarity operate on different scales—use min-max normalization or reciprocal rank fusion before combining
- Keep context concise: Return a factsheet per table including grain, primary keys, useful filters, and a sample SELECT statement (max 5-10 rows)
- Cache embeddings: Pre-compute and cache table/column embeddings to reduce latency
- Monitor retrieval quality: Track whether retrieved tables are actually used in final SQL (precision) and whether necessary tables are included (recall)
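The score-fusion step can be sketched with reciprocal rank fusion, which uses only rank positions and so sidesteps the scale mismatch between BM25 and cosine similarity entirely. A minimal sketch (the k=60 damping constant is the conventional default, not tuned for any particular system):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (e.g. BM25 and vector results) by rank position.

    Because only ranks are used, no score normalization is needed.
    k=60 is the conventional damping constant from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A table that ranks well in both lists (e.g. a table that is second by BM25 and first by vector search) outranks one that tops only a single list.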
Stage 2: Plan-Based SQL Generation for Complex Queries
Before generating SQL, ask the LLM to produce a step-by-step query plan. This planning phase dramatically improves accuracy for complex natural language questions by explicitly reasoning about joins, filters, aggregations, and edge cases.
What a Good SQL Query Plan Contains
- Table selection: Which tables to query and why
- Join strategy: How tables relate (keys, join type, join conditions)
- Filter logic: WHERE clauses, date ranges, status conditions
- Derived fields: Calculations, window functions, case statements
- Aggregation: GROUP BY dimensions and aggregate functions
- Ordering and limits: Sort order and result size constraints
- Sanity check: One-line description of expected result grain (e.g., "one row per marketing channel per region")
Example Query Plan for First-Time Buyer Churn Analysis
- Filter crm.customers to first-time buyers (is_first_time_buyer = TRUE) with their first order in Q3 2025
- Compute each customer's second order timestamp using a ROW_NUMBER() window function over sales.orders
- Define churn as status = 'inactive' within 30 days after the second order date
- Join to sales.orders to get marketing_channel and region for the second order
- Group results by marketing_channel and region to count churned customers
- Collect support.tickets.category values for tickets created within 30 days after the second order using ARRAY_AGG()
- Apply ORDER BY churned_count DESC and LIMIT 200 for performance
- Sanity check: the result should have one row per combination of marketing channel, region, and support category set
When to Use Sample Queries
If actual data values are uncertain (e.g., which values the status column can take), run small sample queries such as SELECT DISTINCT status FROM crm.customers LIMIT 20 to inspect real data. Feed these samples back into the planning phase to refine the logic.
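Building such a profiling query can be sketched as below; the sample_distinct helper and its identifier whitelist are illustrative assumptions, not part of any library:

```python
import re

# simple identifier whitelist: letters, digits, underscores, dots
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*$")

def sample_distinct(table, column, limit=20):
    """Build a small profiling query to inspect a column's actual values."""
    # schema names come from your own metadata, but validating them before
    # string interpolation is cheap defense in depth
    if not (_IDENT.match(table) and _IDENT.match(column)):
        raise ValueError("invalid identifier")
    return f"SELECT DISTINCT {column} FROM {table} LIMIT {int(limit)}"
```

The returned values (e.g., the distinct statuses 'active' and 'inactive') are then pasted into the planning prompt as sample data.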
Stage 3: SQL Generation with Validation and Repair
This stage combines constrained decoding for initial SQL generation with aggressive validation and an automated repair loop to ensure production-quality queries.
Constrained Decoding for SQL Generation
Constrained decoding restricts the LLM to generate only tokens that maintain valid SQL syntax at each step. This technique significantly reduces parsing errors compared to unconstrained generation. Implement using:
- Grammar-based constraints (e.g., LMQL, Guidance, Outlines)
- SQL parser in the decoding loop to validate partial queries
- Beam search with validity scoring
Multi-Layer SQL Validation Pipeline
- Syntax Parsing: Use sqlparse, sqlglot, or dialect-specific parsers to verify syntactic correctness
- Semantic Validation: Check that referenced tables and columns exist in the schema
- Linting: Apply sqlfluff or custom rules for style, readability, and anti-patterns
- Safety Checks: Block DML (INSERT, UPDATE, DELETE), DDL, DROP, TRUNCATE, and dynamic SQL
- Performance Checks: Warn on missing WHERE clauses, Cartesian products, and missing indexes
Automated Repair Loop for NL2SQL
When validation fails, use a specialized repair prompt to fix errors:
You are a SQL repair assistant for {DIALECT}.
Input:
- Original SQL query
- Parser errors: {ERRORS}
- Linter warnings: {WARNINGS}
- Available tables and columns: {SCHEMA}
Output:
- Corrected SQL only
- Keep CTE names and structure unchanged
- Never add DML operations
- Maintain original query intent
Fix the errors and return only the corrected SQL.
Run validation again after repair. Allow up to 3 repair attempts before escalating to human review.
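The validate-then-repair cycle can be sketched as a small orchestration loop; validate and repair are stand-ins for your parser pipeline and the repair-prompt LLM call:

```python
def repair_until_valid(sql, validate, repair, max_attempts=3):
    """Run the validate/repair cycle from the repair prompt above.

    validate(sql) returns a list of error strings (empty means valid);
    repair(sql, errors) returns candidate corrected SQL (e.g. an LLM call).
    Both callables are assumptions standing in for your own components.
    Returns (sql, ok); ok=False means escalate to human review.
    """
    errors = validate(sql)
    attempts = 0
    while errors and attempts < max_attempts:
        sql = repair(sql, errors)
        errors = validate(sql)  # re-validate after every repair
        attempts += 1
    return sql, not errors
```

Capping attempts keeps latency and token cost bounded when a query is simply unrepairable.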
Production SQL Example: First-Time Buyer Churn
Here's the validated SQL query generated for our example question, demonstrating proper CTEs, window functions, joins, and aggregations:
WITH second_order AS (
SELECT
customer_id,
MIN(CASE WHEN rn = 2 THEN order_ts END) AS second_order_ts
FROM (
SELECT
customer_id,
order_ts,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts) AS rn
FROM sales.orders
WHERE order_ts >= '2025-07-01'  -- lower bound only, for partition pruning; second orders after Q3 are kept
) ranked
GROUP BY customer_id
),
buyer_scope AS (
SELECT
c.customer_id,
c.status,
c.first_order_ts,
so.second_order_ts,
o.marketing_channel,
o.region
FROM crm.customers c
JOIN second_order so ON so.customer_id = c.customer_id
JOIN sales.orders o
ON o.customer_id = c.customer_id
AND o.order_ts = so.second_order_ts
WHERE c.is_first_time_buyer = TRUE
AND c.first_order_ts >= '2025-07-01'
AND c.first_order_ts < '2025-10-01'  -- first order in Q3 2025, per the plan
),
churned AS (
SELECT
b.customer_id,
b.marketing_channel,
b.region
FROM buyer_scope b
-- churn proxy: the schema has no status-change timestamp, so the inactive
-- flag stands in for "churned within 30 days of the second order"
WHERE b.status = 'inactive'
AND b.second_order_ts IS NOT NULL
),
support_categories AS (
SELECT
t.customer_id,
ARRAY_AGG(DISTINCT t.category) AS categories
FROM support.tickets t
JOIN buyer_scope b ON b.customer_id = t.customer_id
WHERE t.created_ts >= b.second_order_ts
AND t.created_ts < DATEADD(day, 30, b.second_order_ts)
GROUP BY t.customer_id
)
SELECT
c.marketing_channel,
c.region,
COUNT(*) AS churned_first_time_buyers,
COALESCE(sc.categories, ARRAY_CONSTRUCT()) AS common_support_categories
FROM churned c
LEFT JOIN support_categories sc ON sc.customer_id = c.customer_id
GROUP BY c.marketing_channel, c.region, sc.categories
ORDER BY churned_first_time_buyers DESC
LIMIT 200;
- Use CTEs for readability and debugging; each CTE should have a clear business purpose
- Apply appropriate LIMIT clauses to prevent runaway queries (typically 100-1000 rows)
- Include explicit WHERE clauses with date ranges to leverage partition pruning
- Use qualified column names (table.column) to avoid ambiguity
- Prefer window functions over self-joins for performance
Stage 4: Safe SQL Execution in Production NL2SQL Systems
Security and performance controls are essential when executing LLM-generated SQL in production environments. Never execute untrusted SQL without proper sandboxing and resource limits.
Execution Safety Checklist
- Read-only database role: Grant only SELECT permissions; never INSERT, UPDATE, DELETE, or DDL
- Hard query timeout: Set a statement timeout (e.g., 30 seconds) to prevent resource exhaustion
- Row limit enforcement: Cap results at the application level (e.g., FETCH FIRST 1000 ROWS)
- Connection pooling: Use a separate connection pool with max concurrency limits
- Query cost estimation: Leverage warehouse-specific EXPLAIN or cost APIs to reject expensive queries
- Partition awareness: For partitioned tables, require time-range filters or apply automatic date constraints
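To illustrate the timeout and row-cap ideas in a self-contained way, here is a sketch using SQLite's read-only URI mode and progress handler; a real warehouse would use its own session controls (e.g., Snowflake's STATEMENT_TIMEOUT_IN_SECONDS parameter) rather than this mechanism:

```python
import sqlite3
import time

def execute_readonly(db_path, sql, max_rows=1000, timeout_s=30):
    """Execute LLM-generated SQL with a row cap and a wall-clock timeout.

    Illustrative only: SQLite's mode=ro makes any write fail at the engine
    level, mirroring a read-only warehouse role.
    """
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    deadline = time.monotonic() + timeout_s
    # the progress handler runs every N VM ops; a nonzero return aborts the query
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 10_000
    )
    try:
        cur = conn.execute(sql)
        return cur.fetchmany(max_rows)  # enforce the row cap at fetch time
    finally:
        conn.close()
```

Enforcing the cap at fetch time (rather than trusting a LIMIT in the generated SQL) means even a query that forgot its LIMIT cannot flood the application.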
Handling Large Result Sets
If a query would scan massive partitions:
- Request narrower time window from user (e.g., "last 7 days" instead of "all time")
- Apply warehouse-native sampling (SAMPLE BERNOULLI(1) in Snowflake)
- Suggest an aggregate-first approach (e.g., daily summaries instead of raw events)
- Iterate with user to refine question
Audit Logging for Compliance
Log every executed query with:
- Original natural language question
- Generated SQL query
- User identity and timestamp
- Tables accessed
- Execution duration and row count
- Validation results
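One way to capture these fields is a small structured record serialized to JSON for the log pipeline; the field names below are illustrative assumptions, not a standard:

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class QueryAuditRecord:
    """One audit-log entry per executed query."""
    question: str           # original natural language question
    sql: str                # generated SQL as executed
    user: str               # user identity
    tables: list            # tables accessed
    duration_ms: float      # execution duration
    row_count: int          # rows returned
    validation_passed: bool = True
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```

Structured records make it easy to answer compliance questions later ("who queried crm.customers last quarter?") with a simple log search.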
Stage 5: Answer Rendering with Full Provenance
The final stage transforms raw SQL results into clear, actionable answers while maintaining full provenance for auditability and trust.
Components of a Complete NL2SQL Answer
- Natural language summary: 2-3 sentence narrative answering the question directly
- Key metrics: Highlight the most important numbers (e.g., "234 first-time buyers churned")
- Visualizations: Render charts/tables for breakdowns (e.g., bar chart by marketing channel)
- Exact SQL query: Show the validated SQL for reproducibility
- Tables used: List all tables queried (e.g., crm.customers, sales.orders)
- Clarifications made: Document any assumptions or clarifying questions asked
- Confidence score: Optional confidence indicator based on retrieval scores and validation results
Example Answer for First-Time Buyer Churn
Among first-time buyers who made their first purchase in Q3 2025, 234 customers churned within 30 days of their second order. The highest churn occurred in the Paid Search channel in the West region (87 customers), followed by Social Media in the East (52 customers). Common support ticket categories for churned customers included "Delivery Issues", "Product Quality", and "Refund Request".
Clarifications:
- Q3 2025 defined as July 1 - September 30, 2025
- Churn defined as status = 'inactive' in the customer record
- 30-day window measured from the second order date
Tables used: crm.customers, sales.orders, support.tickets
Effective Prompts for Production NL2SQL Systems
Well-crafted prompts are crucial for reliable natural language to SQL conversion. Here are battle-tested prompts for each stage of the workflow.
1. Schema Retrieval Scoring Prompt
Goal: Assess relevance of each table/column to the user question
You are a database schema relevance evaluator for natural language to SQL conversion.
User Question: {QUESTION}
Table/Column Metadata:
{SCHEMA_SNIPPET}
Task: Rate from 0.0 to 1.0 how useful this schema element is for answering the question.
Focus on business meaning and semantic relevance, not just keyword overlap.
Output JSON format:
{
"relevance_score": 0.85,
"reasoning": "One sentence explanation",
"key_attributes": ["list", "of", "useful", "columns"]
}
2. Query Planning Prompt
Goal: Generate structured execution plan before SQL generation
You are a SQL query planner for {DATABASE_DIALECT}.
User Question: {QUESTION}
Available Tables and Schemas:
{SELECTED_SCHEMAS}
Task: Create a detailed, numbered execution plan for answering this question.
Include:
1. Which tables to query and why
2. Join strategy (keys, join types)
3. Filter conditions with specific column names
4. Derived fields or calculations needed
5. Aggregation dimensions and metrics
6. Ordering and result limits
7. One-line sanity check describing expected result grain
If business terms are ambiguous, list clarifying questions.
If data values are uncertain, suggest sample queries to run.
Output the plan in numbered list format.
3. SQL Generation Prompt with Examples
Goal: Generate syntactically correct, optimized SQL
You are an expert {DATABASE_DIALECT} query generator.
User Question: {QUESTION}
Query Plan:
{APPROVED_PLAN}
Schemas:
{SCHEMAS}
Sample Data (if available):
{SAMPLES}
Task: Generate a {DIALECT} SQL query that implements the plan.
Requirements:
- Use CTEs for complex logic (WITH clauses)
- Include explicit table.column qualifications
- Add WHERE clauses with time ranges for partitioned tables
- Apply appropriate LIMIT (100-1000 rows)
- Use window functions instead of self-joins when possible
- Follow {DIALECT} syntax exactly (e.g., DATEADD, ARRAY_AGG, DATE_TRUNC)
- Include comments for complex logic
Output: SQL query only, no explanation.
4. SQL Repair Prompt
Goal: Fix validation errors while preserving query intent
You are a SQL repair assistant for {DATABASE_DIALECT}.
Input:
- Original SQL: {SQL}
- Parser errors: {PARSE_ERRORS}
- Linter warnings: {LINT_WARNINGS}
- Safety violations: {SAFETY_ISSUES}
- Available schemas: {SCHEMAS}
Task: Fix all errors and produce corrected SQL.
Rules:
- Keep CTE names and structure unchanged
- Never add DML operations (INSERT, UPDATE, DELETE)
- Fix only the identified issues
- Maintain original query intent and logic
- Follow {DIALECT} syntax exactly
Output: Corrected SQL only, no explanation.
5. Answer Summarization Prompt
Goal: Convert SQL results into clear natural language answer
You are a data analyst summarizing SQL query results.
User Question: {QUESTION}
SQL Query Executed:
{SQL}
Query Results:
{RESULTS}
Tables Used: {TABLES}
Task: Write a clear, concise answer (2-3 paragraphs) that:
1. Directly answers the question with key metrics
2. Highlights important patterns or outliers
3. References specific numbers from results
4. Mentions any caveats or clarifications
5. Uses business language, not technical jargon
Output the natural language summary.
Performance Metrics and KPIs for NL2SQL Systems
Monitor these key metrics to ensure your natural language to SQL system maintains high quality in production.
Core Accuracy Metrics
- Validation Failure Rate: Percentage of generated queries that fail parsing or linting (target: <5%)
- Execution Success Rate: Percentage of validated queries that complete within timeout and limits (target: >95%)
- Result Correctness: Percentage of queries returning correct results, measured via test set or user feedback (target: >85%)
- Schema Retrieval Precision: Percentage of retrieved tables actually used in final SQL (target: >80%)
- Schema Retrieval Recall: Percentage of necessary tables successfully retrieved (target: >95%)
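The two retrieval metrics can be computed directly from query logs; this sketch assumes you log the retrieved tables, the tables referenced by the final SQL, and a labeled set of necessary tables:

```python
def retrieval_precision_recall(retrieved, used, necessary):
    """Schema retrieval quality from query logs.

    precision: share of retrieved tables the final SQL actually referenced
    recall: share of necessary tables the retriever surfaced
    """
    retrieved, used, necessary = set(retrieved), set(used), set(necessary)
    precision = len(retrieved & used) / len(retrieved) if retrieved else 0.0
    recall = len(retrieved & necessary) / len(necessary) if necessary else 1.0
    return precision, recall
```

Low precision means the retriever is padding the LLM context with irrelevant schemas; low recall means generated queries are likely to hallucinate or omit joins.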
Performance Metrics
- End-to-End Latency: Time from question to answer, including retrieval, generation, validation, execution (target: <10s for P95)
- Query Execution Time: Database query runtime (target: <5s for P95)
- Token Usage: LLM tokens consumed per question for cost optimization
- Repair Loop Iterations: Average number of validation-repair cycles (target: <1.5)
User Experience Metrics
- Answer Satisfaction: User ratings on 5-point scale after seeing answer (target: >4.0 average)
- SQL Acceptance Rate: Percentage of generated SQL queries user accepts without modification (target: >75%)
- Clarification Rate: Percentage of questions requiring clarification (track but don't necessarily minimize—good clarifications improve accuracy)
- Retry Rate: Percentage of questions users rephrase after unsatisfactory answer (target: <20%)
Operational Metrics
- Error Rate: System errors, timeouts, exceptions (target: <1%)
- Resource Violations: Queries exceeding row/timeout limits (target: <5%)
- Security Violations: Attempts to execute DML or access forbidden tables (target: 0, alert on any)
Testing Strategy for NL2SQL
- Golden test set: Maintain 100+ question-SQL pairs covering common patterns, edge cases, and multi-table joins
- Regression testing: Run golden set on every deployment to catch accuracy regressions
- A/B testing: Compare prompt variations, retrieval strategies, or models on real traffic
- User feedback loop: Collect thumbs up/down with optional comments; review low-rated queries weekly
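A golden-set regression check can be as simple as the sketch below; generate_sql and execute are stand-ins for your pipeline and database client, and rows are compared order-insensitively under the assumption that row order is not part of correctness for these cases:

```python
def run_golden_set(cases, generate_sql, execute):
    """Return the questions whose generated SQL produced wrong results.

    cases: dicts with "question" and "expected_rows" keys.
    """
    failures = []
    for case in cases:
        rows = execute(generate_sql(case["question"]))
        # order-insensitive comparison of result sets
        if sorted(rows) != sorted(case["expected_rows"]):
            failures.append(case["question"])
    return failures
```

Wire this into CI so any prompt, model, or retrieval change that regresses accuracy blocks the deployment.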
Frequently Asked Questions about Natural Language to SQL
What is Natural Language to SQL (NL2SQL)?
Natural Language to SQL (also called text-to-SQL or NL2SQL) is a technology that automatically converts questions written in human language into executable SQL queries. Modern NL2SQL systems use large language models (LLMs) combined with hybrid retrieval, validation, and safety controls to generate accurate, safe queries for production databases.
How does hybrid retrieval improve NL2SQL accuracy?
Hybrid retrieval combines two complementary search methods: BM25 for exact keyword matching and dense vector embeddings for semantic similarity. BM25 excels at finding tables/columns with matching names, while vectors understand conceptual relationships. A cross-encoder reranker then refines the top results. This multi-stage approach achieves higher precision than either method alone, typically improving schema retrieval recall by 15-30%.
What validation is needed for production NL2SQL systems?
Production NL2SQL requires multi-layer validation:
- Parsing: Verify syntactic correctness using SQL parsers
- Semantic validation: Confirm tables and columns exist in schema
- Linting: Check code quality and best practices
- Safety checks: Block DML, DDL, and risky operations
- Performance checks: Warn on missing filters or Cartesian products
- Execution safety: Read-only role, timeouts, row caps
Why create a query plan before generating SQL?
A step-by-step query plan improves accuracy by making the LLM explicitly reason about joins, filters, and aggregations before generating code. Plans help identify ambiguous terms early, allow for sample data inspection, and provide a sanity check for result grain. Studies show planning reduces SQL generation errors by 20-40% compared to direct generation, especially for complex multi-table queries.
Which LLMs work best for NL2SQL?
As of 2025, top-performing models for text-to-SQL include:
- GPT-4 Turbo / GPT-4o: Excellent general-purpose accuracy, strong with complex queries
- Claude 3.5 Sonnet: Very high accuracy, good at following constraints
- Code Llama 70B / DeepSeek Coder: Strong open-source options, require fine-tuning
- Specialized models: SQL-specific fine-tunes like SQLCoder, DIN-SQL adapters
Most production systems use GPT-4 or Claude for the planning/generation stages due to reliability, with smaller models for re-ranking or repair tasks.
How do you handle SQL dialects (Snowflake, BigQuery, PostgreSQL)?
Specify the target SQL dialect explicitly in all prompts (e.g., "Generate Snowflake SQL"). Use sqlglot to parse and transpile between dialects for validation. Maintain dialect-specific few-shot examples showing syntax differences (DATEADD vs DATE_ADD, LISTAGG vs STRING_AGG, etc.). Test your prompt library against each dialect separately.
What are common failure modes in NL2SQL systems?
- Hallucinated tables/columns: LLM invents non-existent schema elements—fix with strict validation
- Incorrect join logic: Wrong join keys or join types—improve with planning stage
- Ambiguous aggregation: Unclear GROUP BY dimensions—require explicit plan review
- Performance issues: Missing WHERE clauses on large tables—enforce partition filters
- Edge case handling: NULLs, duplicates, timezone issues—add to few-shot examples
How much does NL2SQL cost to run in production?
Typical cost per query in 2025:
- Embedding generation: $0.0001-0.001 (for schema retrieval, can be cached)
- LLM calls: $0.01-0.05 per query (planning + generation + potential repairs)
- Database execution: $0.001-0.10 depending on warehouse and query complexity
- Total per query: $0.01-0.15, heavily influenced by query complexity and required repair iterations
Optimize costs by caching schema embeddings, using smaller models for simple queries, and minimizing repair loops through better prompts.