Skip to content

Text-to-SQL

The Text2SQL module is a demonstration of how to use Guardrails to build a text-to-SQL workflow. It implements necessary components to build a text-to-SQL workflow, including:

  • Connects to your database
  • Given a DB schema, supports SQL validation and error handling
  • Finds relevant few shot examples of text2SQL history, and uses them to generate a prompt
  • Optionally adds modules to guard against unsafe SQL (allowing INSERT, DROP, etc.)
  • Supports reasking for all invalid SQL (SQL with syntax errors, SQL targeting non-existent tables, SQL with prohibited keywords, etc.)

Info

The text2SQL module is intended to showcase what an end-to-end text2SQL workflow looks like with Guardrails. As a developer, you can either use this module as a starting point for your own text2SQL workflow or you can use it as a reference for how to use Guardrails to build your own custom text2SQL workflow.

🎱 Why use Guardrails for text-to-SQL?

  1. bug-free-sql validators: Guardrails provides a bug-free-sql validator that creates a sandboxed environment to run the generated SQL against your database and check for bugs and errors.
  2. Protection against unsafe SQL: Guardrails allows developers to constrain the generated SQL to a subset of SQL that is safe to run against your database. For example, you can configure Guardrails to only allow SELECT statements, and disallow INSERT, DROP, etc.
  3. Reasking: Guardrails allows you to configure reasking logic, so that if the generated SQL is invalid or has bugs, you can reask the LLM to generate a new SQL. Guardrails automatically handles the reasking logic for you.

🚀 Quickstart

To get started, you'll need the following: 1. Either a connection string to your database or a database schema 2. (Optional) Dictionary of few shot examples of text2SQL history

import os
import json

from rich import print

from guardrails.applications.text2sql import Text2Sql

os.environ['OPENAI_API_KEY'] = 'YOUR_API_KEY'

📝 Setup database connection and examples

EXAMPLES = "examples.json"
SQL_SCHEMA = "schema.sql"
# Alternatively, you can specify a connection string instead of a schema file, like so:
# SQL_CONN = f"sqlite:///{os.getcwd()}department_management.sqlite"

with open(EXAMPLES, "r") as f:
    examples = json.load(f)

print(examples[:2])
[
    {
        'question': 'How many heads of the departments are older than 56 ?',
        'query': 'SELECT count(*) FROM head WHERE age  >  56'
    },
    {
        'question': 'List the name, born state and age of the heads of departments ordered by age.',
        'query': 'SELECT name ,  born_state ,  age FROM head ORDER BY age'
    }
]

💡 Use Text2SQL Application

  • Sets up sandboxed DB based on your SQL schema / connection string
  • Finds most relevant examples and inserts them into the prompt
  • Checks that the query is valid for the schema
from guardrails.applications.text2sql import Text2Sql

app = Text2Sql(
    "sqlite://",
    schema_file=SQL_SCHEMA,
    examples=examples,
)

# Call the application with a natural language question.
print(app("What is the name of the department with the highest number of employees?"))
SELECT name FROM department ORDER BY Num_Employees DESC LIMIT 1

❎ Demonstration on an incorrect output

Below is a demonstration of how Guardrails' Text2SQL module handles an incorrect output from the LLM. The LLM generates an SQL that has a syntax error, and Guardrails catches this error and reasks the LLM to generate a new SQL.

The output is incorrect because it refers to a table (departments) that does not exist in the database. Guardrails catches this error and reasks the LLM to generate a new SQL.

Incorrect output:

{"generated_sql": "SELECT name FROM departments ORDER BY num_employees DESC LIMIT 1"}

Below, we can see the final corrected output, as well as the underlying logs to see what happened step by step:

Correct output:

SELECT Name FROM department ORDER BY Num_Employees DESC LIMIT 1
Logs
├── ╭────────────────────────────────────────────────── Step 0 ───────────────────────────────────────────────────╮
│   │ ╭──────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────╮ │
│   │ │ No prompt                                                                                               │ │
│   │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│   │ ╭──────────────────────────────────────────── Raw LLM Output ─────────────────────────────────────────────╮ │
│   │ │ {"generated_sql": "SELECT name FROM departments ORDER BY num_employees DESC LIMIT 1"}                   │ │
│   │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│   │ ╭─────────────────────────────────────────── Validated Output ────────────────────────────────────────────╮ │
│   │ │ {                                                                                                       │ │
│   │ │     'generated_sql': ReAsk(                                                                             │ │
│   │ │         incorrect_value='SELECT name FROM departments ORDER BY num_employees DESC LIMIT 1',             │ │
│   │ │         error_message='(sqlite3.OperationalError) no such table: departments\n[SQL: SELECT name FROM    │ │
│   │ │ departments ORDER BY num_employees DESC LIMIT 1]\n(Background on this error at:                         │ │
│   │ │ https://sqlalche.me/e/20/e3q8)',                                                                        │ │
│   │ │         fix_value=None,                                                                                 │ │
│   │ │         path=['generated_sql']                                                                          │ │
│   │ │     )                                                                                                   │ │
│   │ │ }                                                                                                       │ │
│   │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│   ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
└── ╭────────────────────────────────────────────────── Step 1 ───────────────────────────────────────────────────╮
    │ ╭──────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────╮ │
    │ │                                                                                                         │ │
    │ │ You are a data scientist whose job is to write SQL queries.                                             │ │
    │ │                                                                                                         │ │
    │ │                                                                                                         │ │
    │ │ Given below is XML that describes the information to extract from this document and the tags to extract │ │
    │ │ it into.                                                                                                │ │
    │ │                                                                                                         │ │
    │ │ <output>                                                                                                │ │
    │ │     <sql name="generated_sql" description="Generate SQL for the given natural language instruction."    │ │
    │ │ format="bug-free-sql"/>                                                                                 │ │
    │ │ </output>                                                                                               │ │
    │ │                                                                                                         │ │
    │ │                                                                                                         │ │
    │ │ ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the │ │
    │ │ `name` attribute of the corresponding XML, and the value is of the type specified by the corresponding  │ │
    │ │ XML's tag. The JSON MUST conform to the XML format, including any types and format requests e.g.        │ │
    │ │ requests for lists, objects and specific types. Be correct and concise.                                 │ │
    │ │                                                                                                         │ │
    │ │ Here are examples of simple (XML, JSON) pairs that show the expected behavior:                          │ │
    │ │ - `<string name='foo' format='two-words lower-case' />` => `{'foo': 'example one'}`                     │ │
    │ │ - `<list name='bar'><string format='upper-case' /></list>` => `{"bar": ['STRING ONE', 'STRING TWO',     │ │
    │ │ etc.]}`                                                                                                 │ │
    │ │ - `<object name='baz'><string name="foo" format="capitalize two-words" /><integer name="index"          │ │
    │ │ format="1-indexed" /></object>` => `{'baz': {'foo': 'Some String', 'index': 1}}`                        │ │
    │ │                                                                                                         │ │
    │ │                                                                                                         │ │
    │ │ Here's schema about the database that you can use to generate the SQL query.                            │ │
    │ │ Try to avoid using joins if the data can be retrieved from the same table.                              │ │
    │ │                                                                                                         │ │
    │ │ Table: department                                                                                       │ │
    │ │     Column: Department_ID                                                                               │ │
    │ │         type: INTEGER                                                                                   │ │
    │ │     Column: Name                                                                                        │ │
    │ │         type: TEXT                                                                                      │ │
    │ │     Column: Creation                                                                                    │ │
    │ │         type: TEXT                                                                                      │ │
    │ │     Column: Ranking                                                                                     │ │
    │ │         type: INTEGER                                                                                   │ │
    │ │     Column: Budget_in_Billions                                                                          │ │
    │ │         type: REAL                                                                                      │ │
    │ │     Column: Num_Employees                                                                               │ │
    │ │         type: REAL                                                                                      │ │
    │ │ Table: head                                                                                             │ │
    │ │     Column: head_ID                                                                                     │ │
    │ │         type: INTEGER                                                                                   │ │
    │ │     Column: name                                                                                        │ │
    │ │         type: TEXT                                                                                      │ │
    │ │     Column: born_state                                                                                  │ │
    │ │         type: TEXT                                                                                      │ │
    │ │     Column: age                                                                                         │ │
    │ │         type: REAL                                                                                      │ │
    │ │ Table: management                                                                                       │ │
    │ │     Column: department_ID                                                                               │ │
    │ │         type: INTEGER                                                                                   │ │
    │ │         foreign_key: {'table': 'department', 'column': 'Department_ID'}                                 │ │
    │ │     Column: head_ID                                                                                     │ │
    │ │         type: INTEGER                                                                                   │ │
    │ │         foreign_key: {'table': 'head', 'column': 'head_ID'}                                             │ │
    │ │     Column: temporary_acting                                                                            │ │
    │ │         type: TEXT                                                                                      │ │
    │ │                                                                                                         │ │
    │ │ I will give you a list of examples.                                                                     │ │
    │ │                                                                                                         │ │
    │ │                                                                                                         │ │
    │ │                                                                                                         │ │
    │ │ I want to create a query for the following instruction:                                                 │ │
    │ │                                                                                                         │ │
    │ │ What is the name of the department with the highest number of employees?                                │ │
    │ │                                                                                                         │ │
    │ │ For this instruction, I was given the following JSON, which has some incorrect values.                  │ │
    │ │                                                                                                         │ │
    │ │ {                                                                                                       │ │
    │ │   "generated_sql": {                                                                                    │ │
    │ │     "incorrect_value": "SELECT name FROM departments ORDER BY num_employees DESC LIMIT 1",              │ │
    │ │     "error_message": "(sqlite3.OperationalError) no such table: departments\n[SQL: SELECT name FROM     │ │
    │ │ departments ORDER BY num_employees DESC LIMIT 1]\n(Background on this error at:                         │ │
    │ │ https://sqlalche.me/e/20/e3q8)"                                                                         │ │
    │ │   }                                                                                                     │ │
    │ │ }                                                                                                       │ │
    │ │                                                                                                         │ │
    │ │ Help me correct the incorrect values based on the given error messages.                                 │ │
    │ │                                                                                                         │ │
    │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
    │ ╭──────────────────────────────────────────── Raw LLM Output ─────────────────────────────────────────────╮ │
    │ │ {"generated_sql": "SELECT Name FROM department ORDER BY Num_Employees DESC LIMIT 1"}                    │ │
    │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
    │ ╭─────────────────────────────────────────── Validated Output ────────────────────────────────────────────╮ │
    │ │ {                                                                                                       │ │
    │ │     'generated_sql': 'SELECT Name FROM department ORDER BY Num_Employees DESC LIMIT 1'                  │ │
    │ │ }                                                                                                       │ │
    │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
    ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

⚒️ Customizing the Text2SQL module

You can customize the Text2SQL module to fit your needs. Here are some examples of how you can customize the module:

  1. Prompt: You can customize the text2SQL prompt by passing in a custom RAIL spec to the module in the rail_spec argument.
  2. Few shot examples: You can customize how the few shot examples are stored, retrieved, and inserted into the prompt by updating the arguments vector_db, document_store, num_relevant_examples and example_formatter.
  3. Reask prompt: You can customize the reask prompt by updating the reask_prompt argument.
  4. LLM API: You can customize the LLM API by updating the llm_api argument. To pass in specific kwargs to the LLM API, you can update the llm_api_kwargs argument (e.g. llm_api_kwargs={'max_length': 100}).