Generating Structured Synthetic Data

In this example, we'll generate structured dummy data for a pandas dataframe.

We make the assumption that:

  1. We don't need any external libraries that are not already installed in the environment.
  2. We are able to execute the code in the environment.

Objective

We want to generate structured synthetic data, where each column has a specific data type. All rows in the dataset must respect the column data types. Additionally, we have some more constraints we want the data to respect:

  1. There should be exactly 10 rows in the dataset.
  2. Each user should have a first name and a last name.
  3. The number of orders associated with each user should be between 0 and 50.
  4. Each user should have a most recent order date.
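
The four constraints above can also be written as plain Python checks, which is handy for sanity-checking whatever the LLM eventually returns. The following is a minimal sketch using only the standard library; `check_rows` is a hypothetical helper, and the field names (`user_name`, `num_orders`, `last_order_date`) are assumed to match the spec defined in Step 1. It also assumes dates arrive as ISO-formatted strings or `date` objects.

```python
from datetime import date


def check_rows(rows, expected_rows=10):
    """Return a list of constraint violations (empty if every check passes)."""
    problems = []

    # Constraint 1: the dataset has exactly `expected_rows` rows.
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(rows)}")

    for i, row in enumerate(rows):
        # Constraint 2: a first name and a last name, i.e. exactly two words.
        if len(str(row.get("user_name", "")).split()) != 2:
            problems.append(f"row {i}: user_name is not two words")

        # Constraint 3: the number of orders is an integer between 0 and 50.
        n = row.get("num_orders")
        if not isinstance(n, int) or not 0 <= n <= 50:
            problems.append(f"row {i}: num_orders is outside 0..50")

        # Constraint 4: the most recent order date parses as an ISO date.
        try:
            date.fromisoformat(str(row.get("last_order_date")))
        except ValueError:
            problems.append(f"row {i}: last_order_date is not a valid date")

    return problems


rows = [
    {"user_name": "John Smith", "num_orders": 10, "last_order_date": "2020-01-01"},
    {"user_name": "Madonna", "num_orders": 75, "last_order_date": "not a date"},
]
print(check_rows(rows, expected_rows=2))
```

The first row passes every check; the second row is deliberately invalid and fails the name, range and date checks.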

Step 1: Generating RAIL Spec

Ordinarily, we could create a separate RAIL spec in a file. However, for the sake of this example, we will generate the RAIL spec in the notebook as a string. We will also show the same RAIL spec in a code-first format using a Pydantic model.

RAIL spec as an XML string:

rail_str = """
<rail version="0.1">
<output>
<list description="Generate a list of user, and how many orders they have placed in the past." format="length: 10 10" name="user_orders" on-fail-length="noop">
<object>
<string description="The user's id." format="1-indexed" name="user_id"></string>
<string description="The user's first name and last name" format="two-words" name="user_name"></string>
<integer description="The number of orders the user has placed" format="valid-range: 0 50" name="num_orders"></integer>
<date description="Date of last order" name="last_order_date"></date>
</object>
</list>
</output>
<prompt>
Generate a dataset of fake user orders. Each row of the dataset should be valid.

${gr.complete_json_suffix}</prompt>
</rail>
"""

RAIL spec as a Pydantic model:

from pydantic import BaseModel, Field
from guardrails.validators import ValidLength, TwoWords, ValidRange
from datetime import date
from typing import List

prompt = """
Generate a dataset of fake user orders. Each row of the dataset should be valid.

${gr.complete_json_suffix}"""

class Order(BaseModel):
    user_id: str = Field(description="The user's id.", validators=[("1-indexed", "noop")])
    user_name: str = Field(
        description="The user's first name and last name",
        validators=[TwoWords()]
    )
    num_orders: int = Field(
        description="The number of orders the user has placed",
        validators=[ValidRange(0, 50)]
    )
    last_order_date: date = Field(description="Date of last order")

class Orders(BaseModel):
    user_orders: List[Order] = Field(
        description="Generate a list of user, and how many orders they have placed in the past.",
        validators=[ValidLength(10, 10, on_fail="noop")]
    )

Step 2: Create a Guard object with the RAIL Spec

We create a gd.Guard object that will check, validate and correct the generated data. This object:

  1. Enforces the quality criteria specified in the RAIL spec (i.e. the column types and constraints listed in the objective).
  2. Takes corrective action when the quality criteria are not met (e.g. reasking the LLM).
  3. Compiles the schema and type info from the RAIL spec and adds it to the prompt.

import guardrails as gd

from rich import print

From our RAIL string:

guard = gd.Guard.from_rail_string(rail_str)

From our Pydantic model:

guard = gd.Guard.from_pydantic(output_class=Orders, prompt=prompt)
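
The validate-then-reask behaviour described at the start of this step can be pictured with a small stand-alone loop. This is a toy illustration only and not how Guardrails is implemented: `fake_llm`, `validate` and `generate_with_reask` are hypothetical names, and the "LLM" is a stub that returns an out-of-range value on its first call.

```python
import json


def fake_llm(prompt):
    """Hypothetical stand-in for an LLM call; the first answer is out of range."""
    fake_llm.calls += 1
    if fake_llm.calls == 1:
        return '{"num_orders": 75}'
    return '{"num_orders": 20}'


fake_llm.calls = 0


def validate(output):
    """Return an error message, or None when the output is acceptable."""
    if not 0 <= output["num_orders"] <= 50:
        return "num_orders must be between 0 and 50"
    return None


def generate_with_reask(llm, prompt, max_reasks=1):
    """Call the LLM, validate the parsed output, and reask on failure."""
    for _ in range(max_reasks + 1):
        output = json.loads(llm(prompt))
        error = validate(output)
        if error is None:
            return output
        # Feed the failure back so the next attempt can correct it.
        prompt = f"{prompt}\nThe previous answer was invalid: {error}. Please fix it."
    return None  # Every attempt failed validation.


result = generate_with_reask(fake_llm, "Generate one user's order count as JSON.")
print(result, fake_llm.calls)
```

The first response fails the range check, so the loop appends the error to the prompt and calls the stub a second time.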

The Guard object compiles the output schema and adds it to the prompt. We can see the final prompt below:

print(guard.base_prompt)
Generate a dataset of fake user orders. Each row of the dataset should be valid.


Given below is XML that describes the information to extract from this document and the tags to extract it into.

<output>
    <list name="user_orders" format="length: min=10 max=10" description="Generate a list of user, and how many 
orders they have placed in the past.">
        <object>
            <string name="user_id" format="1-indexed" description="The user's id."/>
            <string name="user_name" format="two-words" description="The user's first name and last name"/>
            <integer name="num_orders" format="valid-range: min=0 max=50" description="The number of orders the 
user has placed"/>
            <date name="last_order_date" description="Date of last order"/>
        </object>
    </list>
</output>


ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the `name` 
attribute of the corresponding XML, and the value is of the type specified by the corresponding XML's tag. The JSON
MUST conform to the XML format, including any types and format requests e.g. requests for lists, objects and 
specific types. Be correct and concise. If you are unsure anywhere, enter `null`.

Here are examples of simple (XML, JSON) pairs that show the expected behavior:
- `<string name='foo' format='two-words lower-case' />` => `{'foo': 'example one'}`
- `<list name='bar'><string format='upper-case' /></list>` => `{"bar": ['STRING ONE', 'STRING TWO', etc.]}`
- `<object name='baz'><string name="foo" format="capitalize two-words" /><integer name="index" format="1-indexed" 
/></object>` => `{'baz': {'foo': 'Some String', 'index': 1}}`

Step 3: Wrap the LLM API call with Guard

import openai

raw_llm_response, validated_response = guard(
    openai.Completion.create, engine="text-davinci-003", max_tokens=2048, temperature=0
)
Async event loop found, but guard was invoked synchronously.For validator parallelization, please call `validate_async` instead.

Running the cell above returns:

  1. The raw LLM text output as a single string.
  2. A dictionary whose user_orders key contains a list of dictionaries, where each dictionary represents a row in the dataframe.

print(validated_response)
{
    'user_orders': [
        {'user_id': 1, 'user_name': 'John Smith', 'num_orders': 10, 'last_order_date': '2020-01-01'},
        {'user_id': 2, 'user_name': 'Jane Doe', 'num_orders': 20, 'last_order_date': '2020-02-01'},
        {'user_id': 3, 'user_name': 'Bob Jones', 'num_orders': 30, 'last_order_date': '2020-03-01'},
        {'user_id': 4, 'user_name': 'Alice Smith', 'num_orders': 40, 'last_order_date': '2020-04-01'},
        {'user_id': 5, 'user_name': 'John Doe', 'num_orders': 50, 'last_order_date': '2020-05-01'},
        {'user_id': 6, 'user_name': 'Jane Jones', 'num_orders': 0, 'last_order_date': '2020-06-01'},
        {'user_id': 7, 'user_name': 'Bob Smith', 'num_orders': 10, 'last_order_date': '2020-07-01'},
        {'user_id': 8, 'user_name': 'Alice Doe', 'num_orders': 20, 'last_order_date': '2020-08-01'},
        {'user_id': 9, 'user_name': 'John Jones', 'num_orders': 30, 'last_order_date': '2020-09-01'},
        {'user_id': 10, 'user_name': 'Jane Smith', 'num_orders': 40, 'last_order_date': '2020-10-01'}
    ]
}
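
Since the goal is a pandas dataframe, the list under `user_orders` can be passed straight to the `DataFrame` constructor. The sketch below assumes pandas is installed and, to stay self-contained, redefines `validated_response` with only the first two rows of the output above. The date column is parsed explicitly because the dates are printed as strings.

```python
import pandas as pd

# Two rows copied from the validated output above; the full response has ten.
validated_response = {
    "user_orders": [
        {"user_id": 1, "user_name": "John Smith", "num_orders": 10, "last_order_date": "2020-01-01"},
        {"user_id": 2, "user_name": "Jane Doe", "num_orders": 20, "last_order_date": "2020-02-01"},
    ]
}

# Each dictionary in the list becomes one row; the keys become column names.
df = pd.DataFrame(validated_response["user_orders"])

# Convert the order dates to a proper datetime column.
df["last_order_date"] = pd.to_datetime(df["last_order_date"])

print(df.dtypes)
```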
print(guard.state.most_recent_call.tree)
Logs
└── ╭────────────────────────────────────────────────── Step 0 ───────────────────────────────────────────────────╮
    │ ╭──────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────╮ │
    │ │                                                                                                         │ │
    │ │ Generate a dataset of fake user orders. Each row of the dataset should be valid.                        │ │
    │ │                                                                                                         │ │
    │ │                                                                                                         │ │
    │ │ Given below is XML that describes the information to extract from this document and the tags to extract │ │
    │ │ it into.                                                                                                │ │
    │ │                                                                                                         │ │
    │ │ <output>                                                                                                │ │
    │ │     <list name="user_orders" format="length: min=10 max=10" description="Generate a list of user, and   │ │
    │ │ how many orders they have placed in the past.">                                                         │ │
    │ │         <object>                                                                                        │ │
    │ │             <string name="user_id" format="1-indexed" description="The user's id."/>                    │ │
    │ │             <string name="user_name" format="two-words" description="The user's first name and last     │ │
    │ │ name"/>                                                                                                 │ │
    │ │             <integer name="num_orders" format="valid-range: min=0 max=50" description="The number of    │ │
    │ │ orders the user has placed"/>                                                                           │ │
    │ │             <date name="last_order_date" description="Date of last order"/>                             │ │
    │ │         </object>                                                                                       │ │
    │ │     </list>                                                                                             │ │
    │ │ </output>                                                                                               │ │
    │ │                                                                                                         │ │
    │ │                                                                                                         │ │
    │ │ ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the │ │
    │ │ `name` attribute of the corresponding XML, and the value is of the type specified by the corresponding  │ │
    │ │ XML's tag. The JSON MUST conform to the XML format, including any types and format requests e.g.        │ │
    │ │ requests for lists, objects and specific types. Be correct and concise. If you are unsure anywhere,     │ │
    │ │ enter `null`.                                                                                           │ │
    │ │                                                                                                         │ │
    │ │ Here are examples of simple (XML, JSON) pairs that show the expected behavior:                          │ │
    │ │ - `<string name='foo' format='two-words lower-case' />` => `{'foo': 'example one'}`                     │ │
    │ │ - `<list name='bar'><string format='upper-case' /></list>` => `{"bar": ['STRING ONE', 'STRING TWO',     │ │
    │ │ etc.]}`                                                                                                 │ │
    │ │ - `<object name='baz'><string name="foo" format="capitalize two-words" /><integer name="index"          │ │
    │ │ format="1-indexed" /></object>` => `{'baz': {'foo': 'Some String', 'index': 1}}`                        │ │
    │ │                                                                                                         │ │
    │ │                                                                                                         │ │
    │ │ Json Output:                                                                                            │ │
    │ │                                                                                                         │ │
    │ │                                                                                                         │ │
    │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
    │ ╭──────────────────────────────────────────── Message History ────────────────────────────────────────────╮ │
    │ │ ┏━━━━━━┳━━━━━━━━━┓                                                                                      │ │
    │ │ ┃ Role  Content ┃                                                                                      │ │
    │ │ ┡━━━━━━╇━━━━━━━━━┩                                                                                      │ │
    │ │ └──────┴─────────┘                                                                                      │ │
    │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
    │ ╭──────────────────────────────────────────── Raw LLM Output ─────────────────────────────────────────────╮ │
    │ │ {                                                                                                       │ │
    │ │     "user_orders": [                                                                                    │ │
    │ │         {                                                                                               │ │
    │ │             "user_id": 1,                                                                               │ │
    │ │             "user_name": "John Smith",                                                                  │ │
    │ │             "num_orders": 10,                                                                           │ │
    │ │             "last_order_date": "2020-01-01"                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             "user_id": 2,                                                                               │ │
    │ │             "user_name": "Jane Doe",                                                                    │ │
    │ │             "num_orders": 20,                                                                           │ │
    │ │             "last_order_date": "2020-02-01"                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             "user_id": 3,                                                                               │ │
    │ │             "user_name": "Bob Jones",                                                                   │ │
    │ │             "num_orders": 30,                                                                           │ │
    │ │             "last_order_date": "2020-03-01"                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             "user_id": 4,                                                                               │ │
    │ │             "user_name": "Alice Smith",                                                                 │ │
    │ │             "num_orders": 40,                                                                           │ │
    │ │             "last_order_date": "2020-04-01"                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             "user_id": 5,                                                                               │ │
    │ │             "user_name": "John Doe",                                                                    │ │
    │ │             "num_orders": 50,                                                                           │ │
    │ │             "last_order_date": "2020-05-01"                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             "user_id": 6,                                                                               │ │
    │ │             "user_name": "Jane Jones",                                                                  │ │
    │ │             "num_orders": 0,                                                                            │ │
    │ │             "last_order_date": "2020-06-01"                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             "user_id": 7,                                                                               │ │
    │ │             "user_name": "Bob Smith",                                                                   │ │
    │ │             "num_orders": 10,                                                                           │ │
    │ │             "last_order_date": "2020-07-01"                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             "user_id": 8,                                                                               │ │
    │ │             "user_name": "Alice Doe",                                                                   │ │
    │ │             "num_orders": 20,                                                                           │ │
    │ │             "last_order_date": "2020-08-01"                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             "user_id": 9,                                                                               │ │
    │ │             "user_name": "John Jones",                                                                  │ │
    │ │             "num_orders": 30,                                                                           │ │
    │ │             "last_order_date": "2020-09-01"                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             "user_id": 10,                                                                              │ │
    │ │             "user_name": "Jane Smith",                                                                  │ │
    │ │             "num_orders": 40,                                                                           │ │
    │ │             "last_order_date": "2020-10-01"                                                             │ │
    │ │         }                                                                                               │ │
    │ │     ]                                                                                                   │ │
    │ │ }                                                                                                       │ │
    │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
    │ ╭─────────────────────────────────────────── Validated Output ────────────────────────────────────────────╮ │
    │ │ {                                                                                                       │ │
    │ │     'user_orders': [                                                                                    │ │
    │ │         {                                                                                               │ │
    │ │             'user_id': 1,                                                                               │ │
    │ │             'user_name': 'John Smith',                                                                  │ │
    │ │             'num_orders': 10,                                                                           │ │
    │ │             'last_order_date': '2020-01-01'                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             'user_id': 2,                                                                               │ │
    │ │             'user_name': 'Jane Doe',                                                                    │ │
    │ │             'num_orders': 20,                                                                           │ │
    │ │             'last_order_date': '2020-02-01'                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             'user_id': 3,                                                                               │ │
    │ │             'user_name': 'Bob Jones',                                                                   │ │
    │ │             'num_orders': 30,                                                                           │ │
    │ │             'last_order_date': '2020-03-01'                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             'user_id': 4,                                                                               │ │
    │ │             'user_name': 'Alice Smith',                                                                 │ │
    │ │             'num_orders': 40,                                                                           │ │
    │ │             'last_order_date': '2020-04-01'                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             'user_id': 5,                                                                               │ │
    │ │             'user_name': 'John Doe',                                                                    │ │
    │ │             'num_orders': 50,                                                                           │ │
    │ │             'last_order_date': '2020-05-01'                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             'user_id': 6,                                                                               │ │
    │ │             'user_name': 'Jane Jones',                                                                  │ │
    │ │             'num_orders': 0,                                                                            │ │
    │ │             'last_order_date': '2020-06-01'                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             'user_id': 7,                                                                               │ │
    │ │             'user_name': 'Bob Smith',                                                                   │ │
    │ │             'num_orders': 10,                                                                           │ │
    │ │             'last_order_date': '2020-07-01'                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             'user_id': 8,                                                                               │ │
    │ │             'user_name': 'Alice Doe',                                                                   │ │
    │ │             'num_orders': 20,                                                                           │ │
    │ │             'last_order_date': '2020-08-01'                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             'user_id': 9,                                                                               │ │
    │ │             'user_name': 'John Jones',                                                                  │ │
    │ │             'num_orders': 30,                                                                           │ │
    │ │             'last_order_date': '2020-09-01'                                                             │ │
    │ │         },                                                                                              │ │
    │ │         {                                                                                               │ │
    │ │             'user_id': 10,                                                                              │ │
    │ │             'user_name': 'Jane Smith',                                                                  │ │
    │ │             'num_orders': 40,                                                                           │ │
    │ │             'last_order_date': '2020-10-01'                                                             │ │
    │ │         }                                                                                               │ │
    │ │     ]                                                                                                   │ │
    │ │ }                                                                                                       │ │
    │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
    ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯