Generating Structured Synthetic Data
In this example, we'll generate structured dummy data for a pandas
dataframe.
We make the following assumptions:
- We don't need any external libraries that are not already installed in the environment.
- We are able to execute the code in the environment.
Objective
We want to generate structured synthetic data, where each column has a specific data type. All rows in the dataset must respect the column data types. Additionally, we have some more constraints we want the data to respect:
- There should be exactly 10 rows in the dataset.
- Each user should have a first name and a last name.
- The number of orders associated with each user should be between 0 and 50.
- Each user should have a most recent order date.
Step 1: Generating a RAIL Spec
Ordinarily, we would create the RAIL spec in a separate file. For the sake of this example, however, we define the RAIL spec as a string directly in the notebook. We also show the same spec in a code-first format using a Pydantic model.
RAIL spec as an XML string:
rail_str = """
<rail version="0.1">
<output>
<list description="Generate a list of user, and how many orders they have placed in the past." format="length: 10 10" name="user_orders" on-fail-length="noop">
<object>
<string description="The user's id." format="1-indexed" name="user_id"></string>
<string description="The user's first name and last name" format="two-words" name="user_name"></string>
<integer description="The number of orders the user has placed" format="valid-range: 0 50" name="num_orders"></integer>
<date description="Date of last order" name="last_order_date"></date>
</object>
</list>
</output>
<prompt>
Generate a dataset of fake user orders. Each row of the dataset should be valid.
${gr.complete_json_suffix}</prompt>
</rail>
"""
RAIL spec as a Pydantic model:
from pydantic import BaseModel, Field
from guardrails.validators import ValidLength, TwoWords, ValidRange
from datetime import date
from typing import List

prompt = """
Generate a dataset of fake user orders. Each row of the dataset should be valid.
${gr.complete_json_suffix}"""


class Order(BaseModel):
    user_id: str = Field(description="The user's id.", validators=[("1-indexed", "noop")])
    user_name: str = Field(
        description="The user's first name and last name",
        validators=[TwoWords()],
    )
    num_orders: int = Field(
        description="The number of orders the user has placed",
        validators=[ValidRange(0, 50)],
    )
    last_order_date: date = Field(description="Date of last order")


class Orders(BaseModel):
    user_orders: List[Order] = Field(
        description="Generate a list of users and how many orders they have placed in the past.",
        validators=[ValidLength(10, 10, on_fail="noop")],
    )
Step 2: Create a Guard object with the RAIL Spec
We create a gd.Guard object that will check, validate and correct the LLM's generated output. This object:
- Enforces the quality criteria specified in the RAIL spec (e.g. the column data types and row constraints).
- Takes corrective action when the quality criteria are not met (e.g. by reasking the LLM).
- Compiles the schema and type info from the RAIL spec and adds it to the prompt.
From our RAIL string:
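The snippet below is a minimal sketch: it assumes the guardrails package is imported as gd and that the installed version exposes the Guard.from_rail_string constructor for parsing a RAIL string.

import guardrails as gd

# Build a Guard directly from the RAIL XML string defined above.
guard = gd.Guard.from_rail_string(rail_str)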
From our Pydantic model:
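Equivalently, assuming the same version exposes Guard.from_pydantic, we can build the guard from the Orders model and the prompt string defined earlier:

# Build a Guard from the Pydantic output schema plus the prompt.
guard = gd.Guard.from_pydantic(output_class=Orders, prompt=prompt)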
The Guard object compiles the output schema and adds it to the prompt. We can inspect the final prompt below:
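One way to print it, assuming the guard exposes a base_prompt attribute as older Guardrails releases did:

# Show the prompt that will actually be sent to the LLM,
# including the compiled output schema and formatting instructions.
print(guard.base_prompt)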
Step 3: Wrap the LLM API call with Guard
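The call below is a sketch rather than the exact original cell: it assumes an OpenAI completions-style API and a Guardrails version whose Guard object is callable with the LLM function as its first argument; the engine name and decoding parameters are illustrative.

import openai

# Wrap the OpenAI call with the guard. The guard injects the compiled
# prompt, validates the JSON output against the schema, and applies the
# on-fail policies (e.g. reasking) when validation fails.
raw_llm_response, validated_response = guard(
    openai.Completion.create,
    engine="text-davinci-003",
    max_tokens=2048,
    temperature=0.0,
)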
Running the cell above returns:
1. The raw LLM text output as a single string.
2. A dictionary whose user_orders key contains a list of dictionaries, each of which represents a row in the dataframe.
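Finally, the validated output can be loaded into the pandas dataframe described in the objective. This is a minimal sketch assuming the validated output has the shape listed above:

import pandas as pd

# Each dictionary in user_orders becomes one row of the dataframe.
df = pd.DataFrame(validated_response["user_orders"])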