
Synthesizing Evaluation Data for RAG Systems: A Deep Dive

May 23, 2026•16 min read•By Mohammed Vasim

Tags: RAG, LLM, data-synthesis, evaluation, langchain, nvidia-api

The Problem We're Solving

Evaluating RAG (Retrieval-Augmented Generation) systems is challenging. You need:

Ground truth Q&A pairs that accurately reflect your documents
Realistic questions users would actually ask
Context to verify answers were derived from documents, not hallucinations
Scale — manually creating test cases is impractical

This notebook automates all of this by leveraging LLMs to generate contextually relevant questions and answers from PDF documents.

The Pipeline

PDFs → Markdown → Chunks → LLM (Agent) → Q&A Pairs → DataFrame → CSV

Each step serves a specific purpose in creating high-quality evaluation data automatically.

python

!pip install "chromadb>=1.4.1" "deepeval>=3.8.3" "ipykernel>=7.2.0" "ipywidgets>=8.1.8" "langchain>=1.2.7" "langchain-community>=0.4.1" "langchain-core>=1.2.7" "langchain-openai>=1.1.7" "langgraph>=1.0.7" "pandas>=3.0.3" "pymupdf4llm>=1.27.2.3" "pypdf>=6.6.2" "tiktoken>=0.12.0"

Step 1: Dependencies & Setup

We need specific tools for our pipeline. Let's install them with minimum version bounds so the APIs used in this notebook are available.

What We're Installing

Think of this as gathering all the tools you need before starting a big project. We're installing:

LangChain ecosystem — Tools to build LLM workflows (LangChain, LangChain-OpenAI, LangChain-Community)
PDF Processing — PyMuPDF4LLM (converts PDFs to clean markdown) and PyPDF (for PDF manipulation)
Data Handling — Pandas for organizing results into tables
Vector Databases — ChromaDB (though we won't use it here, good to have)
Evaluation — DeepEval for assessing LLM output quality

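We set minimum versions (like >=1.4.1) so the APIs this notebook relies on are present. Note that >= is a lower bound, not an exact pin: pip will still pick up newer releases. For strict reproducibility, where running the same code a year from now installs the same versions, use == pins or a lock file.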
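One way to get exact pins is to record what is actually installed after a successful run. The sketch below is an illustration, not part of the original notebook: `pin_installed` is a hypothetical helper, and its `lookup` parameter exists only so the function can be exercised without the packages present.

```python
from importlib import metadata


def pin_installed(packages, lookup=metadata.version):
    """Return requirements-style lines with exact (==) versions."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker rather than silently dropping the package
            lines.append(f"# {name} is not installed")
    return lines
```

Calling `pin_installed(["langchain", "pandas", "pymupdf4llm"])` in the notebook and saving the result next to the generated dataset gives you a record of the environment that produced it.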

python

"""Implementing data synthesis for RAG evaluation"""

import os

Initializing the Environment

Here we're just importing the essentials and setting up our workspace. We start with:

os — To read environment variables (like our API key)

Think of this as opening your toolkit and checking that everything is at hand before we start working.

Step 2: API Configuration

We use NVIDIA's API instead of OpenAI directly for several reasons:

Cost — Cheaper endpoints for development
Free access — A free tier is available
Model access — Open-source models like gpt-oss-120b
Embeddings — Optimized embedding models like nv-embed-v1

The OpenAI client library is compatible with NVIDIA's API, so we just change the base URL.

Setting Up Your API Keys & Model Names

Think of this as configuring your delivery address. We're telling the system:

Where to send requests — NVIDIA's API endpoint (not OpenAI's default)
How to authenticate — Your API key from the environment
Which models to use — Embeddings model for creating vectors, chat model for generating Q&A

Why NVIDIA instead of OpenAI? They offer:

Lower costs — Especially good for development and batch processing
Open-source models — Like GPT-OSS-120B (faster, cheaper, fully open)
Optimized embeddings — NVIDIA's nv-embed-v1 is specifically tuned for semantic search

Your API key is stored as an environment variable (NVIDIA_API_KEY) for security — never hardcode secrets in notebooks!
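Because `os.getenv` quietly returns `None` when a variable is missing, a small guard can surface the problem before the first API call. This is a minimal sketch; `require_env` is a hypothetical helper, and the optional `env` argument is there only to make it easy to test.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable, or fail with a clear message."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value
```

With this in place, `API_KEY = require_env("NVIDIA_API_KEY")` stops the notebook at the configuration cell instead of failing later with an authentication error.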

python

BASE_URL = "https://integrate.api.nvidia.com/v1"
API_KEY = os.getenv("NVIDIA_API_KEY")
EMBED_MODEL_NAME = "nvidia/nv-embed-v1"
CHAT_MODEL_NAME = "openai/gpt-oss-120b"

python

from langchain_openai import ChatOpenAI

# Point the OpenAI-compatible client at NVIDIA's endpoint,
# reusing the constants defined in the configuration cell
chat_model = ChatOpenAI(
    model=CHAT_MODEL_NAME,
    api_key=API_KEY,
    base_url=BASE_URL
)

Creating Your Chat Model Object

Here's where we actually connect to the API. ChatOpenAI from LangChain is a wrapper that:

Abstracts away the complexity of making API calls
Handles message formatting automatically
Provides a clean Python interface

We're pointing it to NVIDIA's API by overriding the base_url. This gives us the familiar OpenAI interface but with NVIDIA's cheaper, faster models in the background.

Think of it like this: ChatOpenAI is a translator that speaks OpenAI's API format but delivers your requests to NVIDIA's servers instead.

python

chat_model.invoke("hello")

Testing the Connection

Before we process hundreds of documents, let's make sure everything works. This is a smoke test — a quick sanity check that:

The API credentials are valid
The network connection works
The model endpoint is reachable

If this fails, we catch the error early before wasting time. It's like testing your car starts before driving cross-country.
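If you would rather get a readable status than a stack trace, the smoke test can be wrapped as below. This is a sketch: `smoke_test` is a hypothetical helper that accepts any callable, so in the notebook you would pass `chat_model.invoke`.

```python
def smoke_test(invoke, prompt="hello"):
    """Return (ok, detail) instead of raising, so the notebook can stop early."""
    try:
        reply = invoke(prompt)
    except Exception as exc:  # broad on purpose: any failure means "not ready"
        return False, f"{type(exc).__name__}: {exc}"
    return True, reply
```

Usage would look like `ok, detail = smoke_test(chat_model.invoke)`, followed by a check on `ok` before starting the long-running cells.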

python

files = [
    "/Users/vasim/Downloads/Data Engineering for Foundation Models: The Alchemist’s Cookbook | Mohammed Vasim.pdf",
    "/Users/vasim/Downloads/The Tale of Meaningful Vectors: Contrastive Learning for Text Embeddings, Told with Pen and Paper | Mohammed Vasim.pdf",
]

Specifying Your Input Documents

Here we define which PDFs to mine for Q&A pairs. In this example, we're using:

Two technical papers about machine learning and data engineering

Key insight: The quality and relevance of your evaluation data depends on the documents you choose. If your PDFs are about data engineering, you'll generate Q&A pairs that are hyper-relevant for evaluating a data engineering RAG system.

Change this list to point to your own domain documents for domain-specific evaluation data.
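Absolute paths like the ones above tend to break when the notebook moves to another machine, so it can help to check them before conversion starts. A minimal sketch, with `missing_files` as a hypothetical helper name:

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths that do not point at an existing file."""
    return [p for p in paths if not Path(p).is_file()]
```

Running `missing_files(files)` should return an empty list; anything else names the documents that need a corrected path.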

python

import random
import string
import pymupdf.layout  # import before pymupdf4llm; assumes the pymupdf-layout package is installed

import pymupdf4llm

Importing PDF & Utility Libraries

We're gathering the tools we need for PDF processing:

pymupdf — Lower-level PDF manipulation (gives us granular control)
pymupdf4llm — Higher-level API that converts PDFs directly to markdown (perfect for LLMs)
random & string — For generating random IDs (useful for tracking generated pairs)

The beauty of pymupdf4llm is that it understands PDFs are meant for humans to read, so it preserves structure, headings, and layout in the markdown output. No garbled text!

Step 3: PDF to Markdown Conversion

Why convert PDFs to markdown instead of raw text?

LLM-friendly — Markdown is semantic and preserves structure
Preserves hierarchy — Headings and sections remain meaningful
Handles images — Can optionally extract and catalog images
Better extraction — Preserves relationships between content

This transformation is crucial because LLMs work better with structured, semantic content.

Utility: Generating Random IDs

This is a simple helper function that creates short random strings (unique with high probability, though collisions are possible). Why?

When you generate hundreds of Q&A pairs, having unique IDs helps with:

Tracking — Know which pair came from which run
Deduplication — If you re-run the pipeline, you can filter out pairs you've already generated
Audit trails — Keep a record of what was generated when

The function creates strings like "A7K9P2" — useful for human-readable identifiers.

python

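def generate_rand_string(length=6):
    """Generate a random uppercase alphanumeric string."""
    # Draw from uppercase letters and digits directly, so every
    # character is equally likely and no .upper() call is needed
    alphabet = string.ascii_uppercase + string.digits
    return "".join(random.choices(alphabet, k=length))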

print(generate_rand_string())
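Random six-character strings can collide once enough of them are generated. If the IDs are used for deduplication, one option is to track the ones already issued and redraw on a clash. The sketch below assumes that approach; `unique_rand_string` and the `seen` set are illustrative names, not part of the original notebook.

```python
import random
import string

ALPHABET = string.ascii_uppercase + string.digits


def unique_rand_string(seen, length=6, rng=random):
    """Draw random strings until one is not in `seen`, then record and return it."""
    while True:
        candidate = "".join(rng.choices(ALPHABET, k=length))
        if candidate not in seen:
            seen.add(candidate)
            return candidate
```

The caller owns the `seen` set, so it can be persisted between runs if IDs need to stay unique across datasets.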

python

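def convert_pdf_to_md(files: list[str], write_images: bool = False) -> dict:
    """Convert each PDF to markdown, keyed by its filename."""
    content = {}

    for file in files:
        filename = os.path.basename(file)
        img_path = f"../data/images/{filename}"
        if write_images:
            # The image folder is listed below, so make sure it exists
            os.makedirs(img_path, exist_ok=True)

        content[filename] = pymupdf4llm.to_markdown(
            file, image_path=img_path, write_images=write_images
        )

        if write_images:
            # Image paths share the dict with the markdown text, keyed by
            # the image folder; skip these list entries when chunking
            content[img_path] = [
                os.path.join(img_path, image_path)
                for image_path in os.listdir(img_path)
            ]

    return content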

The Core Converter: PDF → Markdown

This function does the heavy lifting of transforming PDFs into a format that LLMs love. Here's what's happening:

What it does:

Loops through each PDF file
Extracts the filename
Converts the PDF to markdown using pymupdf4llm
Optionally extracts and catalogs images
Returns everything as a dictionary

Why markdown?

Semantic — Preserves headings, bold text, lists, code blocks
LLM-friendly — Models understand markdown structure better than raw text
Mostly faithful — Text and structure carry over, though exact layout, fonts, and some figures or tables may not survive the conversion
Searchable — Easy to find specific sections later

Think of it like this: Converting a PDF to markdown is like transcribing a handwritten document into clean typed text — the content is the same, but now it's in a format computers and LLMs can work with easily.

python

content = convert_pdf_to_md(files)

Running the Conversion

Here's where the PDFs become markdown. The output is a dictionary where:

Keys = filename (like "The Tale of Meaningful Vectors.pdf")
Values = the entire PDF converted to markdown text

This typically takes a few seconds to a minute depending on PDF size. Once complete, you have clean, structured text ready for processing.

python

content

Inspecting the Converted Content

This cell displays what we got back from the PDF conversion. You should see:

Well-formed markdown with headings, sections, and structure intact
No gibberish or OCR errors
Readable, semantic content

Run this to verify the conversion worked well before moving to the next step. If you see formatting issues here, the later Q&A generation will suffer.
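Scrolling through raw markdown gets tedious for long documents, so a few counts per file can flag an obviously bad conversion (for example, zero headings or a suspiciously small character count). This is a rough sketch; `summarize_markdown` is a hypothetical helper and the chosen metrics are only a starting point.

```python
def summarize_markdown(content):
    """Map each document name to a few quick size and structure counts."""
    summary = {}
    for name, text in content.items():
        if not isinstance(text, str):
            continue  # skip non-text entries such as lists of image paths
        lines = text.splitlines()
        summary[name] = {
            "chars": len(text),
            "lines": len(lines),
            "headings": sum(1 for line in lines if line.startswith("#")),
        }
    return summary
```

In the notebook, `summarize_markdown(content)` gives one small dictionary per PDF to compare against what you expect from the source documents.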

python

import sys
sys.path.insert(0, os.path.abspath(".."))

Setting Up Python Path

This allows us to import modules from the parent directory. We're assuming there's a utils.py file one level up that contains helper functions like chunk_text().

Why? It keeps code organized:

Reusable utilities go in utils.py
Specific notebook logic stays in the notebook
Easy to share utilities across multiple notebooks

python

from utils import chunk_text

Importing the Chunking Function

chunk_text() is a custom utility that intelligently breaks documents into smaller pieces. It's typically built to:

Respect semantic boundaries — Splits at section breaks, not in the middle of sentences
Keep context together — Related information stays in the same chunk
Handle variable sizes — Chunks can be different sizes to preserve meaning

This is smarter than naive approaches like "split every 500 tokens" — it understands that a paragraph about one concept should stay together.
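The post does not show the contents of utils.py, so the following is only a guess at what a heading-aware chunker could look like, not the author's `chunk_text`. The name `chunk_markdown` and the `max_chars` limit are assumptions made for illustration.

```python
import re


def chunk_markdown(text, max_chars=2000):
    """Split markdown at headings; pack paragraphs when a section is too long."""
    # Zero-width split just before each heading line keeps sections intact
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: group whole paragraphs up to the limit.
        # A single paragraph longer than max_chars is kept whole.
        current = ""
        for para in section.split("\n\n"):
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Splitting on characters rather than tokens keeps the sketch dependency-free; a token-based limit would be a closer match to model context budgets.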

python

chunks = []

for value in content.values():
    chunks.extend(chunk_text(value))

Breaking Documents Into Chunks

This is where we take the full markdown content and split it into bite-sized pieces. Each chunk is:

Small enough to fit comfortably in an LLM's context window
Large enough to contain complete thoughts and ideas
Semantically coherent — related information stays together

The result is a list of text chunks, something like:

[
  "## Section 1\n\nContent here...",
  "## Section 2\n\nMore content...",
  ...
]

Each chunk will later become one Q&A pair, so if you have 50 chunks, you'll get ~50 Q&A pairs. More chunks = more comprehensive evaluation data.

Step 4: Chunking Content

Why do we need to chunk documents?

LLM context limits — Can't feed entire documents at once (context window is limited)
Focused Q&A — Smaller chunks generate more precise, grounded questions
Parallelizable — Each chunk can be processed independently
Higher quality — Focused context produces better evaluation data

Each chunk is small enough to fit in a prompt while large enough to contain complete ideas.

python

chunks

Inspecting Your Chunks

Run this to see what you're working with:

How many chunks were created?
Are they reasonable sizes?
Any empty chunks?
Is the content making sense?

This is a quality check. If you see fragmented or weird chunks, the chunk_text() function might need tuning. Better to catch issues now than after processing thousands of pairs!
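The questions above can be answered with a few numbers instead of eyeballing the list. A minimal sketch, assuming chunks are plain strings; `chunk_stats` is a hypothetical helper:

```python
from statistics import mean


def chunk_stats(chunks):
    """Summarize chunk count, empty chunks, and character-length spread."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(chunks),
        "empty": sum(1 for chunk in chunks if not chunk.strip()),
        "min_chars": min(lengths, default=0),
        "max_chars": max(lengths, default=0),
        "mean_chars": mean(lengths) if lengths else 0,
    }
```

A large gap between the minimum and maximum lengths, or any empty chunks, is usually a sign that the chunking rules need another look.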

python

from uuid import uuid4, UUID
from pydantic import BaseModel, Field

Setting Up Data Structures

We're preparing to work with structured data. Here's what we import:

UUID — Universally Unique Identifiers (guarantees each Q&A pair has a unique ID)
Pydantic — A library for defining data schemas with validation

Think of Pydantic as creating a blueprint: "Every Q&A pair must have exactly these fields, with these types, and they must be valid."

python

class QAPair(BaseModel):
    """QAPair"""

    id: UUID = Field(description="UUID")
    query: str = Field(description="User query")
    ai_response: str = Field(description="AI response")
    context: str = Field(description="Context used by AI for response")

Defining the Q&A Pair Schema

This is our blueprint for generated Q&A pairs. Each pair must have exactly these four things:

What each field means:

id — Unique identifier (UUID) for this pair
query — The question a user would ask (e.g., "What are embeddings?")
ai_response — The correct answer (grounded in the source text)
context — The exact chunk from the document this pair came from

Why define this?

Type safety — Forces the LLM to return well-formed data
Validation — Pydantic checks each field is the right type
Documentation — Makes it clear what data we're working with
Serialization — Easy to convert to JSON/CSV later

Think of it as a contract: "I'm expecting Q&A pairs that look exactly like this, no substitutions."

python

SYSTEM_PROMPT = """
You are an AI assistant specialized in creating evaluation data for retrieval‑augmented generation (RAG) systems.
You will receive a TEXT chunk.
Your task is to create a realistic, standalone question that can be answered **only** using the information in the TEXT.
Then, provide the correct answer to that question.
The question must sound like something a real user would ask (vary the phrasing: factual, comparative, list, definition, yes/no, etc.).
The answer must be entirely grounded in the TEXT – do not add external knowledge.

Output a single object matching the QAPair schema, with exactly these four keys:
- "id": a unique identifier in UUID format (for example, "123e4567-e89b-12d3-a456-426614174000").
- "query": the generated question.
- "ai_response": the ground‑truth answer.
- "context": the exact TEXT you were given (copy it verbatim).

Now generate the Pydantic object.
"""

The Magic: The System Prompt

This is prompt engineering in action. We're giving the LLM precise instructions on what to do:

The Recipe:

Context — "You're creating RAG evaluation data"
Task — "Generate questions that can ONLY be answered using the given text"
Quality — "Make them sound like real user questions; vary the types"
Grounding — "No external knowledge; stick to the text"
Output — "Return structured data matching the QAPair schema"

Why each instruction matters:

"Only using the information in the TEXT" — Prevents hallucinations. We want answers grounded in your documents, not LLM training data.
"Vary the phrasing" — Factual, comparative, list, definition, yes/no questions make better evaluation data
"Exactly these four keys" — Forces structured output we can parse and validate

This is the most important part of the whole pipeline. A good prompt → good data. A vague prompt → garbage data. Spending time here pays dividends.
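A prompt can ask for a verbatim copy of the context, but the model may still paraphrase or truncate it. One inexpensive check is to compare the returned context with the chunk that was sent, ignoring whitespace differences. This is a sketch under that assumption; `normalize` and `context_matches` are hypothetical helpers.

```python
def normalize(text):
    """Collapse all runs of whitespace so formatting differences are ignored."""
    return " ".join(text.split())


def context_matches(chunk, returned_context):
    """True when the model echoed the chunk back, modulo whitespace."""
    return normalize(chunk) == normalize(returned_context)
```

Pairs that fail this check are not necessarily wrong, but they are worth a manual look, and overwriting their context with the original chunk is a safe repair.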

Step 5: The Core Logic — Q&A Generation

This is where the magic happens. We use prompt engineering to instruct the LLM to:

Generate realistic questions — "sound like something a real user would ask"
Ensure groundedness — Answer must come from the provided text only
Vary question types — Factual, comparative, list-based, definitions, yes/no
Return structured data — Exactly matching our Pydantic schema

Why Structured Output?

Using Pydantic ensures:

Type safety — Forces consistent data structure
Validation — All fields present and correct types
Serialization — Easy conversion to JSON/CSV
Error catching — Schema mismatches fail early

python

from langchain.agents import create_agent

agent = create_agent(
    model=chat_model,
    system_prompt=SYSTEM_PROMPT,
    response_format=QAPair
)

Creating the Q&A Generation Agent

An agent is like a smart assistant that:

Follows your system prompt faithfully
Formats messages correctly for the LLM
Validates output against the QAPair schema
Handles errors gracefully

By wrapping everything in an agent, we get:

Consistency — Same behavior every time
Error handling — Knows what to do if something goes wrong
Output validation — Ensures we get valid QAPair objects back

Think of it as creating a specialized assistant whose only job is: "Take text, make Q&A pairs that follow these exact rules."

python

agent_output = []

Preparing Output Collection

We create an empty list to collect all the Q&A pairs as we generate them. After processing all chunks, this list will contain something like:

[
  {"structured_response": QAPair(...), ...},
  {"structured_response": QAPair(...), ...},
  ...
]

We'll later convert this into a DataFrame and export to CSV.

python

for chunk in chunks:
    response = agent.invoke({"messages": [{"role": "user", "content": chunk}]})
    agent_output.append(response)

The Main Loop: Generating Q&A Pairs

This loop does the actual generation. For each chunk of text:

Send it to the agent with our system prompt
The agent generates a Q&A pair using the LLM
Store the result in our output list

What's happening under the hood:

The agent takes the chunk and wraps it in a message
Sends it to the LLM (gpt-oss-120b via NVIDIA API)
Gets back structured data (a QAPair object)
Validates it against our schema
Stores it

Timeline:

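If you have 50 chunks, this runs 50 times
Each call takes ~5-10 seconds (depends on LLM latency)
Total runtime: roughly 4-8 minutes for 50 chunks, since the calls run one after another

The loop prints nothing while it runs, so add a print statement or a progress bar if you want feedback. It takes patience, but the results are worth it!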

Processing Each Chunk

The agent loop generates one Q&A pair per chunk. This approach is:

Fully automated — No manual annotation required
Scalable — Can process thousands of chunks
Repeatable — Same documents and prompt → comparable output (LLM sampling means runs are rarely identical, even with low-temperature settings)
Grounded — Each pair is tied to specific source text

As we process, we collect all responses for later analysis and export.
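As written, a single failed API call stops the loop and loses the remaining chunks. A retry wrapper with exponential backoff is one common way to soften that. The sketch below is generic: `invoke_with_retry` is a hypothetical helper, and the `sleep` parameter is injectable only so the delays can be tested without waiting.

```python
import time


def invoke_with_retry(call, payload, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `call(payload)`, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return call(payload)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Wait 1x, 2x, 4x ... the base delay before trying again
            sleep(base_delay * 2 ** attempt)
```

In the main loop this would replace the direct call, for example `invoke_with_retry(agent.invoke, {"messages": [{"role": "user", "content": chunk}]})`.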

python

agent_output

Inspecting All Outputs

This shows what we got back from all the agent calls. You'll see a list of responses, each containing a structured_response field with a QAPair object.

At this point, all generation is done. The rest is just formatting and exporting.

python

sample_output = agent_output[0]["structured_response"]
sample_output

Extracting a Sample Q&A Pair

Let's look at the first generated pair to see what we're working with. This is a quality check:

Is the question realistic?
Is the answer grounded in the context?
Does it look useful for evaluation?

Running this extracts the QAPair object from the first agent response.

python

print(f"ID: {sample_output.id}")
print(f"Query: {sample_output.query}")
print(f"Answer: {sample_output.ai_response}")
print(f"Context: {sample_output.context}")

Pretty-Printing the Sample

This makes the sample pair human-readable. You get:

ID — Unique identifier
Query — The generated question
Answer — The grounded answer
Context — The source text it came from

This is what one evaluation pair looks like. Inspect it carefully to ensure quality before processing your full dataset.

python

sample_output.model_dump()

Converting to Dictionary

model_dump() converts the Pydantic object into a plain Python dictionary. This is necessary for:

Converting to JSON
Putting into a DataFrame
Exporting to CSV

It transforms:

QAPair(id=UUID(...), query="...", ai_response="...", context="...")

Into:

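{"id": UUID("..."), "query": "...", "ai_response": "...", "context": "..."}

Much more portable! Note that id is still a UUID object at this point; call model_dump(mode="json") if you need it as a plain string.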
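One detail to watch when going from a dictionary to JSON: the standard `json` module cannot serialize a `UUID` object on its own. The sketch below shows a stdlib-only workaround using `default=str`; the record contents are made-up placeholder values.

```python
import json
from uuid import UUID

# Placeholder record shaped like a dumped Q&A pair
record = {
    "id": UUID("12345678-1234-5678-1234-567812345678"),
    "query": "What are embeddings?",
    "ai_response": "Vectors that represent text.",
    "context": "Embeddings are vectors that represent text.",
}


def to_json(row):
    """Serialize a row, converting non-JSON types such as UUID via str()."""
    return json.dumps(row, default=str)
```

The same conversion happens implicitly when the data is written to CSV, where every value ends up as text anyway.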

python

import pandas as pd

Importing Pandas

Pandas is the Python data analysis library. We're about to organize all our Q&A pairs into a structured table (DataFrame), which makes it easy to:

Inspect and filter data
Export to CSV
Analyze statistics
Share with evaluation frameworks

python

data = [item["structured_response"].model_dump() for item in agent_output]
data

Converting All Q&A Pairs to Dictionaries

This is a list comprehension that extracts every structured_response from all agent outputs and converts each to a dictionary.

Result: A list of plain dictionaries, ready to become rows in a DataFrame:

[
  {"id": "...", "query": "...", "ai_response": "...", "context": "..."},
  {"id": "...", "query": "...", "ai_response": "...", "context": "..."},
  ...
]

python

df = pd.DataFrame(data)
df.head()

Creating a DataFrame

This transforms our list of dictionaries into a structured table. Running df.head() shows you the first 5 rows, which looks something like:

id	query	ai_response	context
UUID-1	What is...?	The answer is...	Full chunk text...
UUID-2	How does...?	It works by...	Full chunk text...
UUID-3	Why do...?	Because...	Full chunk text...

This is your evaluation dataset in tabular form. Beautiful, structured, and ready for analysis or export.

python

from datetime import datetime

def get_datetime():
    """Get datetime"""
    return datetime.now().strftime("%Y%m%d_%H%M%S")

Creating Timestamped Filenames

This helper generates a timestamp like 20260523_143022 (May 23, 2026 at 14:30:22).

Why?

Unique filenames — Never accidentally overwrite previous results
Audit trail — See exactly when each dataset was generated
Versioning — Easy to compare different runs side-by-side

Each run produces a new file, so you can:

Iterate on prompts and compare results
Keep a history of datasets
Track improvements over time
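A timestamp with one-second resolution is unique enough for manual runs, but two automated runs can land in the same second. Appending a short random suffix is one way around that. This is a sketch, not the notebook's helper: `run_filename` and its parameters are hypothetical.

```python
import secrets
from datetime import datetime


def run_filename(now=None, suffix=None):
    """Build a name like 20260523_143022_ab12.csv from a timestamp and suffix."""
    now = now or datetime.now()
    suffix = suffix or secrets.token_hex(2)  # 4 random hex characters
    return f"{now.strftime('%Y%m%d_%H%M%S')}_{suffix}.csv"
```

The names still sort chronologically, since the timestamp comes first.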

Step 6: Export & Versioning

We save the evaluation dataset to CSV with timestamped filenames. Benefits:

Portable — Works with any tool (Excel, Python, evaluation frameworks)
Version controlled — Timestamped files prevent accidental overwrites
Human readable — Easy to inspect, filter, and validate
Audit trail — Can track what data was generated when

Each run produces a unique file like 20260523_143022.csv, making it easy to iterate and compare results.

python

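os.makedirs("../data/eval-sets", exist_ok=True)  # make sure the output folder exists
df.to_csv(f"../data/eval-sets/{get_datetime()}.csv", index=False)  # omit the row-index column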

Exporting Your Dataset

This saves your entire DataFrame to a CSV file with a timestamped name like:

../data/eval-sets/20260523_143022.csv

What you get:

A standard CSV file (opens in Excel, Google Sheets, Python, etc.)
One row per Q&A pair
Columns: id, query, ai_response, context
Ready for import into evaluation frameworks (DeepEval, Ragas, etc.)
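Before handing the file to another tool, it is worth reading it back once to confirm the columns survived the round trip, since chunk text often contains commas and line breaks. A minimal sketch using only the standard library; `load_eval_set` is a hypothetical helper.

```python
import csv

EXPECTED_COLUMNS = ["id", "query", "ai_response", "context"]


def load_eval_set(path):
    """Read an exported CSV and verify the expected columns are present."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in EXPECTED_COLUMNS if col not in header]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        return list(reader)
```

Extra columns are tolerated, so the check also passes for files that include a row-index column.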

Next steps with this data:

Inspect it — Open in Excel, verify quality
Evaluate — Use with RAG evaluation frameworks
Iterate — Adjust prompts, regenerate, compare results
Version control — Store alongside your model versions
Improve — Use results to refine your RAG pipeline

Congratulations! You've automatically generated a dataset that would take hours to create manually. 🎉

Summary: Why This Approach Matters

The Complete Pipeline

PDFs → Markdown → Chunks → LLM (Agent) → Q&A Pairs → DataFrame → CSV

Key Advantages

Fully automated — Minimal manual effort required
Grounded — Every answer is tied to source text (which reduces, but does not eliminate, hallucinations)
Scalable — Process hundreds or thousands of documents
Repeatable — Same inputs and prompt produce comparable outputs
Structured — Pydantic ensures data quality and consistency
Exportable — CSV works with any evaluation framework

Why This Matters for RAG

Evaluation is critical — How do you know if your RAG system works?
Ground truth is expensive — This automates the creation process
Scale matters — Hundreds of test cases catch edge cases and failure modes
Grounding reduces hallucinations — Each answer carries its source chunk, so it can be checked against the document
Repeatability — You can re-generate comparable datasets for regression testing, or reuse a saved CSV when you need identical test cases

Next Steps

Integrate with evaluation frameworks — Use DeepEval or Ragas to assess RAG quality
Add question variety filters — Don't just generate factual questions; include comparative, analytical, etc.
Implement quality gates — Filter out low-quality pairs programmatically
Version control datasets — Store evaluation CSVs alongside model versions
Feedback loops — Use evaluation results to improve your RAG retrieval/generation
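As a starting point for the quality gates mentioned above, the sketch below drops pairs with empty fields and repeated questions. The rules are deliberately simple and are an assumption about what "low quality" means here; `filter_pairs` is a hypothetical helper that works on the list of dictionaries built earlier.

```python
def filter_pairs(rows):
    """Keep rows with all fields filled in and a question not seen before."""
    kept, seen = [], set()
    for row in rows:
        query = (row.get("query") or "").strip()
        answer = (row.get("ai_response") or "").strip()
        context = (row.get("context") or "").strip()
        if not (query and answer and context):
            continue  # incomplete pair
        key = query.lower()
        if key in seen:
            continue  # same question asked again, ignoring case
        seen.add(key)
        kept.append(row)
    return kept
```

Short answers are kept on purpose, because the prompt explicitly allows yes/no questions.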

This approach transforms a tedious manual task into a reliable, scalable process for continuous RAG evaluation. 🚀
