Synthesizing Evaluation Data for RAG Systems: A Deep Dive

May 23, 2026 · 16 min read · By Mohammed Vasim
Tags: RAG, LLM, data-synthesis, evaluation, langchain, nvidia-api

The Problem We're Solving

Evaluating RAG (Retrieval-Augmented Generation) systems is challenging. You need:

  • Ground truth Q&A pairs that accurately reflect your documents
  • Realistic questions users would actually ask
  • Context to verify answers were derived from documents, not hallucinations
  • Scale — manually creating test cases is impractical

This notebook automates all of this by leveraging LLMs to generate contextually relevant questions and answers from PDF documents.

The Pipeline

PDFs → Markdown → Chunks → LLM (Agent) → Q&A Pairs → DataFrame → CSV

Each step serves a specific purpose in creating high-quality evaluation data automatically.

python
!pip install "chromadb>=1.4.1" "deepeval>=3.8.3" "ipykernel>=7.2.0" "ipywidgets>=8.1.8" "langchain>=1.2.7" "langchain-community>=0.4.1" "langchain-core>=1.2.7" "langchain-openai>=1.1.7" "langgraph>=1.0.7" "pandas>=3.0.3" "pymupdf4llm>=1.27.2.3" "pypdf>=6.6.2" "tiktoken>=0.12.0"

Step 1: Dependencies & Setup

We need specific tools for our pipeline. Let's install them with minimum version bounds so everyone starts from a known baseline.

What We're Installing

Think of this as gathering all the tools you need before starting a big project. We're installing:

  • LangChain ecosystem — Tools to build LLM workflows (LangChain, LangChain-OpenAI, LangChain-Community)
  • PDF Processing — PyMuPDF4LLM (converts PDFs to clean markdown) and PyPDF (for PDF manipulation)
  • Data Handling — Pandas for organizing results into tables
  • Vector Databases — ChromaDB (though we won't use it here, good to have)
  • Evaluation — DeepEval for assessing LLM output quality

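The version specifiers (like >=1.4.1) set a minimum version rather than an exact pin, so a fresh install a year from now may resolve to newer releases. For strict reproducibility, pin exact versions with == or commit a lock file; that is what protects you from mysterious bugs caused by library updates.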
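
To know exactly which versions a given run used, you can record what is actually installed. The following is a minimal sketch using the standard library's importlib.metadata; the helper name snapshot_versions and the package list are illustrative assumptions, not part of the original notebook.

```python
from importlib import metadata


def snapshot_versions(packages):
    """Return {package name: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is missing from this environment
            versions[name] = None
    return versions


# Example: record a few of the packages this notebook relies on
print(snapshot_versions(["langchain", "pandas", "pymupdf4llm"]))
```

Saving this dictionary next to each exported dataset makes it easier to explain differences between runs later.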

python
"""Implementing data synthesizing for RAG evaluation"""

import os

Initializing the Environment

Here we're just importing the essentials and setting up our workspace. We start with:

  • os — To read environment variables (like our API key)

Think of this as opening your toolkit and checking that everything is at hand before we start working.

Step 2: API Configuration

We use NVIDIA's API instead of OpenAI directly for several reasons:

  • Cost: Cheaper endpoints for development
  • Free tier: Available for experimentation
  • Model access: Open-weight models like gpt-oss-120b
  • Embeddings: Optimized embedding models like nv-embed-v1

The OpenAI client library is compatible with NVIDIA's API, so we just change the base URL.

Setting Up Your API Keys & Model Names

Think of this as configuring your delivery address. We're telling the system:

  • Where to send requests — NVIDIA's API endpoint (not OpenAI's default)
  • How to authenticate — Your API key from the environment
  • Which models to use — Embeddings model for creating vectors, chat model for generating Q&A

Why NVIDIA instead of OpenAI? They offer:

  • Lower costs: Especially useful for development and batch processing
  • Open-weight models: Such as gpt-oss-120b
  • Optimized embeddings: NVIDIA's nv-embed-v1 is designed for retrieval and semantic search

Your API key is stored as an environment variable (NVIDIA_API_KEY) for security — never hardcode secrets in notebooks!
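
A missing key is easier to debug if it fails immediately rather than at the first API call. Here is a small sketch of that idea; require_env is a hypothetical helper name introduced for illustration, not something the notebook defines.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (commented out so the sketch runs without a real key):
# API_KEY = require_env("NVIDIA_API_KEY")
```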

python
BASE_URL = "https://integrate.api.nvidia.com/v1"
API_KEY = os.getenv("NVIDIA_API_KEY")
EMBED_MODEL_NAME = "nvidia/nv-embed-v1"
CHAT_MODEL_NAME = "openai/gpt-oss-120b"

python
from langchain_openai import ChatOpenAI

# Chat client pointed at NVIDIA's OpenAI-compatible endpoint
chat_model = ChatOpenAI(
    model=CHAT_MODEL_NAME,
    api_key=API_KEY,
    base_url=BASE_URL
)

Creating Your Chat Model Object

Here's where we actually connect to the API. ChatOpenAI from LangChain is a wrapper that:

  • Abstracts away the complexity of making API calls
  • Handles message formatting automatically
  • Provides a clean Python interface

We're pointing it to NVIDIA's API by overriding the base_url. This gives us the familiar OpenAI interface while the requests are served by models hosted on NVIDIA's endpoint.

Think of it as a translator who speaks OpenAI's API dialect but forwards your requests to NVIDIA's servers instead.

python
chat_model.invoke("hello")

Testing the Connection

Before we process hundreds of documents, let's make sure everything works. This is a smoke test — a quick sanity check that:

  • The API credentials are valid
  • The network connection works
  • The model endpoint is reachable

If this fails, we catch the error early before wasting time. It's like testing your car starts before driving cross-country.

python
files = [
    "/Users/vasim/Downloads/Data Engineering for Foundation Models: The Alchemist’s Cookbook | Mohammed Vasim.pdf",
    "/Users/vasim/Downloads/The Tale of Meaningful Vectors: Contrastive Learning for Text Embeddings, Told with Pen and Paper | Mohammed Vasim.pdf",
]

Specifying Your Input Documents

Here we define which PDFs to mine for Q&A pairs. In this example, we're using:

  • Two technical papers about machine learning and data engineering

Key insight: The quality and relevance of your evaluation data depends on the documents you choose. If your PDFs are about data engineering, you'll generate Q&A pairs that are hyper-relevant for evaluating a data engineering RAG system.

Change this list to point to your own domain documents for domain-specific evaluation data.
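
Since the paths are hardcoded, a quick existence check before conversion can save a confusing error later. This is a minimal sketch; find_missing_pdfs is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def find_missing_pdfs(paths):
    """Return the paths that do not point to an existing .pdf file."""
    missing = []
    for p in paths:
        path = Path(p)
        if not (path.is_file() and path.suffix.lower() == ".pdf"):
            missing.append(p)
    return missing


# Usage with the notebook's list:
# missing = find_missing_pdfs(files)
# if missing:
#     raise FileNotFoundError(f"Missing input PDFs: {missing}")
```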

python
import random
import string
# Import the layout submodule explicitly; a bare `pymupdf.layout` attribute
# access does not import it
import pymupdf.layout

import pymupdf4llm

Importing PDF & Utility Libraries

We're gathering the tools we need for PDF processing:

  • pymupdf — Lower-level PDF manipulation (gives us granular control)
  • pymupdf4llm — Higher-level API that converts PDFs directly to markdown (perfect for LLMs)
  • random & string — For generating random IDs (useful for tracking generated pairs)

The beauty of pymupdf4llm is that it treats PDFs as documents laid out for human readers, so it tries to preserve structure, headings, and layout in the markdown output instead of producing a flat text dump.

Step 3: PDF to Markdown Conversion

Why convert PDFs to markdown instead of raw text?

  • LLM-friendly — Markdown is semantic and preserves structure
  • Preserves hierarchy — Headings and sections remain meaningful
  • Handles images — Can optionally extract and catalog images
  • Better extraction — Preserves relationships between content

This transformation is crucial because LLMs work better with structured, semantic content.

Utility: Generating Random IDs

This is a simple helper function that creates short random strings. Why?

When you generate hundreds of Q&A pairs, having unique IDs helps with:

  • Tracking — Know which pair came from which run
  • Deduplication — If you re-run the pipeline, you can filter out pairs you've already generated
  • Audit trails — Keep a record of what was generated when

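The function creates strings like "A7K9P2", which work well as human-readable identifiers. With only six characters, collisions are unlikely but possible, so treat them as labels rather than guaranteed-unique keys.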

python
def generate_rand_string(length=6):
    """Generate a random uppercase alphanumeric string."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length)).upper()

print(generate_rand_string())

python
def convert_pdf_to_md(files: list[str], write_images: bool = False):
    """Convert pdf to markdown"""
    content = {}

    for file in files:
        filename = os.path.basename(file)
        img_path = f"../data/images/{filename}"
        content[filename] = pymupdf4llm.to_markdown(
            file, image_path=img_path, write_images=write_images
        )

        if write_images:
            content[img_path] = [
                os.path.join(img_path, image_path)
                for image_path in os.listdir(img_path)
            ]

    return content

The Core Converter: PDF → Markdown

This function does the heavy lifting of transforming PDFs into a format that LLMs love. Here's what's happening:

What it does:

  1. Loops through each PDF file
  2. Extracts the filename
  3. Converts the PDF to markdown using pymupdf4llm
  4. Optionally extracts and catalogs images
  5. Returns everything as a dictionary

Why markdown?

  • Semantic: Preserves headings, bold text, lists, code blocks
  • LLM-friendly: Models understand markdown structure better than raw text
  • Mostly faithful: The text content carries over, though complex layouts, tables, and figures can lose detail in the conversion
  • Searchable: Easy to find specific sections later

Think of it like this: Converting a PDF to markdown is like transcribing a handwritten document into clean typed text — the content is the same, but now it's in a format computers and LLMs can work with easily.

python
content = convert_pdf_to_md(files)

Running the Conversion

Here's where the PDFs become markdown. The output is a dictionary where:

  • Keys = filename (like "The Tale of Meaningful Vectors.pdf")
  • Values = the entire PDF converted to markdown text

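This typically takes a few seconds to a minute depending on PDF size. Once complete, you have clean, structured text ready for processing. Note that if you call the function with write_images=True, the dictionary also gains one extra key per image folder whose value is a list of image paths, so filter those entries out before chunking.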

python
content

Inspecting the Converted Content

This cell displays what we got back from the PDF conversion. You should see:

  • Well-formed markdown with headings, sections, and structure intact
  • No gibberish or OCR errors
  • Readable, semantic content

Run this to verify the conversion worked well before moving to the next step. If you see formatting issues here, the later Q&A generation will suffer.

python
import sys
sys.path.insert(0, os.path.abspath(".."))

Setting Up Python Path

This allows us to import modules from the parent directory. We're assuming there's a utils.py file one level up that contains helper functions like chunk_text().

Why? It keeps code organized:

  • Reusable utilities go in utils.py
  • Specific notebook logic stays in the notebook
  • Easy to share utilities across multiple notebooks

python
from utils import chunk_text

Importing the Chunking Function

chunk_text() is a custom utility that intelligently breaks documents into smaller pieces. It's typically built to:

  • Respect semantic boundaries — Splits at section breaks, not in the middle of sentences
  • Keep context together — Related information stays in the same chunk
  • Handle variable sizes — Chunks can be different sizes to preserve meaning

This is smarter than naive approaches like "split every 500 tokens" — it understands that a paragraph about one concept should stay together.
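
The notebook does not show utils.py, so the real chunk_text() is unknown. As an illustration of the idea only, here is one way a heading-aware splitter could look. The function name chunk_text_sketch, the max_chars parameter, and the splitting rules are all assumptions, not the author's implementation.

```python
import re


def chunk_text_sketch(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split markdown at headings, then cap chunk size by packing paragraphs."""
    # Split just before every line that starts with 1-6 '#' characters and a space
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: pack whole paragraphs greedily up to max_chars.
        # A single paragraph longer than max_chars is kept intact.
        current = ""
        for para in section.split("\n\n"):
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Splitting on headings first keeps each section together; the paragraph packing only kicks in when a section is too long for one chunk.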

python
chunks = []

for value in content.values():
    chunks.extend(chunk_text(value))

Breaking Documents Into Chunks

This is where we take the full markdown content and split it into bite-sized pieces. Each chunk is:

  • Small enough to fit comfortably in an LLM's context window
  • Large enough to contain complete thoughts and ideas
  • Semantically coherent — related information stays together

The result is a list of text chunks, something like:

[ "## Section 1\n\nContent here...", "## Section 2\n\nMore content...", ... ]

Each chunk will later become one Q&A pair, so if you have 50 chunks, you'll get ~50 Q&A pairs. More chunks = more comprehensive evaluation data.

Step 4: Chunking Content

Why do we need to chunk documents?

  • LLM context limits — Can't feed entire documents at once (context window is limited)
  • Focused Q&A — Smaller chunks generate more precise, grounded questions
  • Parallelizable — Each chunk can be processed independently
  • Higher quality — Focused context produces better evaluation data

Each chunk is small enough to fit in a prompt while large enough to contain complete ideas.

python
chunks

Inspecting Your Chunks

Run this to see what you're working with:

  • How many chunks were created?
  • Are they reasonable sizes?
  • Any empty chunks?
  • Is the content making sense?

This is a quality check. If you see fragmented or weird chunks, the chunk_text() function might need tuning. Better to catch issues now than after processing thousands of pairs!
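
Eyeballing a long list of chunks gets tedious, so a few summary numbers can answer the questions above quickly. A minimal sketch follows; summarize_chunks is a hypothetical helper name.

```python
from statistics import mean


def summarize_chunks(chunks):
    """Return basic size statistics for a list of text chunks."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(chunks),
        # Chunks that are empty or whitespace-only
        "empty": sum(1 for c in chunks if not c.strip()),
        "min_chars": min(lengths, default=0),
        "max_chars": max(lengths, default=0),
        "mean_chars": round(mean(lengths), 1) if lengths else 0,
    }


print(summarize_chunks(["## A\n\nsome text", "", "## B\n\nmore text here"]))
```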

python
from uuid import uuid4, UUID
from pydantic import BaseModel, Field

Setting Up Data Structures

We're preparing to work with structured data. Here's what we import:

  • UUID — Universally Unique Identifiers (guarantees each Q&A pair has a unique ID)
  • Pydantic — A library for defining data schemas with validation

Think of Pydantic as creating a blueprint: "Every Q&A pair must have exactly these fields, with these types, and they must be valid."

python
class QAPair(BaseModel):
    """QAPair"""

    id: UUID = Field(description="UUID")
    query: str = Field(description="User query")
    ai_response: str = Field(description="AI response")
    context: str = Field(description="Context used by AI for response")

Defining the Q&A Pair Schema

This is our blueprint for generated Q&A pairs. Each pair must have exactly these four things:

What each field means:

  • id: Unique identifier (UUID) for this pair
  • query: The question a user would ask (e.g., "What are embeddings?")
  • ai_response: The correct answer (grounded in the source text)
  • context: The exact chunk from the document this pair came from

Why define this?

  1. Type safety — Forces the LLM to return well-formed data
  2. Validation — Pydantic checks each field is the right type
  3. Documentation — Makes it clear what data we're working with
  4. Serialization — Easy to convert to JSON/CSV later

Think of it as a contract: "I'm expecting Q&A pairs that look exactly like this, no substitutions."

python
SYSTEM_PROMPT = """
You are an AI assistant specialized in creating evaluation data for retrieval‑augmented generation (RAG) systems.
You will receive a TEXT chunk.
Your task is to create a realistic, standalone question that can be answered **only** using the information in the TEXT.
Then, provide the correct answer to that question.
The question must sound like something a real user would ask (vary the phrasing: factual, comparative, list, definition, yes/no, etc.).
The answer must be entirely grounded in the TEXT – do not add external knowledge.

Output a single structured object with exactly these four keys:
- "id": a unique identifier formatted as a UUID (version 4) string.
- "query": the generated question.
- "ai_response": the ground-truth answer.
- "context": the exact TEXT you were given (copy it verbatim).

Now generate the structured object.
"""

The Magic: The System Prompt

This is prompt engineering in action. We're giving the LLM precise instructions on what to do:

The Recipe:

  1. Context — "You're creating RAG evaluation data"
  2. Task — "Generate questions that can ONLY be answered using the given text"
  3. Quality — "Make them sound like real user questions; vary the types"
  4. Grounding — "No external knowledge; stick to the text"
  5. Output — "Return structured data matching the QAPair schema"

Why each instruction matters:

  • "Only using the information in the TEXT" — Prevents hallucinations. We want answers grounded in your documents, not LLM training data.
  • "Vary the phrasing" — Factual, comparative, list, definition, yes/no questions make better evaluation data
  • "Exactly these four keys" — Forces structured output we can parse and validate

This is the most important part of the whole pipeline. A good prompt → good data. A vague prompt → garbage data. Spending time here pays dividends.

Step 5: The Core Logic — Q&A Generation

This is where the magic happens. We use prompt engineering to instruct the LLM to:

  • Generate realistic questions — "sound like something a real user would ask"
  • Ensure groundedness — Answer must come from the provided text only
  • Vary question types — Factual, comparative, list-based, definitions, yes/no
  • Return structured data — Exactly matching our Pydantic schema

Why Structured Output?

Using Pydantic ensures:

  • Type safety: Forces consistent data structure
  • Validation: All fields present and correct types
  • Serialization: Easy conversion to JSON/CSV
  • Error catching: Schema mismatches fail early

python
from langchain.agents import create_agent

agent = create_agent(
    model=chat_model,
    system_prompt=SYSTEM_PROMPT,
    response_format=QAPair
)

Creating the Q&A Generation Agent

An agent is like a smart assistant that:

  • Follows your system prompt faithfully
  • Formats messages correctly for the LLM
  • Validates output against the QAPair schema
  • Handles errors gracefully

By wrapping everything in an agent, we get:

  • Consistency: The same prompt and schema are applied to every chunk
  • Error handling: Schema mismatches are caught at generation time rather than later in the pipeline
  • Output validation: Ensures we get valid QAPair objects back

Think of it as creating a specialized assistant whose only job is: "Take text, make Q&A pairs that follow these exact rules."

python
agent_output = []

Preparing Output Collection

We create an empty list to collect all the Q&A pairs as we generate them. After processing all chunks, this list will contain something like:

[ {"structured_response": QAPair(...), ...}, {"structured_response": QAPair(...), ...}, ... ]

We'll later convert this into a DataFrame and export to CSV.

python
for chunk in chunks:
    response = agent.invoke({"messages": [{"role": "user", "content": chunk}]})
    agent_output.append(response)

The Main Loop: Generating Q&A Pairs

This is where the magic happens. For each chunk of text:

  1. Send it to the agent with our system prompt
  2. The agent generates a Q&A pair using the LLM
  3. Store the result in our output list

What's happening under the hood:

  • The agent takes the chunk and wraps it in a message
  • Sends it to the LLM (gpt-oss-120b via NVIDIA API)
  • Gets back structured data (a QAPair object)
  • Validates it against our schema
  • Stores it

Timeline:

  • If you have 50 chunks, this runs 50 times
  • Each call takes ~5-10 seconds (depends on LLM latency)
  • Total runtime: ~4-8 minutes for 50 chunks (50 × 5-10 seconds)

The loop prints nothing while it runs, so expect a quiet wait. The results are worth it!
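
For long runs, a wrapper that retries failed calls and reports progress can make the loop more forgiving. The sketch below is written against a generic callable so it runs without an API key; generate_with_retries, max_attempts, and base_delay are illustrative names and defaults, not part of the notebook.

```python
import time


def generate_with_retries(chunks, invoke, max_attempts=3, base_delay=1.0):
    """Call invoke(chunk) for each chunk, retrying failures with exponential backoff.

    Returns (results, failures), where failures holds (chunk index, error message).
    """
    results, failures = [], []
    for index, chunk in enumerate(chunks):
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(invoke(chunk))
                break
            except Exception as exc:  # broad on purpose: API or validation errors
                if attempt == max_attempts:
                    failures.append((index, str(exc)))
                else:
                    # Wait base_delay, then 2x, 4x, ... before the next attempt
                    time.sleep(base_delay * 2 ** (attempt - 1))
        print(f"processed {index + 1}/{len(chunks)}")
    return results, failures
```

With the notebook's agent, invoke could be a small function such as lambda c: agent.invoke({"messages": [{"role": "user", "content": c}]}), and the returned results list would play the role of agent_output.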

Processing Each Chunk

The agent loop generates one Q&A pair per chunk. This approach is:

  • Fully automated: No manual annotation required
  • Scalable: Can process thousands of chunks
  • Reproducible: Same documents and settings give similar output, though LLM sampling means exact matches are not guaranteed
  • Grounded: Each pair is tied to specific source text

As we process, we collect all responses for later analysis and export.

python
agent_output

Inspecting All Outputs

This shows what we got back from all the agent calls. You'll see a list of responses, each containing a structured_response field with a QAPair object.

At this point, all generation is done. The rest is just formatting and exporting.

python
sample_output = agent_output[0]["structured_response"]
sample_output

Extracting a Sample Q&A Pair

Let's look at the first generated pair to see what we're working with. This is a quality check:

  • Is the question realistic?
  • Is the answer grounded in the context?
  • Does it look useful for evaluation?

Running this extracts the QAPair object from the first agent response.

python
print(f"ID: {sample_output.id}")
print(f"Query: {sample_output.query}")
print(f"Answer: {sample_output.ai_response}")
print(f"Context: {sample_output.context}")

Pretty-Printing the Sample

This makes the sample pair human-readable. You get:

  • ID — Unique identifier
  • Query — The generated question
  • Answer — The grounded answer
  • Context — The source text it came from

This is what one evaluation pair looks like. Inspect it carefully to ensure quality before processing your full dataset.
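
Manual inspection does not scale past a handful of pairs, so a rough automatic signal can help you decide which pairs to read first. The sketch below computes simple word overlap between an answer and its context. It is a crude heuristic chosen for illustration, not a groundedness metric from any evaluation framework, and token_overlap is a hypothetical name.

```python
import re


def token_overlap(answer: str, context: str) -> float:
    """Fraction of distinct answer words that also occur in the context (0.0 to 1.0)."""

    def tokenize(text):
        # Lowercase alphanumeric word set; punctuation is ignored
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokenize(context)) / len(answer_tokens)


# Usage with a generated pair:
# score = token_overlap(sample_output.ai_response, sample_output.context)
```

A low score does not prove an answer is wrong (paraphrases score low too), but it is a cheap way to flag pairs for a closer look.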

python
sample_output.model_dump()

Converting to Dictionary

model_dump() converts the Pydantic object into a plain Python dictionary. This is necessary for:

  • Converting to JSON
  • Putting into a DataFrame
  • Exporting to CSV

It transforms:

QAPair(id=UUID(...), query="...", ai_response="...", context="...")

Into:

{"id": "...", "query": "...", "ai_response": "...", "context": "..."}

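Much more portable and serializable! Note that in this default mode the id stays a UUID object; call model_dump(mode="json") if you want it as a plain string.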

python
import pandas as pd

Importing Pandas

Pandas is the Python data analysis library. We're about to organize all our Q&A pairs into a structured table (DataFrame), which makes it easy to:

  • Inspect and filter data
  • Export to CSV
  • Analyze statistics
  • Share with evaluation frameworks

python
data = [item["structured_response"].model_dump() for item in agent_output]
data

Converting All Q&A Pairs to Dictionaries

This is a list comprehension that extracts every structured_response from all agent outputs and converts each to a dictionary.

Result: A list of plain dictionaries, ready to become rows in a DataFrame:

[ {"id": "...", "query": "...", "ai_response": "...", "context": "..."}, {"id": "...", "query": "...", "ai_response": "...", "context": "..."}, ... ]

python
df = pd.DataFrame(data)
df.head()

Creating a DataFrame

This transforms our list of dictionaries into a structured table. Running df.head() shows you the first 5 rows, which looks something like:

id     | query        | ai_response      | context
UUID-1 | What is...?  | The answer is... | Full chunk text...
UUID-2 | How does...? | It works by...   | Full chunk text...
UUID-3 | Why do...?   | Because...       | Full chunk text...

This is your evaluation dataset in tabular form. Beautiful, structured, and ready for analysis or export.

python
from datetime import datetime

def get_datetime():
    """Get datetime"""
    return datetime.now().strftime("%Y%m%d_%H%M%S")

Creating Timestamped Filenames

This helper generates a timestamp like 20260523_143022 (May 23, 2026 at 14:30:22).

Why?

  • Unique filenames — Never accidentally overwrite previous results
  • Audit trail — See exactly when each dataset was generated
  • Versioning — Easy to compare different runs side-by-side

Each run produces a new file, so you can:

  • Iterate on prompts and compare results
  • Keep a history of datasets
  • Track improvements over time

Step 6: Export & Versioning

We save the evaluation dataset to CSV with timestamped filenames. Benefits:

  • Portable — Works with any tool (Excel, Python, evaluation frameworks)
  • Version controlled — Timestamped files prevent accidental overwrites
  • Human readable — Easy to inspect, filter, and validate
  • Audit trail — Can track what data was generated when

Each run produces a unique file like 20260523_143022.csv, making it easy to iterate and compare results.

python
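os.makedirs("../data/eval-sets", exist_ok=True)  # make sure the output folder exists
df.to_csv(f"../data/eval-sets/{get_datetime()}.csv", index=False)  # skip the index column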

Exporting Your Dataset

This saves your entire DataFrame to a CSV file with a timestamped name like:

../data/eval-sets/20260523_143022.csv

What you get:

  • A standard CSV file (opens in Excel, Google Sheets, Python, etc.)
  • One row per Q&A pair
  • Columns: id, query, ai_response, context
  • Ready for import into evaluation frameworks (DeepEval, Ragas, etc.)

Next steps with this data:

  1. Inspect it — Open in Excel, verify quality
  2. Evaluate — Use with RAG evaluation frameworks
  3. Iterate — Adjust prompts, regenerate, compare results
  4. Version control — Store alongside your model versions
  5. Improve — Use results to refine your RAG pipeline

Congratulations! You've automatically generated a dataset that would take hours to create manually. 🎉

Summary: Why This Approach Matters

The Complete Pipeline

PDFs → Markdown → Chunks → LLM (Agent) → Q&A Pairs → DataFrame → CSV

Key Advantages

  1. Fully automated: Minimal manual effort required
  2. Grounded: Every answer is tied to source text, which reduces the risk of hallucinated answers
  3. Scalable: Process hundreds or thousands of documents
  4. Reproducible: The same inputs and settings produce comparable outputs
  5. Structured: Pydantic enforces a consistent schema
  6. Exportable: CSV works with any evaluation framework

Why This Matters for RAG

  • Evaluation is critical: How do you know if your RAG system works?
  • Ground truth is expensive: This automates the creation process
  • Scale matters: Hundreds of test cases catch edge cases and failure modes
  • Grounding limits hallucinations: Each answer ships with its source chunk, so you can verify it against the document
  • Reproducibility: You can re-run the pipeline on the same documents for regression testing, and the timestamped CSVs let you reuse an exact dataset

Next Steps

  1. Integrate with evaluation frameworks — Use DeepEval or Ragas to assess RAG quality
  2. Add question variety filters — Don't just generate factual questions; include comparative, analytical, etc.
  3. Implement quality gates — Filter out low-quality pairs programmatically
  4. Version control datasets — Store evaluation CSVs alongside model versions
  5. Feedback loops — Use evaluation results to improve your RAG retrieval/generation
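
As a starting point for the quality gates in step 3, here is a minimal sketch that works on the list of dictionaries produced by model_dump(). The thresholds and the function name apply_quality_gates are assumptions you would tune for your own data; only the query and ai_response keys come from the QAPair schema.

```python
def apply_quality_gates(rows, min_query_chars=10, min_answer_chars=3):
    """Keep rows that pass simple checks; return (kept, rejected_with_reason)."""
    kept, rejected, seen_queries = [], [], set()
    for row in rows:
        query = row.get("query", "").strip()
        answer = row.get("ai_response", "").strip()
        # Normalize case and whitespace so near-identical questions count as duplicates
        normalized = " ".join(query.lower().split())
        if len(query) < min_query_chars:
            rejected.append((row, "query too short"))
        elif len(answer) < min_answer_chars:
            rejected.append((row, "answer too short"))
        elif normalized in seen_queries:
            rejected.append((row, "duplicate query"))
        else:
            seen_queries.add(normalized)
            kept.append(row)
    return kept, rejected


# Usage with the notebook's data list:
# kept, rejected = apply_quality_gates(data)
```

Keeping the rejection reason alongside each dropped row makes it easier to see whether the prompt or the chunking is the source of weak pairs.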

This approach transforms a tedious manual task into a reliable, scalable process for continuous RAG evaluation. 🚀
