RAG Eval Dataset Generator

Tags: agentic-data-preparation, dataset-preparation, streamlit, rag, data-synthesis

A tool that automatically generates evaluation datasets for Retrieval-Augmented Generation (RAG) systems. It turns your PDF documents into structured Q&A pairs that you can use to test, benchmark, and improve your RAG applications.

The Problem

Evaluating RAG systems is hard. You need:

Ground truth Q&A pairs that accurately reflect your documents
Realistic questions users would actually ask
Context to verify answers were derived from documents, not hallucinations
Scale — manually creating test cases is impractical

Creating evaluation data manually is time-consuming, error-prone, and doesn't scale. This tool automates the entire process.

How It Works

The pipeline transforms documents into evaluation-ready datasets:

PDFs → Markdown → Chunks → LLM (Agent) → Q&A Pairs → DataFrame → CSV

Step 1: PDF to Markdown Conversion

Converts PDF documents into clean, structured markdown using PyMuPDF4LLM. This preserves:

Document hierarchy and headings
Semantic structure
Code blocks and formatting
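
A minimal sketch of this step, assuming the conversion is a single call to pymupdf4llm.to_markdown per file. The function names pdfs_to_markdown and default_converter, and the injectable converter parameter, are hypothetical and not taken from the project; the parameter exists only so the function can be tried without the library installed.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def default_converter(pdf_path: str) -> str:
    """Convert one PDF to Markdown with PyMuPDF4LLM (imported lazily)."""
    import pymupdf4llm  # third-party: pip install pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path)


def pdfs_to_markdown(
    pdf_paths: Iterable[str],
    converter: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Map each PDF's file stem to its Markdown text.

    `converter` is a hypothetical hook: pass any callable that takes a
    path and returns Markdown; it defaults to PyMuPDF4LLM.
    """
    convert = converter or default_converter
    return {Path(p).stem: convert(str(p)) for p in pdf_paths}
```

Called as pdfs_to_markdown(["report.pdf"]), this would return a dictionary keyed by "report". Files that share a stem overwrite each other in this sketch, so a real pipeline would likely key on the full path.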

Step 2: Intelligent Chunking

Breaks documents into semantically coherent chunks that:

Fit comfortably in LLM context windows
Contain complete thoughts and ideas
Respect semantic boundaries
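
One way to approximate these properties is to split at Markdown headings first and then pack whole paragraphs up to a size limit. The sketch below uses only the standard library; the 2000-character default, the character-based size measure, and the function names are assumptions, not the project's actual chunking strategy.

```python
import re
from typing import List

HEADING = re.compile(r"^#{1,6}\s")


def split_sections(markdown: str) -> List[str]:
    """Split Markdown into sections, starting a new one at each heading."""
    sections, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def chunk_markdown(markdown: str, max_chars: int = 2000) -> List[str]:
    """Pack whole paragraphs into chunks, never crossing a heading.

    A single paragraph longer than max_chars is kept intact rather than
    cut mid-thought, so chunks can exceed the limit in that case.
    """
    chunks = []
    for section in split_sections(markdown):
        buffer = ""
        for para in re.split(r"\n\s*\n", section):
            para = para.strip()
            if not para:
                continue
            candidate = f"{buffer}\n\n{para}" if buffer else para
            if len(candidate) > max_chars and buffer:
                chunks.append(buffer)  # close the chunk at a paragraph break
                buffer = para
            else:
                buffer = candidate
        if buffer:
            chunks.append(buffer)
    return chunks
```

This sketch treats any line starting with "#" as a heading, including comment lines inside fenced code, so documents with many code blocks would need a fence-aware splitter.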

Step 3: LLM-Powered Q&A Generation

Uses an AI agent with carefully engineered prompts to generate:

Realistic, user-like questions
Grounded answers (derived only from source text)
Varied question types (factual, comparative, definitions, yes/no)
Exact context for verification
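
The grounding requirement can be checked mechanically: if the model is asked to copy its supporting passage verbatim, any pair whose context does not appear in the chunk can be discarded. The sketch below illustrates that idea only. The prompt wording, the JSON keys, and the function names are hypothetical, the model call itself is left out, and a plain dataclass stands in for the Pydantic model the project uses.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical prompt; the project's real prompt is not shown in this page.
PROMPT_TEMPLATE = (
    "Write {n} question-answer pairs about the text below.\n"
    "Use only facts stated in the text. For each pair, copy the exact "
    "supporting passage into 'context'.\n"
    'Reply with a JSON list of objects with keys "query", "answer", "context".\n\n'
    "TEXT:\n{chunk}"
)


@dataclass(frozen=True)
class QAPair:
    query: str
    answer: str
    context: str


def build_prompt(chunk: str, n: int = 3) -> str:
    """Fill the template for one chunk."""
    return PROMPT_TEMPLATE.format(n=n, chunk=chunk)


def parse_grounded_pairs(llm_reply: str, chunk: str) -> List[QAPair]:
    """Keep only well-formed pairs whose context appears verbatim in chunk."""
    try:
        items = json.loads(llm_reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        fields = [item.get(k) for k in ("query", "answer", "context")]
        if not all(isinstance(f, str) and f.strip() for f in fields):
            continue
        query, answer, context = (f.strip() for f in fields)
        if context in chunk:  # grounding check: exact substring match
            pairs.append(QAPair(query, answer, context))
    return pairs
```

The exact-substring test is deliberately strict: it rejects paraphrased context, which costs some valid pairs but keeps every surviving row verifiable against the source.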

Step 4: Structured Export

Outputs to CSV format with:

Unique identifiers for each pair
Query (user question)
AI response (ground-truth answer)
Context (source text)
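
The four fields above map naturally onto one CSV row per pair. The sketch below shows that layout using the standard library csv module as a stand-in for the pandas DataFrame export the project uses; the column names id, query, ai_response, and context, the UUID identifiers, and the input key names are assumptions.

```python
import csv
import io
import uuid
from typing import Dict, Iterable, List

# Hypothetical column names mirroring the four fields listed above.
COLUMNS = ["id", "query", "ai_response", "context"]


def to_rows(pairs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Attach a unique id to each Q&A pair and map it onto the columns."""
    return [
        {
            "id": str(uuid.uuid4()),
            "query": p["query"],
            "ai_response": p["answer"],
            "context": p["context"],
        }
        for p in pairs
    ]


def rows_to_csv(rows: Iterable[Dict[str, str]]) -> str:
    """Serialise rows to CSV text; the csv module quotes commas and newlines."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Context passages often contain commas and line breaks, so whichever writer is used, it needs to quote fields rather than join them by hand.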

Features

Fully Automated — Minimal manual effort required
Grounded Output — Every answer is tied to source text, preventing hallucinations
Scalable — Process hundreds or thousands of documents
Reproducible — Same inputs produce consistent outputs
Structured Data — Pydantic validation ensures quality and consistency
Framework Ready — Export to CSV works with DeepEval, Ragas, and other evaluation tools

Tech Stack

LangChain — LLM workflow orchestration
LangGraph — Agent implementation
PyMuPDF4LLM — PDF to Markdown conversion
NVIDIA API — Access to open-source models (GPT-OSS-120B, NV-Embed-V1)
Pandas — Data organization and export

Use Cases

RAG Evaluation — Generate test cases to measure retrieval and generation quality
Benchmarking — Create standardized datasets for comparing RAG implementations
Regression Testing — Ensure changes don't break existing functionality
Domain-Specific Testing — Generate evaluation data from your own documents

Live Demo

Try the live application: RAG Eval Dataset Generator

Get Started

Prepare your PDF documents
Configure your NVIDIA API key
Run the pipeline
Export and use your evaluation dataset
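
For the API key step, a common pattern is to read the key from an environment variable and fail early if it is missing. This is a sketch under assumptions: the variable name NVIDIA_API_KEY and the helper load_api_key are not confirmed by this page, so adjust them to match your setup.

```python
import os


def load_api_key(env_var: str = "NVIDIA_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message.

    The default variable name is an assumption, not taken from the project.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before running the pipeline.")
    return key
```

Failing before any PDF is processed avoids discovering a missing key only after the conversion and chunking steps have already run.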

For implementation details, see the full notebook.

Next Steps

Integrate with evaluation frameworks — Use DeepEval or Ragas to assess RAG quality
Add question variety filters — Include comparative, analytical, and other question types
Implement quality gates — Filter out low-quality pairs programmatically
Version control datasets — Store evaluation CSVs alongside model versions
Feedback loops — Use evaluation results to improve your RAG pipeline
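
As an illustration of the quality-gate idea, the sketch below drops pairs with empty fields, questions that are very short or do not end in a question mark, and duplicate questions. The rules, the three-word threshold, and the function names are illustrative assumptions rather than part of the tool.

```python
from typing import Dict, List


def passes_quality_gate(pair: Dict[str, str], min_question_words: int = 3) -> bool:
    """Cheap structural checks; thresholds here are illustrative only."""
    query = pair["query"].strip()
    answer = pair["answer"].strip()
    context = pair["context"].strip()
    if not answer or not context:
        return False
    if not query.endswith("?"):
        return False
    return len(query.split()) >= min_question_words


def filter_pairs(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Apply the gate and drop repeated questions.

    Duplicates are detected after lowercasing and collapsing whitespace.
    """
    seen = set()
    kept = []
    for pair in pairs:
        key = " ".join(pair["query"].lower().split())
        if key in seen or not passes_quality_gate(pair):
            continue
        seen.add(key)
        kept.append(pair)
    return kept
```

Checks like these only catch malformed rows. Judging whether an answer is actually correct would still need an evaluation framework or a human review pass.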

Transform tedious manual annotation into a reliable, scalable process for continuous RAG evaluation.