RAG Eval Dataset Generator

A tool that automatically generates evaluation datasets for Retrieval-Augmented Generation (RAG) systems. It turns your PDF documents into structured Q&A pairs that you can use to test, benchmark, and improve your RAG applications.

The Problem

Evaluating RAG systems is hard. You need:

  • Ground truth Q&A pairs that accurately reflect your documents
  • Realistic questions users would actually ask
  • Source context to verify that answers were derived from the documents, not hallucinated
  • Scale — manually creating test cases is impractical

Creating evaluation data manually is time-consuming, error-prone, and doesn't scale. This tool automates the entire process.

How It Works

The pipeline transforms documents into evaluation-ready datasets:

PDFs → Markdown → Chunks → LLM (Agent) → Q&A Pairs → DataFrame → CSV

Step 1: PDF to Markdown Conversion

Converts PDF documents into clean, structured Markdown using PyMuPDF4LLM. The conversion preserves:

  • Document hierarchy and headings
  • Semantic structure
  • Code blocks and formatting
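
As a rough illustration of this step, the sketch below converts every PDF in a folder to Markdown. It assumes the `pymupdf4llm` package and its `to_markdown` function; the `convert_pdfs` helper and its `converter` parameter are hypothetical names introduced here, not taken from the project's notebook.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def convert_pdfs(
    pdf_dir: str,
    converter: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Map each PDF file name in pdf_dir to its Markdown text."""
    if converter is None:
        # Imported lazily so the helper can be tried with a stand-in converter.
        import pymupdf4llm

        converter = pymupdf4llm.to_markdown

    markdown_by_file: Dict[str, str] = {}
    # Sorted so repeated runs process files in the same order.
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        markdown_by_file[pdf_path.name] = converter(str(pdf_path))
    return markdown_by_file
```

Passing a custom `converter` makes it easy to swap in another PDF parser or to test the surrounding pipeline without real PDFs.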

Step 2: Intelligent Chunking

Breaks documents into semantically coherent chunks that:

  • Fit comfortably in LLM context windows
  • Contain complete thoughts and ideas
  • Respect semantic boundaries
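
One simple way to approximate this behavior is to split at Markdown headings first and fall back to paragraph breaks only when a section is too long. The sketch below is an assumption about how such a chunker could look, not the project's actual implementation; `chunk_markdown` and the `max_chars` budget are hypothetical.

```python
import re
from typing import List

HEADING = re.compile(r"^#{1,6}\s")


def chunk_markdown(markdown: str, max_chars: int = 2000) -> List[str]:
    """Split Markdown into heading-aligned chunks of roughly max_chars or less."""
    # First pass: group lines into sections, starting a new one at each heading.
    sections: List[List[str]] = [[]]
    for line in markdown.splitlines():
        if HEADING.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    chunks: List[str] = []
    for section in sections:
        text = "\n".join(section).strip()
        if not text:
            continue
        if len(text) <= max_chars:
            chunks.append(text)
            continue
        # Second pass: split an oversized section on blank lines (paragraphs).
        # A single paragraph longer than max_chars is kept whole, not cut.
        current = ""
        for paragraph in re.split(r"\n\s*\n", text):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Character counts are only a proxy for tokens here; a token-based budget would track the model's context window more closely.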

Step 3: LLM-Powered Q&A Generation

Uses an AI agent with carefully engineered prompts to generate:

  • Realistic, user-like questions
  • Grounded answers (derived only from source text)
  • Varied question types (factual, comparative, definitions, yes/no)
  • Exact context for verification
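
To make the grounding idea concrete, here is a minimal sketch of how generated pairs could be checked after the model responds. It assumes the agent returns a JSON list of objects with `question`, `answer`, `context`, and `question_type` fields; that schema, the `QAPair` class, and `parse_qa_pairs` are all hypothetical. A plain dataclass is used instead of a Pydantic model only to keep the example dependency-free.

```python
import json
from dataclasses import dataclass
from typing import List

# Assumed labels for the question types listed above.
QUESTION_TYPES = {"factual", "comparative", "definition", "yes_no"}


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    context: str
    question_type: str


def parse_qa_pairs(raw_json: str, chunk: str) -> List[QAPair]:
    """Parse the model's JSON output and keep only pairs grounded in the chunk."""
    pairs: List[QAPair] = []
    for item in json.loads(raw_json):
        pair = QAPair(
            question=item["question"].strip(),
            answer=item["answer"].strip(),
            context=item["context"].strip(),
            question_type=item["question_type"],
        )
        if pair.question_type not in QUESTION_TYPES:
            continue
        # Grounding check: the quoted context must appear verbatim in the chunk.
        if not pair.context or pair.context not in chunk:
            continue
        pairs.append(pair)
    return pairs
```

The verbatim substring check is strict on purpose: a pair whose context was paraphrased or invented by the model is dropped rather than trusted.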

Step 4: Structured Export

Outputs to CSV format with:

  • Unique identifiers for each pair
  • Query (user question)
  • AI response (ground-truth answer)
  • Context (source text)
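
The export could look roughly like the sketch below. The project uses a Pandas DataFrame for this step; the example swaps in Python's standard `csv` module so it runs without extra dependencies. The column names (`id`, `query`, `ai_response`, `context`) and the `write_dataset` helper are assumptions based on the fields listed above.

```python
import csv
import uuid
from typing import Iterable, TextIO, Tuple

# Assumed column names, mirroring the four fields listed above.
COLUMNS = ["id", "query", "ai_response", "context"]


def write_dataset(pairs: Iterable[Tuple[str, str, str]], out: TextIO) -> int:
    """Write (query, answer, context) triples as CSV rows; return the row count."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    count = 0
    for query, answer, context in pairs:
        writer.writerow({
            "id": uuid.uuid4().hex,  # random unique identifier per pair
            "query": query,
            "ai_response": answer,
            "context": context,
        })
        count += 1
    return count
```

When writing to a real file, open it with `newline=""` as the `csv` documentation recommends, so that quoted fields containing line breaks survive a round trip.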

Features

  • Fully Automated — Minimal manual effort required
  • Grounded Output — Every answer is tied to its source text, so hallucinated answers are easy to spot and filter out
  • Scalable — Process hundreds or thousands of documents
  • Reproducible — Same inputs produce consistent outputs
  • Structured Data — Pydantic validation ensures quality and consistency
  • Framework Ready — Export to CSV works with DeepEval, Ragas, and other evaluation tools

Tech Stack

  • LangChain — LLM workflow orchestration
  • LangGraph — Agent implementation
  • PyMuPDF4LLM — PDF to Markdown conversion
  • NVIDIA API — Access to open-source models (GPT-OSS-120B, NV-Embed-V1)
  • Pandas — Data organization and export

Use Cases

  • RAG Evaluation — Generate test cases to measure retrieval and generation quality
  • Benchmarking — Create standardized datasets for comparing RAG implementations
  • Regression Testing — Ensure changes don't break existing functionality
  • Domain-Specific Testing — Generate evaluation data from your own documents

Live Demo

Try the live application: RAG Eval Dataset Generator

Get Started

  1. Prepare your PDF documents
  2. Configure your NVIDIA API key
  3. Run the pipeline
  4. Export and use your evaluation dataset
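
For step 2, one common approach is to read the key from an environment variable and fail early if it is missing. The sketch below assumes the variable is named `NVIDIA_API_KEY` (the name the LangChain NVIDIA integration reads by default); the `load_api_key` helper is hypothetical and the notebook may configure the key differently.

```python
import os


def load_api_key(env_var: str = "NVIDIA_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the pipeline."
        )
    return key
```

Keeping the key in the environment rather than in the notebook avoids committing it to version control by accident.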

For implementation details, see the full notebook.

Next Steps

  1. Integrate with evaluation frameworks — Use DeepEval or Ragas to assess RAG quality
  2. Add question variety filters — Include comparative, analytical, and other question types
  3. Implement quality gates — Filter out low-quality pairs programmatically
  4. Version control datasets — Store evaluation CSVs alongside model versions
  5. Feedback loops — Use evaluation results to improve your RAG pipeline
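
As a starting point for the quality gates in item 3, the sketch below filters rows with a few cheap rules: no empty fields, the query must end with a question mark, and near-duplicate queries are dropped. The rules, the `filter_rows` helper, and the `query` / `ai_response` / `context` keys are illustrative assumptions, not a tested quality policy.

```python
from typing import Dict, List


def filter_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep rows that pass simple, rule-based quality checks."""
    seen = set()
    kept: List[Dict[str, str]] = []
    for row in rows:
        query = row["query"].strip()
        answer = row["ai_response"].strip()
        context = row["context"].strip()
        # Gate 1: every field must be non-empty.
        if not (query and answer and context):
            continue
        # Gate 2: the query should read as a question.
        if not query.endswith("?"):
            continue
        # Gate 3: drop duplicates, ignoring case and extra whitespace.
        key = " ".join(query.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

Rule-based gates like these are cheap to run on every generated dataset; stricter checks, such as scoring pairs with a second model, could be layered on top.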

Transform tedious manual annotation into a reliable, scalable process for continuous RAG evaluation.