This tool automatically generates evaluation datasets for Retrieval-Augmented Generation (RAG) systems. It transforms PDF documents into structured Q&A pairs that you can use to test, benchmark, and improve your RAG applications.
The Problem
Evaluating RAG systems is hard. You need:
- Ground truth Q&A pairs that accurately reflect your documents
- Realistic questions users would actually ask
- Source context to verify that answers were derived from the documents rather than hallucinated
- Scale — manually creating test cases is impractical
Writing these pairs by hand is slow and error-prone. This tool automates the process, from PDF input to CSV output.
How It Works
The pipeline transforms documents into evaluation-ready datasets:
PDFs → Markdown → Chunks → LLM (Agent) → Q&A Pairs → DataFrame → CSV
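The data flow above can be sketched as a chain of three stage functions. This is a minimal illustration, not the project's actual code: `run_pipeline` and the three callables are hypothetical names, and each stage is passed in so that it can be swapped or stubbed.

```python
from typing import Callable, Iterable


def run_pipeline(
    pdf_paths: Iterable[str],
    to_markdown: Callable[[str], str],
    chunk: Callable[[str], list[str]],
    generate_pairs: Callable[[str], list[dict]],
) -> list[dict]:
    """Chain the stages: PDF -> Markdown -> chunks -> Q&A rows."""
    rows = []
    for path in pdf_paths:
        markdown = to_markdown(path)              # Step 1
        for text in chunk(markdown):              # Step 2
            for pair in generate_pairs(text):     # Step 3
                # Keep the source file so each row can be traced back.
                rows.append({"source": path, **pair})
    return rows  # Step 4 turns these rows into a DataFrame / CSV
```

Passing the stages as arguments keeps the loop testable without a PDF parser or an LLM: stand-in lambdas are enough to check the wiring.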
Step 1: PDF to Markdown Conversion
Converts PDF documents into clean, structured markdown using PyMuPDF4LLM. This preserves:
- Document hierarchy and headings
- Semantic structure
- Code blocks and formatting
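A minimal sketch of this step, assuming the `pymupdf4llm` package is installed. `pymupdf4llm.to_markdown` is the library's conversion function; the helper names `markdown_path_for` and `pdf_to_markdown` and the output layout are assumptions made for illustration.

```python
from pathlib import Path


def markdown_path_for(pdf_path, out_dir) -> Path:
    """Hypothetical naming rule: report.pdf -> <out_dir>/report.md."""
    return Path(out_dir) / (Path(pdf_path).stem + ".md")


def pdf_to_markdown(pdf_path, out_dir) -> Path:
    """Convert one PDF to Markdown and write it next to the other outputs."""
    # Imported here so the module loads even where the library is absent.
    import pymupdf4llm

    md_text = pymupdf4llm.to_markdown(str(pdf_path))
    out_path = markdown_path_for(pdf_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(md_text, encoding="utf-8")
    return out_path
```

Writing the Markdown to disk is optional, but it makes the intermediate output easy to inspect before any LLM calls are spent on it.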
Step 2: Intelligent Chunking
Breaks documents into semantically coherent chunks that:
- Fit comfortably in LLM context windows
- Contain complete thoughts and ideas
- Respect semantic boundaries
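The source does not say which chunking strategy the tool uses, so the following is only one plausible sketch: split on Markdown headings, then pack whole paragraphs into chunks up to a character budget. The function name and the `max_chars` default are assumptions.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split on headings first, then on paragraphs for oversized sections."""
    # Zero-width split before each heading keeps the heading with its body.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: pack whole paragraphs until the budget is hit.
        current = ""
        for para in re.split(r"\n\s*\n", section):
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

A single paragraph longer than `max_chars` is kept whole in this sketch rather than cut mid-sentence; a production version would likely measure tokens instead of characters.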
Step 3: LLM-Powered Q&A Generation
Uses an AI agent with carefully engineered prompts to generate:
- Realistic, user-like questions
- Grounded answers (derived only from source text)
- Varied question types (factual, comparative, definitions, yes/no)
- Exact context for verification
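The project's actual prompts are not shown here, so the sketch below only illustrates the shape of this step. The prompt wording, the JSON keys, and the `llm` callable (any function that takes a prompt string and returns the model's text) are all assumptions; in the real pipeline that callable would wrap the LangChain/LangGraph agent.

```python
import json
from typing import Callable

# Hypothetical prompt; the tool's real prompt is not published in this text.
PROMPT_TEMPLATE = (
    "You write evaluation data for a RAG system.\n"
    "Using ONLY the text below, write {n_pairs} question-answer pairs.\n"
    "Mix question types: factual, comparative, definition, yes/no.\n"
    "Return a JSON array of objects with the keys "
    '"question", "answer" and "context", where "context" is the exact '
    "passage that supports the answer.\n\n"
    "Text:\n{chunk}\n"
)


def build_prompt(chunk: str, n_pairs: int = 3) -> str:
    return PROMPT_TEMPLATE.format(n_pairs=n_pairs, chunk=chunk)


def generate_pairs(chunk: str, llm: Callable[[str], str], n_pairs: int = 3) -> list[dict]:
    """Ask the model for pairs and keep only well-formed ones."""
    try:
        items = json.loads(llm(build_prompt(chunk, n_pairs)))
    except json.JSONDecodeError:
        return []  # Model did not return valid JSON; skip this chunk.
    if not isinstance(items, list):
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        fields = {k: item.get(k) for k in ("question", "answer", "context")}
        if all(isinstance(v, str) and v.strip() for v in fields.values()):
            pairs.append(fields)
    return pairs
```

Asking for the supporting passage alongside each answer is what later allows a pair to be checked against its source chunk.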
Step 4: Structured Export
Outputs to CSV format with:
- Unique identifiers for each pair
- Query (user question)
- AI response (ground-truth answer)
- Context (source text)
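As a rough sketch of the export, the snippet below writes the four fields with the standard-library `csv` module as a dependency-free stand-in for the Pandas step; with Pandas the equivalent would be `pd.DataFrame(rows).to_csv(path, index=False)`. The column names `id`, `query`, `ai_response`, and `context` are assumed from the list above, not taken from the tool's real output.

```python
import csv
import io
import uuid

# Assumed column names, mirroring the four fields listed above.
COLUMNS = ["id", "query", "ai_response", "context"]


def to_rows(pairs: list[dict]) -> list[dict]:
    """Attach a unique identifier and rename fields to the export schema."""
    return [
        {
            "id": str(uuid.uuid4()),
            "query": pair["question"],
            "ai_response": pair["answer"],
            "context": pair["context"],
        }
        for pair in pairs
    ]


def write_csv(rows: list[dict], handle) -> None:
    """Write rows to any text file handle opened with newline=''."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

To write a file, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the handle to `write_csv`.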
Features
- Fully Automated — Minimal manual effort required
- Grounded Output — Every answer is tied to its source text, which makes hallucinated answers easier to detect
- Scalable — Process hundreds or thousands of documents
- Reproducible — The same documents, prompts, and model settings run through the same pipeline, although exact wording can vary between LLM calls
- Structured Data — Pydantic validation ensures quality and consistency
- Framework Ready — Export to CSV works with DeepEval, Ragas, and other evaluation tools
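The structured-data and grounding features can be illustrated with a small schema plus a containment check. This sketch uses a standard-library dataclass as a stand-in for the Pydantic model the tool relies on, so it is not the project's schema; `QAPair`, `normalize`, and `is_grounded` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """Stdlib stand-in for a Pydantic model: three required text fields."""
    question: str
    answer: str
    context: str

    def __post_init__(self):
        for name in ("question", "answer", "context"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")


def normalize(text: str) -> str:
    """Collapse whitespace and case so line wraps do not break matching."""
    return " ".join(text.split()).lower()


def is_grounded(pair: QAPair, chunk: str) -> bool:
    """True when the quoted context appears in the source chunk."""
    return normalize(pair.context) in normalize(chunk)
```

The check only confirms that the quoted context exists in the source. Whether the answer follows from that context still needs a human or an LLM-based judge.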
Tech Stack
- LangChain — LLM workflow orchestration
- LangGraph — Agent implementation
- PyMuPDF4LLM — PDF to Markdown conversion
- NVIDIA API — Access to open-source models (GPT-OSS-120B, NV-Embed-V1)
- Pandas — Data organization and export
Use Cases
- RAG Evaluation — Generate test cases to measure retrieval and generation quality
- Benchmarking — Create standardized datasets for comparing RAG implementations
- Regression Testing — Ensure changes don't break existing functionality
- Domain-Specific Testing — Generate evaluation data from your own documents
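For the retrieval side of these use cases, one simple metric is the share of questions for which the retriever returns the ground-truth context. The sketch below assumes rows with `query` and `context` keys and a `retrieve` callable that returns passages for a question; both are placeholders for your own dataset and retriever.

```python
from typing import Callable


def retrieval_hit_rate(
    rows: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose top-k passages contain the gold context."""
    if not rows:
        return 0.0
    hits = 0
    for row in rows:
        passages = retrieve(row["query"])[:k]
        # Exact substring match; real chunk boundaries may need fuzzier logic.
        if any(row["context"] in passage for passage in passages):
            hits += 1
    return hits / len(rows)
```

Storing this number per dataset version gives a baseline for regression testing: a drop after a pipeline change is a signal to investigate.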
Live Demo
Try the live application: RAG Eval Dataset Generator
Get Started
- Prepare your PDF documents
- Configure your NVIDIA API key
- Run the pipeline
- Export and use your evaluation dataset
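For the API key step, a common pattern is to read the key from an environment variable instead of hard-coding it. The variable name `NVIDIA_API_KEY` is an assumption here (it is the name NVIDIA's LangChain integration conventionally reads); check the notebook for the name it expects.

```python
import os


def load_api_key(env=None, name: str = "NVIDIA_API_KEY") -> str:
    """Return the API key from the environment, failing early if missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set {name} before running the pipeline")
    return key
```

Failing at startup is cheaper than discovering a missing key after the PDFs have already been converted and chunked.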
For detailed implementation, check out the full notebook.
Next Steps
- Integrate with evaluation frameworks — Use DeepEval or Ragas to assess RAG quality
- Add question variety filters — Include comparative, analytical, and other question types
- Implement quality gates — Filter out low-quality pairs programmatically
- Version control datasets — Store evaluation CSVs alongside model versions
- Feedback loops — Use evaluation results to improve your RAG pipeline
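A quality gate can start as a few cheap programmatic checks. The rules below (question ends with a question mark, answer meets a minimum length, no duplicate questions) and the thresholds are illustrative assumptions, not criteria defined by the tool. Rows are assumed to use `query` and `ai_response` keys.

```python
def filter_pairs(rows: list[dict], min_answer_chars: int = 10) -> list[dict]:
    """Drop rows that fail simple, cheap quality checks."""
    kept, seen = [], set()
    for row in rows:
        question = " ".join(row["query"].split())
        if not question.endswith("?"):
            continue  # Not phrased as a question.
        if len(row["ai_response"].strip()) < min_answer_chars:
            continue  # Answer too short to be useful.
        key = question.lower()
        if key in seen:
            continue  # Duplicate question, ignoring case and spacing.
        seen.add(key)
        kept.append(row)
    return kept
```

Checks like these catch obvious failures. Subtler problems, such as an answer that is well-formed but wrong, need review by a person or a judge model.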
Transform tedious manual annotation into a reliable, scalable process for continuous RAG evaluation.