Manually writing instruction-response pairs for fine-tuning a language model is expensive and doesn't scale. A meaningful fine-tune typically requires thousands of examples. At 10–15 minutes per pair, even 1,000 examples take roughly 170 to 250 hours, which is weeks of labor before training even starts.
This app automates the process using a three-stage synthesis pipeline that turns your own documents into a clean, Alpaca-format training dataset.
Streamlit Demo
Try here: synthesize-finetuning-dataset.streamlit.app
How It Works
PDFs → Markdown → Topic Chunks → Atomic Facts → QA Pairs → CSV / JSON
Stage 1: Parse & Chunk
PDFs are converted to structured markdown using PyMuPDF4LLM, preserving heading hierarchy and semantic structure. The markdown is then split at heading boundaries into topic chunks — one dict per section, with a title and content field.
Splitting at headings rather than at fixed token counts keeps each chunk semantically coherent. A fixed-token splitter can cut a paragraph, or even a sentence, in half, whereas heading-based splitting respects the document's conceptual boundaries.
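The chunking step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `split_by_headings` and the "Preamble" label for text before the first heading are assumptions, and the sketch does not special-case `#` lines inside fenced code. In the real pipeline the markdown string would come from PyMuPDF4LLM (for example via `pymupdf4llm.to_markdown(path)`).

```python
import re

# Matches ATX markdown headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split markdown into one {"title", "content"} dict per heading section."""
    chunks: list[dict] = []
    title = "Preamble"  # hypothetical label for text before the first heading
    lines: list[str] = []

    def flush() -> None:
        # Close the current section; sections with no body text are skipped.
        content = "\n".join(lines).strip()
        if content:
            chunks.append({"title": title, "content": content})

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()                 # a new heading ends the previous section
            title = match.group(2)
            lines = []
        else:
            lines.append(line)
    flush()                         # emit the final section
    return chunks
```

Because every chunk starts at a heading, a section's title travels with its content, which gives the later stages a natural label for each group of facts.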
Stage 2: Extract Atomic Facts
Each topic chunk is passed to an LLM with a structured extraction prompt. The output is a list of atomic facts: single, self-contained declarative sentences that represent the core domain knowledge in that section.
This intermediate step is what separates this pipeline from naive QA generation. By reducing each chunk to explicit facts first, the question generator in Stage 3 works from a constrained, grounded source of truth. Because every answer is expected to trace back to an extracted fact, hallucinated answers are much less likely to slip in, although the extracted facts are themselves LLM output and are only as reliable as the extraction step.
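A minimal sketch of the extraction step, under stated assumptions: the prompt wording, the `extract_facts` name, and the injected `llm` callable (any function that takes a prompt string and returns the model's text reply) are all hypothetical. The project itself uses LangChain structured output with Pydantic schemas; plain JSON parsing stands in for that here so the example has no third-party dependencies.

```python
import json
from typing import Callable

# Hypothetical prompt; the project's real prompt is in the companion blog post.
FACT_PROMPT = (
    "Extract the atomic facts from the section below. Each fact must be a "
    "single, self-contained declarative sentence. Respond with a JSON object "
    'of the form {{"facts": ["..."]}}.\n\n'
    "Section title: {title}\n\n{content}"
)


def extract_facts(chunk: dict, llm: Callable[[str], str]) -> list[str]:
    """Ask the model for atomic facts and validate the shape of its reply."""
    raw = llm(FACT_PROMPT.format(title=chunk["title"], content=chunk["content"]))
    payload = json.loads(raw)
    facts = payload.get("facts") if isinstance(payload, dict) else None
    if not isinstance(facts, list):
        raise ValueError("expected a JSON object with a 'facts' list")
    # Keep non-empty strings only; drop exact duplicates while preserving order.
    cleaned: list[str] = []
    for fact in facts:
        if isinstance(fact, str) and fact.strip() and fact.strip() not in cleaned:
            cleaned.append(fact.strip())
    return cleaned
```

Passing the model in as a callable keeps the function easy to test with a stub and independent of which OpenAI-compatible backend is configured.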
Stage 3: Generate QA Pairs
The facts list for each chunk is passed to the LLM with a generation prompt that enforces:
- Question variety — what, how, why, compare/contrast, enumerate, true/false
- Difficulty spread — simple fact retrieval alongside multi-fact synthesis questions
- Answer depth — 200–300 word answers with context and supporting detail
The output is a list of Alpaca-format {"instruction", "input", "output"} dicts, ready for fine-tuning frameworks that accept the Alpaca format.
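To make the record shape and the export step concrete, here is a hedged sketch. The helper names, the `dataset_<timestamp>` file naming, and the choice to leave `input` empty are assumptions for illustration. The project assembles and exports its dataset with Pandas; the standard-library `csv` and `json` modules stand in for it here.

```python
import csv
import json
from datetime import datetime
from pathlib import Path

FIELDS = ["instruction", "input", "output"]


def to_alpaca(question: str, answer: str) -> dict:
    """Build one Alpaca-format record; "input" is left empty (an assumption)."""
    return {"instruction": question.strip(), "input": "", "output": answer.strip()}


def export_dataset(records: list[dict], out_dir: str) -> tuple[Path, Path]:
    """Write the records to timestamped JSON and CSV files and return both paths."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # hypothetical naming scheme
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    json_path = folder / f"dataset_{stamp}.json"
    csv_path = folder / f"dataset_{stamp}.csv"

    json_path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
    return json_path, csv_path
```

Writing both files from the same list of records keeps the CSV and JSON exports in sync, and the shared timestamp ties the two files to a single run.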
Features
- No hand-labeling — the entire dataset is synthesized from documents you already own
- Grounded output — facts are extracted before questions are generated, which reduces the risk of hallucinated answers
- Alpaca-format export — compatible with Axolotl, LLaMA Factory, and Hugging Face training scripts
- OpenAI-compatible backend — works with NVIDIA API, Ollama, vLLM, or OpenAI directly
- CSV + JSON export — timestamped files for reproducible runs
- Streamlit UI — upload PDFs, configure the model, run the pipeline, download the dataset
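Switching between OpenAI-compatible backends mostly comes down to changing the base URL, the API key, and the model name. The sketch below illustrates that idea; the `client_kwargs` helper and the `LLM_API_KEY` environment variable are hypothetical, and the local URLs assume Ollama and vLLM are running on their default ports.

```python
import os
from typing import Optional

# Default endpoints for each backend; the local ones assume stock ports.
BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "nvidia": "https://integrate.api.nvidia.com/v1",
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}


def client_kwargs(backend: str, model: str, api_key: Optional[str] = None) -> dict:
    """Return keyword arguments for an OpenAI-compatible chat client."""
    if backend not in BASE_URLS:
        raise ValueError(f"unknown backend: {backend!r}")
    # Local servers usually ignore the key, but clients expect a non-empty value.
    key = api_key or os.environ.get("LLM_API_KEY", "not-needed")  # hypothetical variable
    return {"model": model, "base_url": BASE_URLS[backend], "api_key": key}
```

With `langchain_openai` installed, these arguments could be passed along as `ChatOpenAI(**client_kwargs("ollama", model_name))`, where `model_name` is whatever model the chosen server exposes.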
Tech Stack
- PyMuPDF4LLM — PDF to structured markdown
- LangChain + LangChain-OpenAI — LLM orchestration and structured output via Pydantic
- Pydantic — schema enforcement at each stage (facts and QA pairs)
- NVIDIA API — access to hosted open-weight models (e.g., GPT-OSS-120B)
- Pandas — dataset assembly and export
- Streamlit — UI and file handling
Related
The full implementation walkthrough — including all prompts, Pydantic schemas, and design decisions — is in the companion blog post: Building LLM Fine-Tuning Data Without Hand-Labeling a Single Example.