~/projects

LLM Fine-Tuning Dataset Generator

agentic-data-preparationdataset-preparationstreamlitfine-tuning

Manually writing instruction-response pairs for fine-tuning a language model is expensive and doesn't scale. A meaningful fine-tune typically requires thousands of examples — at 10–15 minutes per pair, that's weeks of labor before training even starts.

This app automates the process using a three-stage synthesis pipeline that turns your own documents into a clean, Alpaca-format training dataset.

Video Demo

Streamlit Demo

Try here: synthesize-finetuning-dataset.streamlit.app

How It Works

PDFs → Markdown → Topic Chunks → Atomic Facts → QA Pairs → CSV / JSON

Stage 1: Parse & Chunk

PDFs are converted to structured markdown using PyMuPDF4LLM, preserving heading hierarchy and semantic structure. The markdown is then split at heading boundaries into topic chunks — one dict per section, with a title and content field.

Splitting at headings rather than at fixed token counts keeps each chunk semantically coherent. A fixed-token splitter cuts mid-paragraph; heading-based splitting respects conceptual boundaries.

Stage 2: Extract Atomic Facts

Each topic chunk is passed to an LLM with a structured extraction prompt. The output is a list of atomic facts: single, self-contained declarative sentences that represent the core domain knowledge in that section.

This intermediate step is what separates this pipeline from naive QA generation. By reducing each chunk to verified facts first, the question generator in Stage 3 has a constrained, grounded source of truth — hallucinated answers can't slip in because every answer must trace back to an extracted fact.

Stage 3: Generate QA Pairs

The facts list for each chunk is passed to the LLM with a generation prompt that enforces:

Question variety — what, how, why, compare/contrast, enumerate, true/false
Difficulty spread — simple fact retrieval alongside multi-fact synthesis questions
Answer depth — 200–300 word answers with context and supporting detail

The output is a list of Alpaca-format {"instruction", "input", "output"} dicts, ready for any fine-tuning framework.

Features

No hand-labeling — the entire dataset is synthesized from documents you already own
Grounded output — facts are extracted before questions are generated, preventing hallucination
Alpaca-format export — compatible with Axolotl, LLaMA Factory, and Hugging Face training scripts
OpenAI-compatible backend — works with NVIDIA API, Ollama, vLLM, or OpenAI directly
CSV + JSON export — timestamped files for reproducible runs
Streamlit UI — upload PDFs, configure the model, run the pipeline, download the dataset

Tech Stack

PyMuPDF4LLM — PDF to structured markdown
LangChain + LangChain-OpenAI — LLM orchestration and structured output via Pydantic
Pydantic — schema enforcement at each stage (facts and QA pairs)
NVIDIA API — access to open-source models (GPT-OSS-120B)
Pandas — dataset assembly and export
Streamlit — UI and file handling

The full implementation walkthrough — including all prompts, Pydantic schemas, and design decisions — is in the companion blog post: Building LLM Fine-Tuning Data Without Hand-Labeling a Single Example.