
Building LLM Fine-Tuning Data Without Hand-Labeling a Single Example

Jun 6, 2026 • 54 min read • By Mohammed Vasim

llm fine-tuning data-synthesis langchain python

Fine-tuning a language model on domain knowledge sounds manageable until you estimate the labor involved. A single high-quality instruction-response pair might take 10–15 minutes to write carefully — multiply that by the thousands of examples a meaningful fine-tune requires, and the math breaks quickly.

A more practical path: synthesize the training data from documents you already own. Three stages connect raw PDFs to a clean Alpaca-format dataset ready for fine-tuning:

Parse & Chunk — extract markdown from PDFs and split by topic heading
Extract Facts — compress each chunk into atomic, self-contained factual statements
Generate QA Pairs — use those facts as ground truth to produce diverse instruction–response pairs

The key is that splitting fact extraction from question generation gives the LLM a constrained, verifiable source of truth at each stage. A question is much harder to hallucinate from thin air when its answer has to come from a pre-extracted fact.
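For orientation, the end product of the pipeline is a list of Alpaca-style records. The sketch below shows the shape of one such record; the field values are illustrative placeholders written for this example, not output from the pipeline.

```python
import json

# One Alpaca-style record: "instruction" holds the question or task,
# "input" holds optional extra context (often empty), "output" holds the answer.
# The values are illustrative placeholders, not real pipeline output.
record = {
    "instruction": "What is fuzzy deduplication used for?",
    "input": "",
    "output": "Fuzzy deduplication detects documents that are almost the same.",
}

# Datasets in this format are commonly stored as a JSON list or as JSON Lines.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Keeping the three keys fixed from the start makes the final export step a plain serialization rather than a reshaping job.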

pymupdf4llm converts PDFs to markdown while preserving heading hierarchy, tables, and code blocks. langchain-openai provides structured output support over any OpenAI-compatible endpoint. tqdm and pandas handle progress display and final data manipulation.

python

%pip install "data-designer>=0.6.0" "deepeval>=3.8.3" "ipykernel>=7.2.0" "langchain>=1.2.7" "langchain-community>=0.4.1" "langchain-core>=1.2.7" "langchain-openai>=1.1.7" "langgraph>=1.0.7" "mistune>=3.2.1" "pymupdf4llm>=1.27.2.3" "streamlit>=1.57.0" "pandas" "tqdm"

Adding the parent directory to sys.path makes utils.py, which contains the chunk_markdown_by_topic helper, importable from the notebook.

python

import os
import sys
import pymupdf4llm

sys.path.insert(0, os.path.abspath(".."))

Stage 1: Parsing Source Documents

The source documents are two blog posts saved as PDFs, but any domain-specific PDFs work: internal wikis, research papers, technical documentation. The content structure influences downstream quality; documents with a clear heading hierarchy produce cleaner, more coherent chunks.

Converting PDFs to Markdown

pymupdf4llm.to_markdown converts each PDF page by page, preserving heading structure. The result is a markdown string where each section heading maps to a ## or ### block — exactly what chunk_markdown_by_topic splits on.

Setting write_images=False skips image extraction, which would add file I/O overhead without benefiting text-based QA generation.

python

files = [
    "/Users/vasim/Downloads/Data Engineering for Foundation Models: The Alchemist’s Cookbook | Mohammed Vasim.pdf",
    "/Users/vasim/Downloads/The Tale of Meaningful Vectors: Contrastive Learning for Text Embeddings, Told with Pen and Paper | Mohammed Vasim.pdf",
]

python

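def convert_pdf_to_md(files: list[str], write_images: bool = False) -> dict:
    """Convert each PDF to a markdown string, keyed by file name.

    With write_images=True the dict also gets one extra entry per PDF,
    keyed by the image directory, listing the extracted image paths.
    """
    content = {}

    for file in files:
        filename = os.path.basename(file)
        img_path = f"../data/images/{filename}"
        # to_markdown returns the whole document as a single markdown string.
        content[filename] = pymupdf4llm.to_markdown(
            file, image_path=img_path, write_images=write_images
        )

        if write_images:
            # These values are lists, not markdown; skip them when chunking.
            content[img_path] = [
                os.path.join(img_path, image_path)
                for image_path in os.listdir(img_path)
            ]

    return content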

python

md_texts = convert_pdf_to_md(files)

Splitting into Topic Chunks

chunk_markdown_by_topic splits the markdown at heading boundaries, returning a list of {"title": ..., "content": ...} dicts. Each dict represents one conceptual section of the document.

Chunking at headings rather than at fixed token counts keeps each chunk semantically coherent. A fixed-token chunker can cut mid-paragraph, mixing concepts that belong to different topics and making fact extraction less precise.
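The helper in utils.py is not shown in this post, so the following is only a minimal sketch of heading-based chunking under simple assumptions. The function name chunk_by_heading_sketch is hypothetical, and the real chunk_markdown_by_topic may merge headings or handle edge cases differently.

```python
import re

# Matches markdown ATX headings such as "## Title" or "### Title".
# Caveat: a "# comment" line inside a code block would also match.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_heading_sketch(markdown: str) -> list[dict]:
    """Split markdown into {"title", "content"} dicts at heading lines.

    Hypothetical stand-in for chunk_markdown_by_topic, for illustration only.
    """
    chunks = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the previous section, if it had any text.
            if "".join(body).strip():
                chunks.append({"title": title, "content": "\n".join(body).strip()})
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    # Flush the final section.
    if "".join(body).strip():
        chunks.append({"title": title, "content": "\n".join(body).strip()})
    return chunks

sample = "## Intro\nData matters.\n\n## Cleaning\nRemove duplicates.\nStrip HTML."
for chunk in chunk_by_heading_sketch(sample):
    print(chunk)
```

Each section stays whole regardless of its length, which is the property the fixed-token approach gives up.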

python

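from utils import chunk_markdown_by_topic

topic_wise = []

# Pool the chunks from every document into one flat list.
for text in md_texts.values():
    # Skip the image-path lists that appear when write_images=True.
    if isinstance(text, str):
        topic_wise.extend(chunk_markdown_by_topic(text))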

Each chunk has a title (the heading text) and content (the section body). These become the unit of work for Stage 2.

python

topic_wise

Out[17]:
[{'title': 'Data Engineering for Foundation Models: The Alchemist’s Cookbook | On this page',
  'content': '==> picture [521 x 129] intentionally omitted <==\n----- Start of picture text -----<br>May 17, 2026 • 7 min read • By Mohammed Vasim<br>0<br>data-engineering llm foundation-models dataset-curation ai<br>----- End of picture text -----<br>\n\nIf you’ve ever trained a model, you know the truth: you can have the fanciest architecture and endless GPU hours, but if your data is bad, your model will be bad. Data is what gives a language model its personality, its skills, and its blind spots. Without good data, you’re just heating expensive hardware.\nThink of data engineering as a kind of alchemy. You start with messy, raw material from the world—text, code, conversations—and through careful steps, you transform it into something that makes a model come alive. This isn’t magic; it’s method. Here’s how it works, told in plain words with the actual techniques we use every day.'},
 {'title': 'Gathering the Raw Stuff (Prima Materia)',
  'content': 'question: what data do you need, how much, and what does “good” even mean? If you’re building a chatbot, you need conversation data. If you want the model to use tools like a calculator or search engine, you need tool-use data, where each example shows the model calling an API and ge!ing a result. Some data is single-\nturn (one question, one answer); multi-turn data teaches the model to handle backand-forth conversations.\nYou’ll pull from many places: web crawls, code repositories, books, transcripts. This raw mix is messy, but it’s your starting gold ore. The key is knowing that di"erent training phases want di"erent data: pre-training needs massive token counts, while fine-tuning needs highly curated examples.'},
 {'title': 'Cleaning House: The First Fire',
  'content': 'Raw data is full of junk. Duplicate documents, HTML tags, private information (PII), toxic content, and near-identical copies that will make the model memorize instead of learn. The process of fixing this is called data processing.\nWe use deduplication to remove repeats. Exact dedup fuzzy dedup (often with a technique called MinHashLSH) catches documents that are almost the same, like the same news article on ten di"erent sites. For huge webscale datasets, new methods like LSHBloom help us do this without running out of memory.\nWe also strip forma!ing junk, trailing spaces, and invisible characters. Every stray token is like static noise that confuses the learning. This stage is all about burning away the useless bits without losing the valuable, rare information.'},
 {'title': 'Making It All Uniform',
  'content': 'Data comes in a hundred shapes: JSON, plain text, HTML, chat logs with weird timestamps. To train a model, everything must follow the same structure. For instruction fine-tuning, we need (instruction, response) pairs. For preference fine-tuning (teaching the model what humans prefer), we need (instruction, winning_response, losing_response) . If you’re building a reward model, you might score each response.\nEnsuring correct forma!ing is part of data quality. One misplaced newline, one missing colon, and the model wastes e"ort trying to figure out the pa!ern you broke. Think of this as dissolving all your ingredients into a consistent liquid so they can blend perfectly later.'},
 {'title': 'Separating the Gold from the Sand | The Right Mix: Data Coverage',
  'content': 'Now we get picky. This is where we apply the strictest rules of data quality. In Chip Huyen’s book, she lists six dimensions of quality for fine-tuning data:\ndata. The famous LIMA paper showed that just 1,000 carefully chosen examples on a 65B model could produce answers preferred over GPT-4 nearly half the time. The Yi model team found 10K clean instructions surpassed hundreds of thousands of sloppy ones. Less is often more—but only if every example is a gem.\n\nQuality alone isn’t enough. The dataset must cover many situations, or the model will fail on anything slightly unusual. This is data coverage—making sure you have a representative mix.\nIf in real life 80% of customer questions are about shipping, roughly 80% of your training should be about shipping. But you also need the rare, di#cult cases: people who talk in dialect, ask contradictory things, make typos, or use emojis. That’s the long tail. New techniques like Feature Activation Coverage measure diversity in the model’s internal language, and combinatorial coverage helps hunt down missing combinations of a!ributes. Coverage is what makes a model robust instead of bri!le.'},
 {'title': 'Adding Purpose: Annotation and Complex Behaviors',
  'content': 'Now the data gets its job. Annotation is the process of labeling examples with the desired behavior—like writing the model’s textbook.\nFor simple tasks, it’s straightforward: “Question: What’s the capital of France? Answer: Paris.” But modern models need much more. Two advanced behaviors make annotation especially tricky.\nChain-of-thought (CoT) data teaches the model to reason step-by-step. Instead of just the final answer, you need a full explanation. Creating these explanations is slow and expensive. Domain experts often skip steps unknowingly. Many teams now use ensembles of AI agents to generate and refine CoT data, cu!ing costs while keeping quality.\nTool use data teaches the model to call APIs, search the web, or run code. Humans and models use tools di"erently. A human opens a browser; a model prefers an API. So we often create synthetic tool-use data by simulating environments. For example, the SYNTHAGENT framework simulates a user and a set of mock tools to generate realistic interaction sequences. Meta’s Llama 3 even designed a special multi-message chat format where each turn can have multiple messages going to di"erent places (user, code interpreter, search tool), with special tokens marking where each turn starts and ends.\nAnnotation is where the dataset gets its soul, but it’s also where the largest costs and hardest work happen.'},
 {'title': 'Creating More from Less: Synthesis and Augmentation | The Responsibility: Governance and Ethics',
  'content': 'Sometimes you have great examples but not enough. Data augmentation and synthesis let you stretch them.\nAnswer augmentation enriches existing responses.\nFor images, methods like ALIA use vision-language models to automatically edit\nimages while preserving their core content, generating more training variety. The danger is that synthetic data can be too similar to its “teacher” model, making your final model echo the teacher’s biases. Techniques like CorrSynth fight this with smarter sampling to keep diversity real.\nThink of it as distilling the essence of a perfect example to create new, equally powerful ones.\n\nend up with a cohesive, high-quality dataset. This is your philosopher’s stone: a small but incredibly potent collection that can transform a base model.\nYou version it carefully using tools like DVC (Data Version Control) , so every training run is reproducible. Standards like Croissant 1.1 automatically track where every piece of data came from, making debugging and compliance easier. The dataset is now ready to be fed into fine-tuning.\n\nIf you taught it chain-of-thought, it will reason before answering. If you taught it tool use, it will call APIs without being told. If your coverage was broad, it will handle messy real-world inputs gracefully.\nappear. Silences you left will become topics the model can’t discuss. Skills you carefully wove in will shine. You didn’t just train a model; you shaped a mind, and your fingerprints are all over it.\n\nWith great power comes paperwork—and good reason. Datasets often contain personal information, copyrighted material, or toxic content. 
Data governance is about pu!ing automated guardrails in place.\nAutomated policy triggers: If a dataset contains PII, the system demands extra verification and a stated purpose before granting access.\nCompliance checks: Pipelines automatically scan against laws like GDPR, CCPA, and the EU AI Act.\nThe job has grown so much that what used to be a side task for two people (like in GPT-3) now involves dozens of specialized roles: precision annotators, domain experts, quality engineers, workflow managers. The message is clear: data work is no longer an afterthought. It’s the center of the whole AI project.'},
 {'title': 'logue: Toil, Tears, Sweat, and a Little Magic',
  'content': 'Chip Huyen wrote that “data will mostly just be toil, tears, and sweat.” She wasn’t wrong. You’ll spend endless nights cleaning formats, fixing inconsistent labels, debating with compliance, and wondering if it’s worth it.\nand it will answer calmly, step-by-step, using the right tool, and ending with just the right tone. In that moment, you’ll see the soul you poured into those thousands of examples looking back at you.\nThat’s the alchemy of data engineering for LLMs. No magic, just method—but the result feels like nothing less.\n\n|Name||Email||---|---|---||Your|name|you@example.com|\nComment\nYour comment...\nPost Comment'},
 {'title': 'Back to blog | On this page',
  'content': 'The Tale of Meaningful Vectors: Contrastive Learning for Text Embeddings, Told with Pen and Paper\nMay 17, 2026 • 9 min read • By Mohammed Vasim 0 contrastive-learning text-embeddings deep-learning\n\nImagine you are the librarian of a vast, magical library. Books are sca!ered everywhere. Your job is to arrange them so that books on similar topics sit near each other. A visitor whispers a few words, and you must instantly retrieve the perfect book. How would you teach an apprentice to do this?\nYou would show them examples: “These two books are about the same thing – put them together. That one is unrelated – push it far away.” This, in essence, is contrastive learning. Today, we will build a text embedding model from scratch using exactly this idea. No black boxes. We will do the math with pencil and paper, trace every number, and then connect our tiny creation to the mighty models like SimCSE and Qwen3 Embedding that power modern retrieval. By the end, you will have walked the path from zero to advanced, and it will feel as natural as sorting that library.'},
 {'title': '1. The Magic Library: Why Embeddings Matter',
  'content': 'In our digital library, every book is a sentence. We want to represent each sentence as a point in space – a vector – such that sentences with similar meaning lie close together, and dissimilar ones lie far apart. This is the job of a text embedding model.\nBut how do we train such a model? We need a teaching signal: "this sentence means roughly the same as that one" (positive pair) or "these two are unrelated" (negative pair). Contrastive learning is the art of sculpting the vector space with these signals.\nLet’s start with a tiny universe. We’ll use a vocabulary of just a few words and sentences of only two or three words. We’ll build a simple encoder, define a loss, and tune the numbers by hand.'},
 {'title': '2. Our Tiny Universe: Sentences, Words, and Vectors',
  'content': 'We’ll pretend we have the following word embeddings (initialised randomly, each word is a 2-dimensional vector):\n==> picture [523 x 273] intentionally omitted <==\n----- Start of picture text -----<br>Word Embedding<br>a [0.10, 0.20]<br>cat [0.30, 0.40]<br>sleeps [0.50, 0.10]<br>dog [0.35, 0.45]<br>runs [0.45, 0.15]<br>the [0.05, 0.05]<br>loud [0.60, 0.60]<br>noise [0.70, 0.20]<br>----- End of picture text -----<br>\nOur encoder will be simplicity itself: a sentence’s vector is the average of its words’ embeddings. For example:\nSentence A: "a cat sleeps"\nWords: a, cat, sleeps\nSum = [0.10+0.30+0.50, 0.20+0.40+0.10] = [0.90, 0.70]\nAverage = [0.30, 0.2333] (Let’s round to [0.30, 0.23])\nSentence B: "a dog runs"\nSum = [0.10+0.35+0.45, 0.20+0.45+0.15] = [0.90, 0.80]\nAverage = [0.30, 0.2667] (≈ [0.30, 0.27])\nThese two sentences are similar (both about pets doing actions). Their vectors are already close! Now a completely unrelated sentence:\nSentence C: "the loud noise"\nSum = [0.05+0.60+0.70, 0.05+0.60+0.20] = [1.35, 0.85]\nAverage = [0.45, 0.2833] (≈ [0.45, 0.28])\nNow we need a loss function that says: Pull A and B even closer; push A and C far apart.'},
 {'title': '3. The Heart of Contrast: InfoNCE Loss | 3.1 Step-by-step by hand',
  'content': "The most popular contrastive loss for modern text embeddings is InfoNCE (Info Noise-Contrastive Estimation). It treats the problem as a classification: among one positive and many negative candidates, which one is the correct match?\nThe formula looks intimidating but is a gentle giant:\nexp (sim( q , k +)/ _τ ) L = −log K ∑ i =1 exp (sim( q , k  i_ )/ τ )\nLet’s unpack it with our example.\n\nk +: the positive key (sentence B, the correct match).\nki : a set that includes the positive and all negatives. Here we will have just one negative, sentence C. So K = 2 (B and C). sim: a similarity function; we'll use cosine similarity .\nτ : a temperature parameter (a small number like 0.1) that sharpens the distribution.\n\nFirst, compute the embedding vectors (we already did):\nq = [0.30, 0.23], k + = [0.30, 0.27], k − = [0.45, 0.28] Cosine similarity between two vectors a , b is:\na ⋅ b\nsim( a , b ) = ∥ a ∥∥ b ∥\nCalculate dot products and norms.\nFor q and k_+ :\nq ⋅ k + = 0.30 × 0.30 + 0.23 × 0.27 = 0.09 + 0.0621 = 0.1521 ∥ q ∥= 0.302 + 0.232 = 0.09 + 0.0529 = 0.1429 ≈ 0.378 ∥ k +∥= 0.302 + 0.272 = 0.09 + 0.0729 = 0.1629 ≈ 0.404 0.1521 0.1521 sim( q , k +) = ≈ ≈ 0.996 0.378 × 0.404 0.1527\n(These vectors are extremely similar – almost parallel – by construction.)\nFor q and k_- :\nq ⋅ k − = 0.30 × 0.45 + 0.23 × 0.28 = 0.135 + 0.0644 = 0.1994\n∥ k −∥= 0.452 + 0.282 = 0.2025 + 0.0784 = 0.2809 ≈ 0.530 0.1994 0.1994 sim( q , k −) = ≈ ≈ 0.995 0.378 × 0.530 0.2003\nWait, both similarities are nearly 1? That's because our initial random vectors happened to be highly correlated. That's fine – the loss will still push them based on the labels. 
However, to be!er see the contrast, let's use a more realistic temperature τ = 0.2.\nCompute the exponentials:\nexp(sim( q , k +)/ τ ) = exp(0.996/0.2) = exp(4.98) ≈145.5\nexp(sim( q , k −)/ τ ) = exp(0.995/0.2) = exp(4.975) ≈144.7\nNow the denominator: 145.5 + 144.7 = 290.2.\nThe probability that the model assigns to the positive match is:\n145.5 P ( k +) = ≈ 0.5012 290.2\nLoss:\nL = −log(0.5012) ≈0.690\nto assign higher probability to the positive pair."},
 {'title': '3.2 What does the gradient do?',
  'content': 'The gradient of the InfoNCE loss with respect to the query  (and similarly for the q keys) boils down to:\n==> picture [355 x 47] intentionally omitted <==\nThis is messy to compute by hand, but the intuition is simple: the gradient pulls the query towards the positive key and pushes it away from all negatives, weighted by how much the model currently confuses them.\nTo avoid the heavy algebra, we can use a simpler contrastive loss that yields identical intuition: the triplet loss or the contrastive loss (Chopra et al.). But since the blog aims at InfoNCE, we can demonstrate the parameter update with a tiny surrogate model where we directly di"erentiate the loss with respect to the word embeddings.\nInstead of doing the full matrix calculus, we can make an approximation: imagine our encoder is just a single linear layer that directly produces the final vector from\nthe average of word embeddings (which it is). The gradient with respect to a word embedding is simply the gradient of the loss with respect to the sentence vector, divided equally among its words.\nWe already computed the loss L = 0.690. To lower it, we need to increase the similarity between q and k+ and decrease the similarity between q and k-.\nA manual "gradient" step: we can nudge the sentence vectors by a small amount in the direction of the positive minus the negative. Specifically, we can adjust the query  towards q k + and away from k −. Let\'s set learning rate η = 0.1.\n==> picture [267 x 47] intentionally omitted <==\nBut for pen and paper, we can just do a vector addition that makes intuitive sense: move q a li!le towards k+ and away from k-.\n==> picture [498 x 37] intentionally omitted <==\nA combined nudge: Δ q = η ( δ + + δ −) = 0.1 × [−0.15, −0.01] = [−0.015, −0.001] . New q = [0.30-0.015, 0.23-0.001] = [0.285, 0.229].\nNow recompute similarity with k+ and k-; you’ll see the gap widen. 
Repeat many times, and the positive pair dominates.\nNow distribute this sentence-level gradient to the word embeddings. Each word in the sentence gets 1/(sentence length) of the gradient. We update each word’s embedding accordingly. This is exactly how backpropagation tunes the parameters.\nAfter thousands of such tiny updates across millions of sentence pairs, the word embeddings (and in a real model, the Transformer weights) learn to map semantically similar sentences close together.'},
 {'title': '4. The Research Paper That Changed the Game: SimCSE',
  'content': 'If you want to point to one paper that made contrastive learning for text embeddings simple and spectacular, it’s SimCSE (Simple Contrastive Learning of Sentence Embeddings) by Gao et al., 2021.\nThe genius of SimCSE lies in its unsupervised approach: take a sentence, pass it through a pre-trained language model twice with di"erent dropout masks. The two resulting embeddings are treated as a positive pair. Why does this work? Because the model learns to ignore the randomness of dropout and focus on the stable semantic content. All other sentences in the batch become negatives.\nThis is beautifully elegant: no need for hand-crafted augmentations like word deletion or swapping. Dropout, already present in the Transformer, provides just enough noise. The loss function is exactly the InfoNCE we calculated, with cosine similarity and temperature.\nSimCSE achieved state-of-the-art on the Semantic Textual Similarity (STS) benchmark, beating many supervised models. Its simplicity ignited a wave of research.\nKey takeaway: The contrastive framework works as long as you have a smart way to generate positive pairs from unlabeled text. SimCSE used dropout; later works (like ConSERT, LaBSE) explored other augmentations.'},
 {'title': '5. The Apex: Qwen3 Embedding and the Modern Recipe',
  'content': 'Now let’s zoom out from our toy 2-dimensional library to the real thing. Qwen3 Embedding (2025) is one of the latest and greatest text embedding models, and its architecture and training directly build on the contrastive ideas we’ve just explored.\nHere’s how Qwen3 Embedding stands on the shoulders of giants and adds its own magic:\n\nUnlike SimCSE’s BERT (an encoder-only model), Qwen3 uses a decoder-only\nTransformer – the same architecture as GPT. It takes a sentence, appends a special\n[EOS] token, and then extracts the hidden state of that token from the last layer as the sentence embedding. This leverages the powerful sequence understanding of large language models.\n\nQwen3 Embedding is not trained in one go. It follows a meticulous three-stage recipe:\n\nQwen3 Embedding can be steered with natural language instructions. For example, when embedding a query, you prepend: “Represent this sentence for searching relevant passages:” followed by the query. The document side remains instructionfree. This allows a single model to perform well on diverse tasks (classification, clustering, retrieval) just by changing the prefix.\n\nEver wanted to shrink your embeddings from 4096 dimensions to 256 without retraining? Qwen3 supports MRL, meaning the first few dimensions already contain\ndimensions and still get decent results, saving storage in vector databases.\n\nYour tiny pen-and-paper model is not so di"erent from Qwen3 Embedding. Both do the same dance: take a query and candidates, compute similarity, and use the InfoNCE loss to pull matches together and push mismatches apart. The di"erences are scale and sophistication:\nEncoder : Our averaging trick became a giant Transformer.\nContrastive learning is the quiet hero behind every modern semantic search system. 
The next time you type a query into a chatbot or a document search bar, remember that invisible librarian, tirelessly sorting vectors so your answer appears like magic.\n\nNow that you’ve seen the theory and the giants, try this yourself:\nYou’ll have built a contrastive text embedding model from scratch, and you’ll never look at a vector database the same way again.\nHappy learning, and may your vectors always be meaningful!\n\nName Email Your name you@example.com Comment Your comment... Post Comment'}]
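The full dump is long, and a per-chunk summary is often easier to scan for problems such as comment-form residue or merged headings. A small sketch follows; summarize_chunks is a hypothetical helper, and the demo list stands in for topic_wise.

```python
def summarize_chunks(chunks: list[dict], width: int = 50) -> list[str]:
    """Return one line per chunk: index, truncated title, content length."""
    lines = []
    for i, chunk in enumerate(chunks):
        title = chunk["title"][:width]
        lines.append(f"{i:>2}  {title:<{width}}  {len(chunk['content']):>6} chars")
    return lines

# Stand-in data; in the notebook this would be topic_wise.
demo = [
    {"title": "Cleaning House: The First Fire", "content": "Raw data is full of junk."},
    {"title": "Making It All Uniform", "content": "Data comes in a hundred shapes."},
]
print("\n".join(summarize_chunks(demo)))
```

Very short or very long chunks stand out immediately in this view and are good candidates for a manual look before extraction.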

Stage 2: Extracting Atomic Facts

The goal here is to compress each topic chunk into statements that are individually correct, self-contained, and verifiable. Calling these "atomic facts" is a constraint, not just a name: it forces the extraction prompt away from paraphrase and toward precise claims.

This decomposition matters for the next stage: when the QA generator writes an answer, it must synthesize from these statements. That means hallucinated answers are easier to detect — an answer that cannot be grounded in any extracted fact is a signal that the prompt or model needs adjustment.

Setting Up the LLM

The NVIDIA API exposes an OpenAI-compatible endpoint, so ChatOpenAI works without a custom provider class. Swapping BASE_URL and CHAT_MODEL_NAME is all that's needed to point this pipeline at any OpenAI-compatible backend — Ollama, vLLM, or the actual OpenAI API.

API_KEY is read from the environment. A notebook that embeds credentials in source tends to leak them into version control sooner or later.

python

from langchain_openai import ChatOpenAI

BASE_URL = "https://integrate.api.nvidia.com/v1"
API_KEY = os.getenv("NVIDIA_API_KEY")
CHAT_MODEL_NAME = "openai/gpt-oss-120b"

chat_model = ChatOpenAI(
    model=CHAT_MODEL_NAME,
    api_key=API_KEY,
    base_url=BASE_URL
)
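Because os.getenv returns None when the variable is missing, a forgotten export tends to surface as a less obvious error later on. A minimal fail-fast sketch is shown below; require_env is a hypothetical helper, not part of the original notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Usage, assuming the same variable name as in this post:
# API_KEY = require_env("NVIDIA_API_KEY")
```

Raising at the top of the notebook keeps the failure next to its cause instead of inside the first batch call.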

A quick sanity check confirms the endpoint is reachable and the key is valid before running any expensive batch operations.

python

chat_model.invoke("hello")

Out[19]:
AIMessage(content='Hello! 👋 How can I help you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 35, 'prompt_tokens': 66, 'total_tokens': 101, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'model_provider': 'openai', 'model_name': 'openai/gpt-oss-120b', 'system_fingerprint': None, 'id': 'chatcmpl-a51f066166a10732', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--019e5a84-fc37-79a2-92ad-a0a042227266-0', tool_calls=[], invalid_tool_calls=[], usage_metadata={'input_tokens': 66, 'output_tokens': 35, 'total_tokens': 101, 'input_token_details': {}, 'output_token_details': {}})

Defining the Output Schema

with_structured_output(AtomicFacts) passes this schema to the LLM along with the request, as a tool definition or a JSON schema depending on the method langchain-openai selects, and constrains the response to that structure. The result is deserialized directly into AtomicFacts: no parsing, no regex.

Wrapping each fact in its own AtomicFact model rather than returning a plain list[str] lets the field description appear in the schema the LLM sees. That description tends to nudge output quality more reliably than prompt instructions alone.

python

from pydantic import BaseModel, Field

class AtomicFact(BaseModel):
    """AtomicFact"""
    fact: str = Field(description="A single self-contained factual statement, declarative sentence")

class AtomicFacts(BaseModel):
    """AtomicFacts"""
    facts: list[AtomicFact] = Field(description="List of atomic facts extracted from the text chunk")
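To check that the field description really ends up in the schema, the JSON Schema can be printed with Pydantic v2's model_json_schema. This is a sketch for inspection only: the models are repeated so the block runs on its own, and the exact payload LangChain sends to the endpoint may be laid out differently.

```python
import json
from pydantic import BaseModel, Field

class AtomicFact(BaseModel):
    """AtomicFact"""
    fact: str = Field(description="A single self-contained factual statement, declarative sentence")

class AtomicFacts(BaseModel):
    """AtomicFacts"""
    facts: list[AtomicFact] = Field(description="List of atomic facts extracted from the text chunk")

# model_json_schema() is Pydantic v2's JSON Schema export.
schema = AtomicFacts.model_json_schema()
print(json.dumps(schema, indent=2))

# The nested model lands under "$defs", with the field description intact.
fact_description = schema["$defs"]["AtomicFact"]["properties"]["fact"]["description"]
print(fact_description)
```

With a plain list[str] there would be no per-item description to carry, which is the practical argument for the wrapper model.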

The Extraction Prompt

The six rules in SYSTEM_PROMPT_EXTRACT_FACTS do most of the work:

Rules 1–2 enforce the atomic constraint: one declarative sentence, no meta-framing
Rule 3 filters metadata noise that appears when PDFs are converted from blog posts (dates, navigation, image dimensions)
Rules 4–5 discourage redundancy and hallucination
Rule 6 frames extraction in terms of the downstream task, which tends to improve relevance

The user template passes the section title alongside the content. Including the title helps the model resolve pronouns and implicit references that appear within a section.

python

SYSTEM_PROMPT_EXTRACT_FACTS = """\
You are a meticulous fact extractor for building a high-quality training dataset.

Read the text and extract **meaningful, substantive facts** that represent the core domain knowledge. Focus on:

- Definitions, mechanisms, processes, categorizations, comparisons, causal relationships, and named techniques/concepts explicitly stated.
- Actionable information such as what a concept is, how it works, when to use it, what its properties are, or how it differs from another concept.

RULES:
1. Each fact must be a single, self-contained declarative sentence.
2. State the fact directly — NEVER use meta-framing like "The text says...", "The author states...", "The article mentions...", "According to...".
3. Skip trivial/metadata: dates, author names, reading time, picture dimensions, page structure, layout information, navigation elements.
4. Skip redundant restatements of the same idea. If two sentences convey the same core fact, keep only one.
5. Do not add, infer, or speculate beyond what is explicitly stated.
6. Extract only facts that would be useful as ground-truth context for answering a question about this topic."""

USER_PROMPT_TEMPLATE = """\
Title: {title}

Content:
{content}
"""

The chain uses .with_structured_output(AtomicFacts) to bind extraction to the schema. chain.invoke passes both system and user messages in a single call, and the response comes back as an AtomicFacts object — the list comprehension pulls out just the string values.

python

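chain = chat_model.with_structured_output(AtomicFacts)

def extract_facts(chunk: dict) -> list[str]:
    """Return the atomic facts extracted from one {"title", "content"} chunk."""
    # Send the system rules and the formatted chunk in a single request.
    response = chain.invoke([
        {"role": "system", "content": SYSTEM_PROMPT_EXTRACT_FACTS},
        {"role": "user", "content": USER_PROMPT_TEMPLATE.format(
            title=chunk["title"], content=chunk["content"]
        )}
    ])
    # response is an AtomicFacts instance; keep only the fact strings.
    return [f.fact for f in response.facts]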

Running extraction on a single chunk before batching confirms the prompt is producing the right kind of output. The facts below come from the first topic section.

python

facts1 = extract_facts(topic_wise[0])
facts1

Out[27]:
['If data is bad, the resulting model will be bad, regardless of how fancy the architecture is or how many GPU hours are used.',
 'Data provides a language model with its personality, its skills, and its blind spots.',
 'Training without good data only consumes expensive hardware without improving model performance.',
 'Data engineering converts messy, raw material such as text, code, and conversations into refined data that enables a model to function effectively.',
 'Data engineering is a methodical process rather than a magical one.']
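Rule 2 of the prompt can also be spot-checked in code instead of by eye. The sketch below flags facts that open with meta-framing; flag_meta_framing is a hypothetical helper, the pattern covers only the phrases named in the rule, and the second and third sample facts are made-up violations written for this example.

```python
import re

# Phrases that rule 2 of the extraction prompt forbids at the start of a fact.
META_FRAMING_RE = re.compile(
    r"^\s*(the (text|author|article)\s+(says|states|mentions)|according to)\b",
    re.IGNORECASE,
)

def flag_meta_framing(facts: list[str]) -> list[str]:
    """Return the facts that open with meta-framing."""
    return [fact for fact in facts if META_FRAMING_RE.search(fact)]

sample_facts = [
    "Data engineering is a methodical process rather than a magical one.",
    "The text says data engineering is methodical.",         # made-up violation
    "According to the author, data shapes a model's skills.",  # made-up violation
]
print(flag_meta_framing(sample_facts))
```

A non-empty result on real output is a hint that the prompt or the model needs adjusting before the batch run.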

Extracting Facts from All Chunks

The loop processes the first five topic chunks, appending each chunk's facts as a list. golden_facts ends up as a list of lists — one inner list per topic chunk.

Five chunks is enough to validate the pipeline end-to-end. In production, run all chunks and add retry logic for API failures.
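The retry logic mentioned above might look like the following sketch. It is one possible approach, not the author's implementation: with_retries is a hypothetical helper, it retries on any exception, and the attempt count and delays are arbitrary defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, plus jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")

# Usage with the loop in this post (extract_facts is defined earlier):
# golden_facts.append(with_retries(lambda: extract_facts(topic)))
```

The sleep parameter exists so the backoff can be skipped in tests. LangChain runnables also expose a with_retry method, which may be a better fit than a hand-rolled wrapper.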

python

from tqdm import tqdm

golden_facts = []

for topic in tqdm(topic_wise[:5], desc="Generating golden facts..."):
    golden_facts.append(extract_facts(topic))

Generating golden facts...: 100%|██████████| 5/5 [00:18<00:00,  3.62s/it]

The tqdm output shows roughly 3–4 seconds per chunk, most of it spent waiting on the remote API. The extracted facts below show the compression ratio: a full topic section reduces to 4–12 precise statements.

python

golden_facts

Out[38]:
[["Data quality determines a model's performance, so even with advanced architecture and extensive GPU resources, bad data results in a bad model.",
  "A language model's personality, skills, and blind spots are derived from the data it is trained on.",
  'Training on poor data wastes computational resources, effectively only heating expensive hardware without producing a useful model.',
  'Data engineering transforms raw, messy material such as text, code, and conversations into a refined dataset that enables a model to function effectively.'],
 ['When building a chatbot, conversation data is required.',
  'To enable a model to use tools such as a calculator or search engine, tool‑use data is needed, where each example shows the model calling an API and receiving a result.',
  'Single‑turn data consists of one question and one answer.',
  'Multi‑turn data teaches a model to handle back‑and‑forth conversations.',
  'Raw training data can be sourced from web crawls, code repositories, books, and transcripts.',
  'The raw data mix is messy but serves as the initial material for training.',
  'Different training phases require different data: pre‑training requires massive token counts, while fine‑tuning requires highly curated examples.'],
 ['Raw data often contains junk such as duplicate documents, HTML tags, private information (PII), toxic content, and near‑identical copies that cause models to memorize rather than learn.',
  'The process of fixing junk in raw data is called data processing.',
  'Deduplication is a data‑processing step used to remove repeat documents.',
  'Exact deduplication removes identical copies of documents.',
  'Fuzzy deduplication detects documents that are almost the same, such as the same news article appearing on multiple sites.',
  'MinHashLSH is a technique commonly employed for fuzzy deduplication.',
  'LSHBloom is a method designed for deduplication on huge web‑scale datasets that operates without exhausting memory.',
  'Data‑processing pipelines strip formatting junk, trailing spaces, and invisible characters from text.',
  'Stray tokens act as static noise that can confuse a model’s learning process.',
  'The cleaning stage aims to eliminate useless bits while preserving valuable, rare information.'],
 ['Training data must follow a uniform structure for model training.',
  'Instruction fine‑tuning requires (instruction, response) pairs.',
  'Preference fine‑tuning requires (instruction, winning_response, losing_response) triples.',
  'Building a reward model may involve scoring each response.',
  'Incorrect formatting, such as misplaced newlines or missing colons, causes the model to waste effort trying to interpret broken patterns.',
  'Ensuring correct formatting is part of data quality.'],
 ['Chip Huyen’s book lists six dimensions of quality for fine‑tuning data.',
  'The LIMA paper showed that 1,000 carefully chosen examples on a 65‑billion‑parameter model could produce answers preferred over GPT‑4 nearly half the time.',
  'The Yi model team found that 10,000 clean instructions outperformed hundreds of thousands of sloppy ones.',
  'Fewer high‑quality examples can be more effective than many low‑quality ones.',
  'Quality alone is insufficient; a dataset must also cover many situations, or the model will fail on slightly unusual inputs.',
  'Data coverage means ensuring the training set contains a representative mix of examples.',
  'If 80% of real‑world customer questions are about shipping, roughly 80% of the training data should be about shipping.',
  'Training data should also include rare, difficult cases such as dialects, contradictory queries, typos, and emojis.',
  'The “long tail” refers to those rare, difficult cases that occur infrequently.',
  'Feature Activation Coverage is a technique that measures diversity in the model’s internal language representations.',
  'Combinatorial coverage is a technique that helps identify missing combinations of attributes in the data.',
  'Adequate coverage makes a model robust instead of brittle.']]
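
For a full run over every chunk, the retry logic mentioned earlier could look something like the sketch below. It is not part of the original notebook: `call_with_retries` and its parameters are hypothetical names, and catching a bare `Exception` is a placeholder for whatever error types your API client actually raises. The idea is that a transient failure triggers a short, growing pause instead of aborting the whole loop.

```python
import random
import time


def call_with_retries(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying with exponential backoff on any exception.

    Hypothetical helper: narrow the `except` clause to the error types
    your API client raises before relying on this.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 1x, 2x, 4x ... the base delay, plus a little jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky_extract(topic):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return [f"fact about {topic}"]


# Sleeping is disabled here so the demo runs instantly.
result = call_with_retries(flaky_extract, "deduplication", sleep=lambda _: None)
print(result, calls["count"])
```

In the extraction loop, this would replace the direct call, for example `golden_facts.append(call_with_retries(extract_facts, topic))`.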

Stage 3: Generating QA Pairs

The extracted facts become the ground truth for this stage. The QA generator sees only those facts, not the original document, which limits how far it can stray from the approved source and makes every answer checkable against a short list. The restriction is also stated in the prompt: each question must be answerable from the facts list alone.
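
Because the facts are a short list, grounding can be spot-checked automatically. The sketch below is a crude lexical heuristic, not something from the original notebook: `content_words`, `unsupported_sentences`, the stopword list, and the 0.5 threshold are all assumptions chosen for illustration. It flags answer sentences whose vocabulary barely overlaps with the facts, which is a hint to review them by hand rather than proof of hallucination.

```python
import re

# Small, hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is", "it",
             "of", "on", "or", "that", "the", "to", "with"}


def content_words(text):
    """Lowercase word set with common function words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def unsupported_sentences(answer, facts, threshold=0.5):
    """Return answer sentences whose content words mostly do not appear in the facts."""
    fact_vocab = set()
    for fact in facts:
        fact_vocab |= content_words(fact)
    flagged = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if words and len(words & fact_vocab) / len(words) < threshold:
            flagged.append(sentence)
    return flagged


# Made-up example: the second sentence has no support in the fact list.
facts = ["Exact deduplication removes identical copies of documents."]
answer = ("Exact deduplication removes identical copies of documents. "
          "Teams usually schedule it nightly on a GPU cluster.")
print(unsupported_sentences(answer, facts))
```

A paraphrased but faithful sentence can still be flagged, and a fabricated one that reuses fact vocabulary can slip through, so this works best as a triage step before manual review.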

The Alpaca Schema

The Alpaca format uses three fields: instruction (the question), input (optional context, empty for single-turn QA), and output (the answer). It is one of the most widely supported fine-tuning formats, compatible with Axolotl, LLaMA Factory, and many Hugging Face training scripts.

Wrapping pairs in AlpacaQAPairs lets the LLM return all pairs in one structured response, avoiding the overhead of separate API calls per question.

python

class AlpacaQAPair(BaseModel):
    instruction: str = Field(description="The question / user instruction")
    input: str = Field(description="Optional context (empty string for single-turn)")
    output: str = Field(description="The answer grounded in the provided facts")

class AlpacaQAPairs(BaseModel):
    pairs: list[AlpacaQAPair] = Field(description="List of diverse instruction-response pairs")

The Generation Prompt

Two constraints here are easy to overlook:

Question variety: without explicit requirements, models default to "what is X?" for every fact. Listing the question types to cover (how, why, compare, enumerate, true/false) shifts the dataset toward harder, more informative examples.

Answer length: the 200–300 word target pushes the model toward context and supporting detail rather than one-sentence answers. Short answers make poor training examples because they don't demonstrate the reasoning the fine-tuned model should learn.

Forbidding meta-framing ("Based on the facts...") keeps answers in the voice of the target model, not the synthesis pipeline.

python

SYSTEM_PROMPT_GENERATE_QA = """\
You are an expert at creating diverse, high-quality instruction data.

Given a list of atomic facts, generate a set of diverse instruction-response pairs.

Requirements:
- Each question must be answerable entirely from the provided facts — do not use external knowledge.
- Vary question types across the set: include "what", "how", "why", "compare/contrast", "list/enumerate", "explain", "define", and "true/false" or "yes/no" questions.
- Vary difficulty: some questions should be simple fact retrieval (one fact), others should require synthesizing 2-3 facts together.
- Each answer must be **200–300 words long** and explain the concept thoroughly: define it, provide context, include supporting details, and connect related facts — all strictly from the provided facts.
- Do not pad with filler or repetition. Every sentence should add substantive information.
- State answers directly without meta-framing like "Based on the facts...".
- Do NOT generate questions about the text itself (e.g. "What does the article say about X?"). Generate questions a real user would ask.
- Output each pair with instruction (the question), input (empty string), and output (the answer)."""
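
Two of these requirements, the word-count target and the ban on meta-framing, can be checked mechanically once pairs come back. The sketch below is an illustration rather than part of the original pipeline: `check_pair`, the phrase list, and the sample records are made up for the example, and the 200 and 300 bounds simply mirror the prompt.

```python
# Phrases treated as meta-framing (an illustrative, incomplete list).
META_PHRASES = ("based on the facts", "according to the facts", "the provided facts")


def check_pair(pair, min_words=200, max_words=300):
    """Return a list of human-readable problems found in one Alpaca-style dict."""
    problems = []
    n_words = len(pair["output"].split())
    if not min_words <= n_words <= max_words:
        problems.append(f"length {n_words} words outside {min_words}-{max_words}")
    lowered = pair["output"].lower()
    for phrase in META_PHRASES:
        if phrase in lowered:
            problems.append(f"meta-framing: {phrase!r}")
    return problems


# Made-up sample records, shaped like the dicts the pipeline produces.
samples = [
    {"instruction": "What is exact deduplication?", "input": "",
     "output": "Based on the facts, exact deduplication removes identical copies."},
    {"instruction": "Why deduplicate?", "input": "",
     "output": " ".join(["word"] * 250)},
]

for sample in samples:
    print(sample["instruction"], "->", check_pair(sample) or "ok")
```

Run over the generated pairs, a check like this gives a quick count of how many answers miss the length target before anyone reads them by hand. Whitespace splitting is only an approximation of word count, which is adequate for a threshold this loose.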

qa_chain reuses the same chat_model instance bound to the AlpacaQAPairs schema. The loop iterates over golden_facts, formats each fact list as a bulleted string, and accumulates all pairs into all_pairs.

python

qa_chain = chat_model.with_structured_output(AlpacaQAPairs)

all_pairs = []
for chunk_facts in tqdm(golden_facts, desc="Generating QA pairs..."):
    facts_text = "\n".join(f"- {f}" for f in chunk_facts)
    resp = qa_chain.invoke([
        {"role": "system", "content": SYSTEM_PROMPT_GENERATE_QA},
        {"role": "user", "content": f"Atomic facts:\n\n{facts_text}"}
    ])
    all_pairs.extend(p.model_dump() for p in resp.pairs)

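Skimming a few pairs at this point is worth the time: answer quality directly determines fine-tune quality, and prompt issues show up clearly here. In the output below, for example, the true/false answers fall well short of the 200-word target, and the "List the key transformations" answer describes steps such as train/validation/test partitioning that appear nowhere in the extracted facts.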

python

all_pairs

Out[45]:
[{'instruction': "What is the relationship between data quality and a language model's performance?",
  'input': '',
  'output': "Data quality is the primary determinant of a language model's performance. When the dataset contains accurate, relevant, and well‑structured information, the model learns patterns that translate into reliable predictions and useful outputs. Conversely, if the data are noisy, inaccurate, or misaligned with the intended tasks, the model inherits those deficiencies, resulting in poor performance regardless of how sophisticated the architecture is or how many GPUs are deployed. The direct link between data quality and performance explains why models trained on high‑quality corpora consistently outperform those built on inferior material. This principle also extends to the model's emergent characteristics: the personality, skills, and blind spots that appear during inference are all derived from the training data, meaning that low‑quality inputs produce limited abilities and pronounced gaps. Moreover, a refined dataset, produced through data engineering, elevates data quality by converting raw, messy material into a form that the model can exploit efficiently. In sum, the chain from data quality through refined datasets to model behavior demonstrates that without good data, even the most advanced computational resources cannot deliver a successful model."},
 {'instruction': 'How does training on poor data affect computational resources such as GPUs?',
  'input': '',
  'output': 'Training on poor data consumes computational resources without yielding meaningful improvements, effectively turning valuable GPU cycles into heat generators. The process of feeding large volumes of low‑quality text, code, or conversation data into a model forces the hardware to perform extensive calculations, yet the resulting model internalizes flawed patterns that do not enhance performance. This waste manifests as electricity usage and hardware wear while the model fails to acquire useful knowledge, because the data do not provide the signal needed for learning. Consequently, organizations incur costs for GPU time, power, and cooling without gaining a functional model, undermining the return on investment. Data engineering can mitigate this inefficiency by filtering out irrelevant or noisy inputs before training, thereby ensuring that the remaining dataset possesses sufficient quality to justify the computational expense. By transforming raw material into a refined dataset, data engineering aligns the computational effort with the potential for model improvement, preventing the scenario where GPUs merely heat up while producing a subpar system.'},
 {'instruction': "Why does a language model's personality, skills, and blind spots derive from its training data?",
  'input': '',
  'output': "A language model internalizes patterns, facts, and stylistic cues present in the data it processes, which directly shapes its observable behavior. The personality that emerges—whether formal, casual, humorous, or technical—mirrors the tone and distribution of expressions found in the training corpus. Likewise, the skills the model exhibits, such as code generation or answering factual questions, correspond to the domains and task formats represented in the data. Blind spots appear where the dataset lacks coverage; the model has no basis to infer reliable answers about underrepresented topics, leading to systematic gaps. Because the model does not possess external knowledge beyond what it has seen, every attribute of its output can be traced back to the corpus that provided the learning signal. This dependence underscores the importance of curating a balanced, comprehensive dataset: only by supplying diverse and accurate examples can developers steer the model toward a desired personality, equip it with the intended competencies, and minimize undesirable blind spots. Data engineering plays a crucial role by converting raw, messy inputs into a refined dataset that faithfully reflects the intended characteristics, thereby directly influencing the model's emergent traits."},
 {'instruction': 'List the key transformations that data engineering performs to prepare raw material for a language model.',
  'input': '',
  'output': "Data engineering applies a series of transformations to raw, messy material in order to produce a refined dataset suitable for model training. First, it collects heterogeneous sources such as raw text, source code, and conversational logs, ensuring that the full scope of intended content is captured. Second, it cleans the collected material by removing duplicates, correcting formatting errors, and filtering out irrelevant or harmful entries, thereby improving overall data quality. Third, it normalizes the content, standardizing tokenization, encoding, and language conventions so that the model can process inputs consistently. Fourth, it augments the dataset when needed, balancing underrepresented topics to reduce blind spots and broaden the model's skill set. Fifth, it partitions the refined data into training, validation, and test subsets, enabling effective monitoring of model performance without contamination. Finally, it documents the provenance and preprocessing steps, providing transparency for future audits and reproducibility. Each of these steps contributes to converting raw, unstructured inputs into a high‑quality, structured corpus that empowers the language model to function effectively and achieve reliable performance."},
 {'instruction': 'Compare the outcomes of training a language model with high‑quality data versus low‑quality data.',
  'input': '',
  'output': 'Training a language model with high‑quality data yields a system that performs well, exhibits a coherent personality, and possesses the intended skills, while also minimizing blind spots. The refined dataset, produced through data engineering, supplies clear, accurate signals that the model can learn, leading to efficient use of GPU resources and a favorable return on computational investment. In contrast, training with low‑quality data generates a model that performs poorly, displays an inconsistent or undesired personality, and suffers from pronounced blind spots due to missing or noisy information. The computational effort expended on such training mainly heats expensive hardware without creating a useful product, demonstrating a waste of resources. High‑quality data also enables the model to function effectively across diverse tasks because the underlying patterns are representative and well‑structured. Low‑quality data, however, propagates errors and ambiguities, causing the model to produce unreliable outputs and limiting its applicability. Therefore, the primary distinction lies in the alignment between data quality and model utility: good data translates into strong performance and efficient resource use, whereas bad data results in a weak model and squandered computational power.'},
 {'instruction': 'Explain how data engineering helps prevent wasted GPU resources during model training.',
  'input': '',
  'output': 'Data engineering safeguards GPU resources by ensuring that only high‑quality, relevant material reaches the training pipeline. The process begins with the systematic collection of raw inputs such as text, code, and conversation logs, followed by rigorous cleaning that eliminates duplicates, noise, and invalid entries. By filtering out these wasteful components early, the dataset becomes more compact and informative, reducing the number of unnecessary computation cycles. Normalization and standardization steps further align the data format, allowing the model to process each token efficiently without extra overhead for handling irregularities. When the refined dataset is balanced and comprehensive, the model learns effectively in fewer epochs, meaning that the GPUs spend less time looping over insufficient or misleading examples. Consequently, the training process consumes less electricity, generates less heat, and preserves hardware lifespan, because the computational work directly contributes to model improvement rather than merely warming the machinery. Data engineering thus transforms raw, messy material into a curated corpus that maximizes the informational yield per GPU operation, turning potential waste into productive training effort.'},
 {'instruction': 'True or false: Advanced model architecture can fully compensate for bad training data and still produce a good language model.',
  'input': '',
  'output': "False. Even the most advanced architecture and extensive GPU resources cannot overcome the fundamental limitation that data quality determines a model's performance. Bad data leads to a bad model because the model's personality, skills, and blind spots are derived from the training data, and training on poor data wastes computational resources, effectively heating expensive hardware without producing a useful model. Therefore, superior architecture alone cannot compensate for deficient data, and a refined dataset produced by data engineering remains essential for achieving a good language model."},
 {'instruction': 'Define conversation data and explain why it is required when building a chatbot.',
  'input': '',
  'output': 'Conversation data consists of the textual exchanges that occur between participants in a dialogue. It provides the examples a chatbot observes during training, showing how questions are phrased, how answers are structured, and how turn‑taking is managed. By exposing the model to real conversational patterns, the data teaches the system to generate coherent responses that fit the flow of a discussion. Without conversation data, the model would lack any evidence of how language is used in interactive contexts, making it incapable of producing appropriate replies. The requirement is fundamental: a chatbot’s purpose is to communicate, and the only way to learn communication is from examples of communication. Consequently, conversation data serves as the cornerstone of the learning process, ensuring the model internalizes both content and the dynamics of dialogue.'},
 {'instruction': 'What is tool‑use data and what role does it play in enabling a model to use external tools like a calculator or search engine?',
  'input': '',
  'output': 'Tool‑use data is a collection of examples that demonstrate a model invoking an external API and receiving the resulting output. Each example includes a prompt that triggers a tool call, the exact API request made, and the response returned by the tool. This data teaches the model the sequence of actions required to integrate external functionality: recognizing when a tool is needed, formatting the request correctly, and interpreting the result to continue generating an answer. By training on tool‑use data, the model learns to treat the tool as an extension of its reasoning capabilities, allowing it to offload calculations or information retrieval to specialized services. The presence of these examples bridges the gap between pure language prediction and actionable interfacing, making the model versatile enough to augment its answers with precise calculations or up‑to‑date information drawn from a search engine.'},
 {'instruction': 'Explain how single‑turn data differs from multi‑turn data and what each type teaches a model.',
  'input': '',
  'output': 'Single‑turn data comprises one question followed by one answer, presenting a self‑contained exchange. It teaches a model to map an isolated query directly to an appropriate response, focusing on factual correctness and concise formulation. Multi‑turn data, in contrast, consists of a series of back‑and‑forth exchanges, capturing the evolving context of a conversation. This format trains the model to maintain state, reference earlier statements, handle follow‑up questions, and manage clarification or correction steps. While single‑turn examples build foundational question‑answer mapping, multi‑turn examples develop the ability to track dialogue history, resolve ambiguities across turns, and produce consistent, coherent interactions over longer sessions. Both are essential: single‑turn data supplies the basic response skill, and multi‑turn data extends that skill into the dynamic setting of real conversations.'},
 {'instruction': 'List the sources from which raw training data can be obtained.',
  'input': '',
  'output': 'Raw training data can be sourced from four primary domains: web crawls, which harvest publicly available webpages; code repositories, which provide examples of programming language usage; books, which contribute structured prose and narrative content; and transcripts, which capture spoken language from interviews, lectures, or media. These sources together form a broad and varied corpus, offering the token volume needed for large‑scale language model training.'},
 {'instruction': 'Explain why the raw data mix is described as messy but still serves as the initial material for training.',
  'input': '',
  'output': 'The raw data mix is described as messy because it aggregates content from diverse origins—web crawls, code repositories, books, and transcripts—each with different formats, quality levels, and domains. This heterogeneity leads to inconsistent structure, duplicate information, and varying degrees of relevance. Despite the lack of uniformity, the mixture provides the massive token counts required for the early stages of model development. The sheer volume of varied language patterns, factual statements, and stylistic forms supplies the foundational exposure a model needs to learn general linguistic representations. Consequently, even though the data is uncurated and noisy, it functions as the essential initial material that fuels the broad learning phase before more selective curation occurs in later training stages.'},
 {'instruction': 'Compare the data requirements for pre‑training versus fine‑tuning phases of model development.',
  'input': '',
  'output': 'Pre‑training demands massive token counts, meaning it relies on a huge quantity of raw textual material drawn from web crawls, code repositories, books, and transcripts. The goal is to expose the model to a wide variety of language patterns, allowing it to learn general linguistic structures and world knowledge. In contrast, fine‑tuning requires highly curated examples, where each instance is carefully selected for relevance, quality, and instructional value. Fine‑tuning data often includes conversation data, tool‑use data, and other task‑specific formats that illustrate the desired behavior. While pre‑training focuses on breadth and scale, fine‑tuning emphasizes precision and alignment with specific objectives, such as accurate tool invocation or coherent multi‑turn dialogue. Both phases are necessary: the broad token exposure builds foundational capabilities, and the curated fine‑tuning refines those capabilities to meet targeted performance criteria.'},
 {'instruction': 'Why are highly curated examples required for the fine‑tuning phase?',
  'input': '',
  'output': 'Highly curated examples are required for fine‑tuning because this phase aims to shape the model’s behavior toward specific, high‑quality outcomes. Curated data ensures that each example demonstrates the exact pattern the model should emulate, such as proper tool invocation, coherent multi‑turn interaction, or precise answer formulation. By presenting only well‑structured, error‑free instances, the model receives clear signals about the desired response style and content, reducing the risk of learning undesirable patterns that may be present in the messier raw data. The focused nature of curated examples accelerates alignment with particular tasks, making the model more reliable and consistent when deployed. Hence, rigorous selection and editing of fine‑tuning data are essential to translate the general knowledge acquired during pre‑training into specialized, dependable performance.'},
 {'instruction': 'True or false: Multi‑turn data is unnecessary for building a functional chatbot.',
  'input': '',
  'output': 'False. Multi‑turn data teaches a model to handle back‑and‑forth conversations, which are intrinsic to most chatbot interactions. Without exposure to multi‑turn exchanges, a model would lack the ability to maintain context across turns, manage follow‑up questions, and provide coherent dialogue over extended interactions. Therefore, multi‑turn data is necessary for building a functional chatbot.'},
 {'instruction': 'How does tool‑use data enable a model to call an API and incorporate the result into its answer?',
  'input': '',
  'output': 'Tool‑use data provides concrete examples where a model issues an API call, receives a result, and then uses that result to formulate a final response. Each training instance shows the model the trigger phrase that signals a tool is needed, the exact request syntax, and the format of the returned data. By iterating over many such examples, the model learns to recognize situations that require external assistance, generate the correct API invocation, parse the returned information, and seamlessly embed it within its answer. This process transforms the model from a pure text generator into an agent capable of augmenting its reasoning with external computations or up‑to‑date search results, thereby increasing accuracy and usefulness in tasks that depend on precise calculations or current knowledge.'},
 {'instruction': 'Define data processing and explain why it is necessary for raw text datasets.',
  'input': '',
  'output': 'Data processing is the systematic procedure of fixing junk in raw data. Raw text frequently includes duplicate documents, HTML tags, private information (PII), toxic content, and near‑identical copies that cause models to memorize rather than learn. By applying data processing, these undesirable elements are removed or transformed, allowing a model to focus on meaningful linguistic patterns instead of memorizing repetitive or irrelevant material. The process strips formatting junk, trailing spaces, and invisible characters, which, if left unchecked, would act as static noise and confuse the learning algorithm. Moreover, data processing protects privacy by eliminating PII and reduces the risk of toxic outputs by discarding harmful content. Consequently, the cleaned dataset supports more efficient training, better generalization, and a safer deployment of language models. In sum, data processing turns noisy, unstructured raw data into a refined corpus that preserves valuable information while discarding the junk that would otherwise impair model performance.'},
 {'instruction': 'List the typical steps in a data‑processing pipeline that remove junk from text.',
  'input': '',
  'output': 'A typical data‑processing pipeline follows several sequential steps to cleanse raw text. First, it strips formatting junk such as HTML tags and other markup, thereby eliminating structural clutter that does not convey semantic meaning. Second, it removes trailing spaces and invisible characters, which can introduce unintended token boundaries and confuse downstream tokenizers. Third, the pipeline applies deduplication, starting with exact deduplication to discard identical copies of documents that would otherwise cause the model to memorize repetitive content. Fourth, fuzzy deduplication is performed, often using MinHashLSH, to detect and remove documents that are almost the same, such as the same news article republished across multiple sites. Fifth, stray tokens—isolated symbols or fragments that serve as static noise—are identified and eliminated to prevent confusion during learning. Finally, the cleaning stage reviews the remaining text to ensure that useless bits have been eliminated while preserving valuable, rare information, thereby maintaining a balance between thorough cleaning and information retention. Each step builds upon the previous one to progressively refine the dataset, ensuring that the final corpus is both clean and informative.'},
 {'instruction': 'Explain the difference between exact deduplication and fuzzy deduplication.',
  'input': '',
  'output': 'Exact deduplication and fuzzy deduplication are two distinct data‑processing steps aimed at removing redundant documents, but they differ in the criteria they use to identify duplicates. Exact deduplication removes identical copies of documents; it looks for byte‑for‑byte matches, ensuring that any document that is precisely the same as another is eliminated. This approach is straightforward and effective when duplicates are perfect replicas, such as multiple uploads of the same file. In contrast, fuzzy deduplication identifies documents that are almost the same, not requiring an exact match. It detects near‑identical copies, such as a news article reproduced on different websites with minor editorial changes or formatting variations. To achieve this, fuzzy deduplication employs similarity‑detecting techniques, most commonly MinHashLSH, which can capture the degree of overlap between documents without demanding exact equality. While exact deduplication swiftly removes perfect duplicates, fuzzy deduplication is essential for cleaning large, heterogeneous corpora where near‑duplicate content can still cause models to memorize rather than learn. Both steps together ensure that the dataset is free from redundant material, improving model efficiency and generalization.'},
 {'instruction': 'How does MinHashLSH facilitate fuzzy deduplication in large text collections?',
  'input': '',
  'output': 'MinHashLSH facilitates fuzzy deduplication by providing a scalable method for estimating the similarity between documents without comparing each pair directly. The technique first generates MinHash signatures for each document, which are compact representations that preserve the Jaccard similarity of the underlying token sets. These signatures are then organized using Locality Sensitive Hashing (LSH), which groups together signatures that are likely to be similar. When two documents produce signatures that fall into the same LSH bucket, they are flagged as candidates for fuzzy duplication because their token sets share a high degree of overlap. This process allows the system to detect documents that are almost the same—such as the same news article appearing on multiple sites—while avoiding the computational cost of exhaustive pairwise comparisons. By focusing only on promising pairs, MinHashLSH efficiently identifies near‑identical content even in massive corpora, supporting the fuzzy deduplication step of data processing. The technique thereby reduces redundancy, prevents unnecessary memorization by language models, and helps preserve the diversity of unique information in the cleaned dataset.'},
 {'instruction': 'Compare LSHBloom and MinHashLSH for deduplication on huge web‑scale datasets.',
  'input': '',
  'output': 'LSHBloom and MinHashLSH are both designed to support deduplication on massive datasets, yet they differ in architecture and memory usage. MinHashLSH creates compact MinHash signatures for each document and uses Locality Sensitive Hashing to group similar signatures, enabling fuzzy deduplication by estimating Jaccard similarity. While effective, MinHashLSH requires storing the signatures and hash tables, which can become memory‑intensive as the dataset grows to web scale. LSHBloom, on the other hand, combines LSH with a Bloom filter, a probabilistic data structure that records the presence of items using a fixed amount of memory. By leveraging the Bloom filter, LSHBloom can operate on huge web‑scale datasets without exhausting memory, sacrificing a small false‑positive rate for substantial space savings. Both methods aim to detect near‑duplicate documents, but LSHBloom’s memory‑efficient design makes it more suitable when resources are constrained, whereas MinHashLSH offers finer similarity estimates at the cost of higher memory consumption. In practice, a pipeline might employ MinHashLSH for moderate‑size collections where precise similarity is essential, and switch to LSHBloom when processing billions of webpages where memory efficiency is paramount.'},
 {'instruction': 'Why are stray tokens considered static noise, and how can they affect a model’s learning process?',
  'input': '',
  'output': 'Stray tokens are isolated symbols or fragments that do not contribute semantic meaning, acting as static noise within a text corpus. Because they appear irregularly and lack contextual relevance, stray tokens introduce meaningless patterns that a language model may inadvertently learn. When a model encounters such noise during training, it allocates capacity to model these irrelevant tokens, diverting attention from genuine linguistic structures. This misallocation can degrade the model’s ability to capture meaningful relationships, leading to poorer generalization on downstream tasks. Moreover, stray tokens can disrupt tokenization pipelines, causing inconsistent token boundaries and increasing the variability of the input space. By confusing the model’s learning process, stray tokens hinder efficient representation learning and can inflate the size of the vocabulary needlessly. Removing stray tokens during the data‑processing stage eliminates this static noise, allowing the model to focus on valuable linguistic patterns and rare information that truly matter for learning.'},
 {'instruction': 'True or false: Fuzzy deduplication can remove duplicate documents that are not identical.',
  'input': '',
  'output': 'True. Fuzzy deduplication is specifically designed to detect and remove documents that are almost the same rather than requiring an exact match. It identifies near‑identical copies, such as the same news article reproduced across multiple sites with minor variations, and removes them to prevent redundancy.'},
 {'instruction': 'How does the cleaning stage balance the removal of useless junk while preserving valuable, rare information?',
  'input': '',
  'output': 'The cleaning stage operates with the dual goal of eliminating useless bits and retaining valuable, rare information. It begins by stripping formatting junk, trailing spaces, and invisible characters, which are universally recognized as non‑informative and thus safely removed. Next, it applies deduplication—both exact and fuzzy—to discard repetitive content that would cause models to memorize rather than learn, while using techniques like MinHashLSH to ensure that only near‑identical copies are eliminated, preserving unique variations. Stray tokens, which act as static noise, are also removed to prevent confusion during learning. Throughout these steps, the pipeline carefully monitors the content that remains, ensuring that rare but informative elements are not mistakenly classified as junk. By focusing on universal noise patterns and employing similarity‑aware deduplication, the cleaning stage filters out the bulk of irrelevant material while safeguarding the distinctive, low‑frequency data that enriches model understanding. This balanced approach yields a refined dataset that is both clean and rich in valuable information.'},
 {'instruction': 'Define what a uniform structure for training data means and why it is required for model training.',
  'input': '',
  'output': 'A uniform structure for training data refers to a consistent and predictable arrangement of information in each example that the model will ingest during the training process. Uniformity means that every data point follows the same pattern of fields, delimiters, and ordering, such that the model can reliably locate the instruction, response, or any auxiliary elements without ambiguity. This requirement arises because neural language models learn statistical relationships across large corpora; any deviation from the expected layout forces the model to allocate capacity to interpret irregularities instead of focusing on the semantic content. When the structure is uniform, the model can efficiently map the position of an instruction to its corresponding response, reinforcing the intended mapping during gradient updates. Conversely, irregular formatting introduces noise that competes with the learning signal, leading to slower convergence or degraded performance. Uniform structure also simplifies preprocessing pipelines, allowing automated tokenization and batching without custom handling for outlier cases. In summary, a uniform structure is a foundational data‑quality principle that ensures the model receives clear, repeatable patterns, thereby maximizing learning efficiency and reducing wasted computational effort caused by inconsistent formatting.'},
 {'instruction': 'How does instruction fine‑tuning differ from preference fine‑tuning in terms of required data format?',
  'input': '',
  'output': 'Instruction fine‑tuning and preference fine‑tuning employ distinct data formats that reflect their training objectives. Instruction fine‑tuning requires pairs composed of an instruction and a single response. Each example follows the (instruction, response) format, supplying the model with a direct mapping that it should learn to reproduce. In contrast, preference fine‑tuning relies on triples that include an instruction, a winning response, and a losing response. The (instruction, winning_response, losing_response) format provides comparative information, allowing the model to discern which answer aligns better with desired criteria. The presence of both winning and losing responses enables a preference model to learn a ranking or reward signal, rather than merely copying a single answer. Consequently, instruction fine‑tuning emphasizes exact reproduction of a target response, while preference fine‑tuning emphasizes relative quality assessment across alternatives. The differing formats also affect downstream steps: instruction fine‑tuning can be evaluated directly by checking response fidelity, whereas preference fine‑tuning often integrates a reward model that scores responses based on the comparative signal supplied in the triples. Thus, the core distinction lies in the number and role of responses attached to each instruction.'},
 {'instruction': 'Why is ensuring correct formatting considered a part of data quality, and what are the consequences of incorrect formatting?',
  'input': '',
  'output': "Ensuring correct formatting is integral to data quality because the model's learning process depends on recognizing and exploiting consistent patterns. When formatting adheres to the expected layout—such as proper placement of colons, correct newline usage, and uniform delimiters—the model can swiftly parse each element (instruction, response, or triple) without expending resources on error correction. Incorrect formatting, exemplified by misplaced newlines or missing colons, disrupts this parsing. The model then misinterprets the boundaries between fields, leading to wasted computational effort as it attempts to reconstruct the intended pattern. This wasted effort reduces effective learning capacity, slows convergence, and can introduce spurious associations that degrade final performance. Moreover, inconsistent formatting hampers automated preprocessing pipelines, requiring additional cleaning steps that increase engineering overhead and risk further data loss. By treating formatting as a data‑quality metric, teams enforce a disciplined approach that preserves the signal‑to‑noise ratio, ensures that every token contributes meaningfully to the target behavior, and prevents the model from learning to accommodate malformed examples. In sum, correct formatting safeguards both computational efficiency and the integrity of the learned representations, while incorrect formatting undermines these goals through unnecessary interpretation work and potential model confusion."},
 {'instruction': 'List the components required for instruction fine‑tuning and for preference fine‑tuning, and explain the role of each component.',
  'input': '',
  'output': 'Instruction fine‑tuning requires two components: (1) the instruction, which specifies the task or query the model should address, and (2) the response, which provides the correct answer or behavior the model must emulate. The instruction guides the model’s attention toward the intended context, while the response supplies the target output that the model learns to generate during supervised updates. Preference fine‑tuning requires three components: (1) the instruction, identical in purpose to that used in instruction fine‑tuning, (2) the winning response, which represents the preferred or higher‑quality answer, and (3) the losing response, which exemplifies a less desirable answer. The winning and losing responses together create a comparative signal; the model learns to assign higher probability or reward to the winning response when presented with the same instruction. This comparative structure enables the model to develop a sense of preference, rather than merely copying a single answer. In both settings, the instruction anchors the learning context, while the response elements supply the performance target—either a single correct output or a ranking between two alternatives. The distinct component sets align with the differing objectives: direct imitation for instruction fine‑tuning versus preference‑based ranking for preference fine‑tuning.'},
 {'instruction': 'Explain how building a reward model may involve scoring each response and how this relates to preference fine‑tuning.',
  'input': '',
  'output': 'Building a reward model typically involves assigning a numerical score to each response generated for a given instruction. The scoring process quantifies the desirability of an answer based on criteria such as relevance, correctness, or alignment with human preferences. In the context of preference fine‑tuning, the reward model uses the (instruction, winning_response, losing_response) triples as training data. For each triple, the model learns to give a higher score to the winning response than to the losing response. By repeatedly adjusting its parameters to increase the score gap, the reward model internalizes the comparative judgments embedded in the triples. Once trained, the reward model can evaluate any new response by producing a scalar value that reflects its estimated quality. This score can then guide reinforcement learning or further fine‑tuning steps, where the language model is optimized to maximize the expected reward. Thus, scoring each response is the core mechanism that translates the qualitative preference information from the triples into a quantitative signal, enabling systematic improvement of the language model’s output according to the preferences encoded in the training data.'},
 {'instruction': 'True or false: Preference fine‑tuning can be performed using only (instruction, response) pairs.',
  'input': '',
  'output': 'False. Preference fine‑tuning specifically requires triples that contain an instruction together with both a winning response and a losing response. The format (instruction, winning_response, losing_response) provides the comparative information necessary for the model to learn which of two answers is preferred. Using only (instruction, response) pairs would supply a single answer per instruction, which aligns with the data requirements of instruction fine‑tuning, not preference fine‑tuning. Therefore, the statement that preference fine‑tuning can be performed with only (instruction, response) pairs contradicts the defined data format for that training paradigm.'},
 {'instruction': "Compare the impact of misplaced newlines versus missing colons on a model's ability to interpret training examples.",
  'input': '',
  'output': 'Misplaced newlines and missing colons both constitute formatting errors, yet they affect model interpretation in subtly different ways. A misplaced newline disrupts the line‑based segmentation of fields. When the model expects a newline to separate the instruction from the response, an extra or omitted newline can cause the instruction and response to merge or split incorrectly, making it ambiguous where one component ends and the next begins. This ambiguity forces the model to expend additional computation to infer boundaries, often leading to misassigned tokens and reduced learning efficiency. Missing colons, on the other hand, eliminate a clear delimiter that signals the start of a new field, such as the transition from "Instruction:" to the actual instruction text. Without the colon, the model may treat the label and content as a single token sequence, obscuring the structural cue that differentiates metadata from content. Both errors cause the model to waste effort interpreting broken patterns, but misplaced newlines primarily interfere with vertical (line‑based) separation, while missing colons disrupt horizontal (character‑based) delimitation. In practice, either error degrades data quality, yet the specific parsing difficulty differs: newlines affect line breaks, whereas colons affect token‑level field identification.'},
 {'instruction': 'Define the six dimensions of quality for fine‑tuning data according to Chip Huyen’s book and explain why each dimension matters.',
  'input': '',
  'output': 'Chip Huyen’s book outlines six distinct dimensions of quality for fine‑tuning data, treating each dimension as a separate axis that contributes to overall dataset effectiveness. The first dimension concerns the correctness of the content, ensuring that every example reflects accurate information. The second dimension addresses relevance, meaning the data must directly support the target tasks the model will perform. The third dimension focuses on diversity, requiring a range of phrasing, topics, and contexts so the model does not over‑fit to a narrow style. The fourth dimension involves clarity, where examples are free of ambiguity and clearly communicate intent. The fifth dimension is consistency, guaranteeing that labeling and formatting follow a uniform standard throughout the set. The sixth dimension emphasizes completeness, making sure that the data collectively covers the full spectrum of situations the model may encounter. By evaluating each fine‑tuning example against these six criteria, practitioners can systematically gauge whether the data will help the model generalize, avoid brittleness, and produce reliable outputs across varied inputs. Neglecting any single dimension can create gaps that undermine the model’s performance, while strengthening all six creates a robust foundation for high‑quality model behavior.'},
 {'instruction': 'Compare the findings of the LIMA paper with those of the Yi model team regarding the relationship between data quantity and data quality.',
  'input': '',
  'output': 'The LIMA paper demonstrated that a modest set of 1,000 carefully chosen examples was enough to make a 65‑billion‑parameter model generate answers preferred over GPT‑4 nearly half the time, highlighting the power of extremely high‑quality, curated data. In contrast, the Yi model team reported that 10,000 clean instructions—still a relatively small number—outperformed hundreds of thousands of sloppy examples, reinforcing the principle that clean, well‑structured data can outweigh sheer volume. Both studies converge on the insight that data quality can dominate quantity: the LIMA result shows that one thousand premium examples can rival a far larger, less refined corpus, while the Yi result quantifies a ten‑fold advantage of clean instructions over a massive pool of noisy ones. Together, these findings challenge the assumption that larger datasets automatically produce better models; instead, they suggest that strategic selection and purification of examples yield disproportionately strong performance gains. The evidence from both works underlines the practical advantage of investing effort in curating high‑quality data rather than merely amassing vast but noisy collections.'},
 {'instruction': 'Explain why fewer high‑quality examples can be more effective than many low‑quality ones.',
  'input': '',
  'output': 'Fewer high‑quality examples can be more effective than many low‑quality ones because they provide clear, accurate signals that guide the model toward the desired behavior without introducing confusion. High‑quality data ensures correctness, relevance, and consistency, allowing the model to internalize reliable patterns during training. When low‑quality data dominates, the model encounters contradictory or ambiguous signals, which can dilute learned representations and lead to brittleness. The LIMA paper’s result—1,000 carefully chosen examples outperforming far larger noisy sets—illustrates that a small, curated collection can produce answers preferred over GPT‑4 nearly half the time. Similarly, the Yi model team found that 10,000 clean instructions beat hundreds of thousands of sloppy ones, confirming that clean, well‑structured data outweighs sheer quantity. These observations reinforce the broader principle that “fewer high‑quality examples can be more effective than many low‑quality ones,” because quality alone shapes the internal language representations more decisively than volume. By focusing on meticulous curation, practitioners reduce the risk of the model learning undesirable patterns, thereby achieving stronger performance with less data.'},
 {'instruction': 'Define data coverage and describe how it should be balanced with data quality when constructing a training set.',
  'input': '',
  'output': 'Data coverage refers to ensuring that a training set contains a representative mix of examples that span the range of situations a model may encounter. Coverage means the dataset reflects both common and rare cases, allowing the model to handle typical inputs and edge‑case scenarios alike. Quality alone is insufficient; a dataset must also cover many situations, or the model will fail on slightly unusual inputs. To balance coverage with quality, practitioners allocate examples proportionally to real‑world frequencies: if 80\u202f% of customer questions involve shipping, roughly 80\u202f% of the training data should address shipping, maintaining relevance to the dominant use case. At the same time, the remaining 20\u202f% should be devoted to rare, difficult cases such as dialects, contradictory queries, typos, and emojis, which constitute the “long tail.” These rare examples, though infrequent, are essential for robustness because they expose the model to diverse language patterns. By combining high‑quality examples across both high‑frequency and low‑frequency domains, the training set achieves comprehensive coverage while preserving the clarity and correctness that high‑quality data provides. This synergy ensures the model does not become brittle on unusual inputs and can generalize effectively across the full spectrum of user queries.'},
 {'instruction': 'What does the term “long tail” refer to in the context of training data, and why is it important to include long‑tail examples?',
  'input': '',
  'output': 'In the context of training data, the “long tail” refers to rare, difficult cases that occur infrequently in real‑world usage. These cases include dialectal variations, contradictory queries, typographical errors, and emojis—situations that lie far from the high‑frequency core of user inputs. Although each individual long‑tail example appears seldom, the aggregate of these rare events can represent a significant portion of the diversity a model must handle. Including long‑tail examples is important because quality alone does not guarantee robustness; without coverage of these infrequent scenarios, the model can fail on slightly unusual inputs. Adequate coverage of the long tail makes a model robust instead of brittle, ensuring it can gracefully process unexpected or unconventional language. Moreover, when the training set mirrors the distribution of real‑world queries—including both dominant categories like shipping and the long‑tail spectrum—the model learns to generalize across the full range of inputs. Thus, long‑tail inclusion is a critical complement to high‑quality, high‑frequency data, jointly delivering a balanced dataset that prepares the model for both common and edge‑case interactions.'},
 {'instruction': 'Explain the purpose of Feature Activation Coverage and how it contributes to dataset diversity.',
  'input': '',
  'output': 'Feature Activation Coverage is a technique that measures diversity in the model’s internal language representations. By tracking which internal features become active across different training examples, practitioners can assess whether the dataset exercises a broad set of linguistic patterns and concepts. When feature activation is spread widely, the model experiences varied internal states, indicating that the data covers diverse contexts, topics, and linguistic forms. This measurement helps identify gaps where certain features remain dormant, signaling under‑represented aspects of language that could lead to brittleness if left unaddressed. By using Feature Activation Coverage, data curators can intentionally add examples that stimulate missing features, thereby enriching the dataset’s diversity. This process aligns with the broader principle that adequate coverage—both in terms of external scenarios and internal representation diversity—makes a model robust. In summary, Feature Activation Coverage provides a quantitative lens on how well the training data activates the model’s internal mechanisms, guiding the inclusion of high‑quality examples that broaden internal coverage and enhance overall model performance.'},
 {'instruction': 'Describe combinatorial coverage and its role in identifying missing combinations of attributes in a dataset.',
  'input': '',
  'output': 'Combinatorial coverage is a technique that helps identify missing combinations of attributes in the data. It works by enumerating the possible intersections of different attribute values—such as topic, language style, punctuation, or user intent—and checking whether the training set contains examples for each intersection. If a particular combination, for instance a shipping‑related query phrased with emojis and containing a typo, is absent, combinatorial coverage flags this gap. By surfacing missing attribute combinations, the technique guides data engineers to augment the dataset with targeted examples that fill the void, thereby improving overall coverage. This systematic approach complements other coverage strategies, such as ensuring the proportion of shipping queries matches real‑world frequencies and adding rare, difficult cases that constitute the long tail. When combinatorial coverage is applied alongside Feature Activation Coverage, the dataset gains both external diversity (across attribute combinations) and internal diversity (across model representations). The result is a more robust training set that reduces brittleness and equips the model to handle a broader array of inputs.'},
 {'instruction': 'True or false: A model trained only on high‑frequency shipping queries will reliably handle dialectal queries that contain emojis.',
  'input': '',
  'output': 'False. Training a model solely on high‑frequency shipping queries provides strong coverage for common customer concerns but neglects the rare, difficult cases that form the long tail, such as dialectal variations, contradictory queries, typographical errors, and emojis. Quality alone is insufficient; a dataset must also cover many situations, or the model will fail on slightly unusual inputs. Without examples that include dialects and emojis, the model will lack exposure to the internal language representations needed to process those forms, leading to brittleness. Adequate coverage that incorporates both dominant categories and long‑tail cases is required for robustness, ensuring the model can handle both typical shipping queries and atypical dialectal queries with emojis.'},
 {'instruction': 'Why is adequate coverage essential for turning a model from brittle to robust?',
  'input': '',
  'output': 'Adequate coverage is essential for turning a model from brittle to robust because it ensures the training set reflects the full spectrum of inputs the model may encounter. When coverage is limited to high‑frequency scenarios, the model learns strong patterns for those cases but remains unexposed to rare or unusual inputs, making it brittle—susceptible to errors on slightly atypical queries. By including a representative mix of examples, including both common situations (e.g., 80\u202f% shipping questions) and long‑tail cases such as dialects, contradictory queries, typos, and emojis, the model experiences a wider range of linguistic patterns during training. This exposure trains internal language representations to handle variability, as measured by techniques like Feature Activation Coverage, and helps identify missing attribute combinations through combinatorial coverage. The result is a model that maintains performance across diverse inputs rather than collapsing when faced with edge cases. In essence, adequate coverage transforms a narrow, over‑fitted system into a flexible, generalizable one, embedding robustness directly into the data foundation.'},
 {'instruction': 'Summarize how the principles of quality, coverage, and specialized techniques like Feature Activation Coverage and combinatorial coverage together create a robust fine‑tuning dataset.',
  'input': '',
  'output': 'Creating a robust fine‑tuning dataset requires intertwining three core principles: quality, coverage, and specialized measurement techniques. Quality, as emphasized by the LIMA paper and the Yi model team, means selecting clean, well‑structured examples that convey correct information, relevance, and consistency; a small set of high‑quality data can outperform vastly larger, noisy collections. Coverage expands beyond quality by ensuring the dataset represents the full distribution of real‑world inputs: proportionally reflecting dominant topics (e.g., 80\u202f% shipping questions) while also embedding rare, difficult cases—the long tail—such as dialects, contradictory queries, typos, and emojis. Without coverage, even perfect quality leaves the model brittle on unusual inputs. Feature Activation Coverage quantifies internal diversity by tracking which language representations fire across examples, revealing under‑represented internal states that need enrichment. Combinatorial coverage complements this by systematically checking that all attribute intersections (topic, style, punctuation, intent) appear in the data, pinpointing absent combinations. Together, high‑quality examples supply clear learning signals, comprehensive coverage supplies the breadth of scenarios, and the two coverage techniques provide diagnostic feedback to fill gaps. This integrated approach ensures the model internalizes both accurate patterns and diverse representations, resulting in a robust system that performs reliably across common and long‑tail queries.'}]

Saving the Dataset

pd.DataFrame(all_pairs) converts the list of dicts into a table. The df.head() call below shows the first five rows, which is a quick way to confirm that the columns match the Alpaca format: instruction, input, and output.

python

import pandas as pd

df = pd.DataFrame(all_pairs)
df.head()

Out[46]:

	instruction	output
0	What is the relationship between data quality ...	Data quality is the primary determinant of a l...
1	How does training on poor data affect computat...	Training on poor data consumes computational r...
2	Why does a language model's personality, skill...	A language model internalizes patterns, facts,...
3	List the key transformations that data enginee...	Data engineering applies a series of transform...
4	Compare the outcomes of training a language mo...	Training a language model with high‑quality da...
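Eyeballing df.head() only covers the first five rows. A minimal sketch of a structural check over every record follows, assuming all_pairs is a list of dicts shaped like the output above. The helper name find_schema_problems and the sample records are hypothetical and not part of the original notebook.

```python
# Hypothetical helper: checks every record against the Alpaca layout.
REQUIRED_KEYS = ("instruction", "input", "output")

def find_schema_problems(pairs):
    """Return (index, message) tuples for records that break the Alpaca layout."""
    problems = []
    for i, pair in enumerate(pairs):
        missing = [k for k in REQUIRED_KEYS if k not in pair]
        if missing:
            problems.append((i, f"missing keys: {missing}"))
            continue
        # "input" may legitimately be empty; instruction and output may not.
        for key in ("instruction", "output"):
            if not isinstance(pair[key], str) or not pair[key].strip():
                problems.append((i, f"empty or non-string {key}"))
    return problems

# Illustrative records, not taken from the real dataset.
sample = [
    {"instruction": "True or false: ...", "input": "", "output": "True."},
    {"instruction": "  ", "input": "", "output": "Some answer."},
    {"instruction": "Define data coverage.", "output": "..."},
]
print(find_schema_problems(sample))
# [(1, 'empty or non-string instruction'), (2, "missing keys: ['input']")]
```

Running the same function on all_pairs before export should return an empty list if every generated pair has the three expected fields.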

Both CSV and JSON exports use a timestamp suffix to avoid overwriting previous runs. JSON is more portable for loading in Hugging Face datasets; CSV opens immediately in any spreadsheet tool for manual review.

python

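from datetime import datetime
from pathlib import Path

def get_datetime():
    """Return the current local time as a sortable, filename-safe string."""
    return datetime.now().strftime("%Y%m%d-%H%M%S")

# Compute the timestamp once so both files from one run share the same suffix.
stamp = get_datetime()
out_dir = Path("../data")
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder on first run

df.to_csv(out_dir / f"synthetic_dataset-{stamp}.csv", index=False)
df.to_json(out_dir / f"synthetic_dataset-{stamp}.json", orient="records", indent=2)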

This pipeline is a starting point, not a finished recipe. A few issues come up in practice:

Chunks that cross section boundaries: PDF conversion doesn't always respect heading structure, especially in documents with complex layouts. Spot-checking topic_wise before Stage 2 catches the worst cases.

Fact redundancy across chunks: the same claim can appear in multiple sections of a long document. Running deduplication on golden_facts before Stage 3 reduces repetition in the training set.

Answer length variance: the 200–300 word target is a soft constraint. Some facts simply don't support a 200-word answer without padding. Filtering generated pairs by word or token count after generation is more reliable than fighting the prompt.
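The last two points can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's own code: golden_facts is treated as a flat list of strings, pairs as Alpaca-style dicts, and whitespace-separated words stand in for tokens. The helper names normalize, dedupe_facts, and filter_by_length are hypothetical.

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def dedupe_facts(facts):
    """Keep the first occurrence of each fact, comparing normalized text."""
    seen = set()
    unique = []
    for fact in facts:
        key = normalize(fact)
        if key and key not in seen:
            seen.add(key)
            unique.append(fact)
    return unique

def filter_by_length(pairs, min_words=200, max_words=300):
    """Keep pairs whose output word count falls inside [min_words, max_words]."""
    return [
        p for p in pairs
        if min_words <= len(p["output"].split()) <= max_words
    ]

# Illustrative data, not taken from the real run.
facts = [
    "MinHashLSH estimates Jaccard similarity.",
    "minhashlsh estimates  jaccard similarity",
    "Bloom filters use fixed memory.",
]
unique_facts = dedupe_facts(facts)

pairs = [
    {"instruction": "q1", "input": "", "output": "word " * 250},
    {"instruction": "q2", "input": "", "output": "too short"},
]
kept = filter_by_length(pairs)
print(len(unique_facts), len(kept))  # 2 1
```

Normalized exact matching only catches facts that differ in case, punctuation, or spacing; paraphrased duplicates would need a fuzzy method. The length bounds are parameters because some answer styles in the dataset, such as true-or-false responses, are intentionally short and would be dropped by the default 200-word floor.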

The output reflects what your documents actually say — not what a general-purpose model guesses they should say. That specificity is the point.

