Multimodal RAG: Building a Smart Document Assistant - 100% Local
Welcome to this comprehensive guide on building a multimodal Retrieval-Augmented Generation (RAG) system! In this tutorial, we'll create an AI assistant that can understand and retrieve information from both text and images in documents. Think of it as giving your AI the ability to "see" and "read" documents simultaneously.
What We're Building
Imagine you have a research paper with complex diagrams, mathematical equations, and detailed explanations. Traditional text-based RAG systems can only search through the written content, but our multimodal system can search across both the text and the visual elements. When you ask about "transformer architecture," it might retrieve both the textual explanation and the relevant architectural diagram.
Tech Stack Overview
We're using some powerful tools to make this happen:
- uv for efficient Python project management
- LangChain to orchestrate our AI workflow
- Qdrant DB for lightning-fast vector similarity search
- BGE-VL for encoding both text and images into searchable vectors
- Ollama for serving large language models locally
Let's dive in and build this step by step!
Setting Up Our Environment
Before we can start building our multimodal RAG system, we need to install all the necessary libraries. This is like gathering all the ingredients before cooking a complex meal. Each library has a specific role:
- langchain and langchain-huggingface: The orchestration framework that will coordinate our AI workflow
- sentence-transformers: For creating text embeddings
- pymupdf4llm: A specialized tool for extracting text and images from PDFs
- langchain-nvidia-ai-endpoints and langchain-openai: For connecting to different AI model providers
- qdrant-client: Our vector database client for storing and searching embeddings
- pymupdf: The core PDF processing library
- langchain-text-splitters: For breaking down long documents into manageable chunks
Let's install everything we need:
%pip install langchain \
langchain-huggingface \
sentence-transformers \
pymupdf4llm \
langchain-nvidia-ai-endpoints \
langchain-openai \
qdrant-client \
pymupdf \
langchain-text-splitters
Importing Our Tools
Now that we have all our libraries installed, let's import the specific tools we'll need for this project. We're bringing in:
- os: For file system operations
- pymupdf.layout: Activates the layout analysis in PyMuPDF for better PDF processing
- pymupdf4llm: Our PDF-to-markdown converter that can extract both text and images
- PIL.Image: For working with images in Python
These imports are like assembling our toolkit before starting the actual work.
import os
import pymupdf.layout # activate PyMuPDF-Layout in pymupdf
import pymupdf4llm
from PIL import Image
Getting Our Sample Document
For this tutorial, we'll use the famous "Attention is All You Need" paper - the groundbreaking work that introduced the Transformer architecture that powers most modern AI systems. This paper is perfect for our multimodal RAG because it contains:
- Complex mathematical equations
- Architectural diagrams
- Detailed explanations of transformer mechanisms
- Visual representations of attention mechanisms
We're downloading it directly from arXiv using wget and saving it with a more readable filename.
!wget https://arxiv.org/pdf/1706.03762.pdf -O attention_is_all_you_need.pdf
Checking Our Download and Setting Up Directories
Let's verify that our PDF downloaded successfully and then create the necessary directory structure for our project. We'll need:
- A data directory to store our processed content
- A data/images subdirectory to hold the extracted images from the PDF
This organization will keep our workspace clean and make it easy to manage our multimodal data.
!ls
!mkdir "data"
!mkdir "data/images"
!ls
Processing the PDF: Extracting Text and Images
This is where the magic starts! We're using pymupdf4llm to convert our PDF into markdown format while simultaneously extracting all the images.
The to_markdown() function will:
- Read through each page of the PDF
- Extract the text content and convert it to clean markdown
- Identify and save all images to our data/images/ directory
- Maintain the document structure and formatting
After processing, we'll have:
- md_text: the complete text content in markdown format
- images: a list of file paths to all extracted images
This multimodal extraction is crucial because it preserves both the textual information and the visual context that might be essential for understanding complex topics like transformer architectures.
filepath = "attention_is_all_you_need.pdf"
image_save_path = "data/images/"
md_text = pymupdf4llm.to_markdown(filepath, image_path=image_save_path, write_images=True)
images = [os.path.join(image_save_path, image_path) for image_path in os.listdir(image_save_path)]
Inspecting Our Extracted Content
Let's take a look at what we've extracted from the PDF. These display cells will show us:
- The markdown text content - this should contain all the readable text from the paper
- The list of image file paths - these are the figures, diagrams, and charts extracted from the document
- A directory listing to confirm our images were saved properly
This is a good checkpoint to verify that our PDF processing worked correctly before moving on to the embedding phase.
md_text
images
!ls "data/images/"
Setting Up Our Multimodal Embedding Model
Now we need to bring in the star of our show: the BGE-VL (Visual-Language) model from BAAI. This is a specialized embedding model that can encode both text and images into the same vector space, making it perfect for multimodal search.
Before we can download and use the model, we need to authenticate with HuggingFace, which hosts these models. We'll check if we have an HF_TOKEN environment variable set, and if not, we'll authenticate manually.
The BGE-VL model is particularly powerful because it understands the relationship between text and images, so when you search for "transformer architecture," it can retrieve both the textual description and the architectural diagram that best matches your query.
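As a side note, if you'd rather authenticate from Python than with the CLI used in the next cell, huggingface_hub (pulled in by sentence-transformers) provides a login helper. A minimal sketch, assuming your token is stored in the HF_TOKEN environment variable:

import os
from huggingface_hub import login  # hedged alternative to the `hf auth login` CLI
login(token=os.getenv("HF_TOKEN"))  # assumes HF_TOKEN is set in the environment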
from sentence_transformers import SentenceTransformer
os.getenv("HF_TOKEN")
!hf auth login --token $HF_TOKEN
!hf download BAAI/BGE-VL-base --local-dir ./BGE-VL-base
!ls
emb_model = SentenceTransformer(model_name_or_path="./BGE-VL-base")
Loading and Testing Our Embedding Model
With authentication complete, the cell above downloaded and loaded the BGE-VL model. This model will be our bridge between the textual and visual worlds.
After loading the model, we'll run some quick tests to make sure everything is working:
- Encode a simple text string ("hello") to verify text embedding
- Encode one of our extracted images to verify image embedding
These tests will confirm that our model can handle both modalities and produce consistent vector representations. The shape of these vectors will tell us the dimensionality of our embedding space.
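Beyond checking shapes, you can sanity-check the shared vector space by comparing a text vector against an image vector directly. This is a minimal sketch, not part of the original notebook; it assumes the model is loaded and reuses the extracted figure path from the next cell:

import numpy as np
# Cosine similarity between a text query and an extracted figure.
t = emb_model.encode("transformer architecture diagram")
v = emb_model.encode("data/images/1706.03762v7.pdf-0003-00.png")
print(np.dot(t, v) / (np.linalg.norm(t) * np.linalg.norm(v)))

A score noticeably higher than the one you'd get for an unrelated caption suggests the two modalities really do land in one comparable space.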
emb_model.encode("hello").shapeemb_model.encode("data/images/1706.03762v7.pdf-0003-00.png").shapePreparing Our Text for Embedding
Before we can embed our text, we need to break it down into manageable chunks. The entire "Attention is All You Need" paper is quite long, and embedding it as one massive string wouldn't be very useful for search.
We're using LangChain's RecursiveCharacterTextSplitter which intelligently splits the text at natural boundaries (paragraphs, sentences) while keeping related content together. This ensures that when we search for information, we get coherent, meaningful chunks rather than random text fragments.
Each chunk will become a searchable document in our vector database.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
documents = [Document(page_content=md_text, metadata={"filename": filepath})]
text_splitter = RecursiveCharacterTextSplitter()
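# Note: the splitter above uses library defaults. If you want explicit control,
# you can pass sizes yourself; the numbers below are illustrative assumptions,
# not values from this tutorial:
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)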
documents = text_splitter.split_documents(documents)
print(f"Total documents: {len(documents)}")import tqdm
from uuid import uuid4Generating Text Embeddings
Now comes the computationally intensive part: converting all our text chunks into vector embeddings. This process transforms human-readable text into mathematical vectors that capture semantic meaning.
For each document chunk, we're creating a dictionary containing:
- A unique ID (using UUID)
- The original text content
- The vector embedding produced by our BGE-VL model
We're using tqdm to show a progress bar since this can take a while depending on the document length and computational resources. Each embedding operation requires running the text through the transformer model, which is why we want to do this once and store the results.
# Embedding text
txt_embeddings = []
for doc in tqdm.tqdm(documents, desc="Generating embeddings for documents..."):
try:
txt_embeddings.append(
{
"id": str(uuid4()),
"content": doc.page_content,
"vector": emb_model.encode(doc.page_content),
}
)
except Exception as e:
print(f"Error: {e}")
txt_embeddings[0]
txt_embeddings[0]["vector"].shape
Generating Image Embeddings
Just like we did with text, we now need to create embeddings for all the images we extracted from the PDF. This is where the multimodal aspect really shines - we're treating images with the same importance as text.
For each image file:
- We load it using PIL (Python Imaging Library)
- Pass it through our BGE-VL model to get a vector representation
- Store the image path and its embedding
The beauty of BGE-VL is that it produces embeddings in the same vector space for both text and images, meaning we can search across both modalities simultaneously. A search for "attention mechanism" could return both the textual explanation and the visual diagram showing how attention works.
# Embedding images
img_embeddings = []
for img_path in tqdm.tqdm(images, desc="Generating embeddings for images..."):
try:
img_embeddings.append(
{
"image_path": img_path,
"vector": emb_model.encode(Image.open(img_path)),
}
)
except Exception as e:
print(f"Error: {e}")
print(f"Number of images embedded: {len(img_embeddings)}")
img_embeddings[0]["vector"]img_embeddings[0]["vector"].shapeimg_embeddings[0].keys()Setting Up Our Vector Database
With all our embeddings generated, we need a way to store and search through them efficiently. This is where Qdrant comes in - it's a high-performance vector database designed specifically for similarity search.
We're creating an in-memory instance for this demo (using :memory:), but in production you'd typically use a persistent Qdrant server. The collection will store both our text and image embeddings in a single vector space, allowing us to search across all content types simultaneously.
The collection is configured with:
- A single dense vector space
- Cosine similarity as our distance metric (good for normalized embeddings)
- The same dimensionality as our BGE-VL embeddings
from qdrant_client import QdrantClient, models
# docker run -p 6333:6333 qdrant/qdrant
client = QdrantClient(":memory:")COLLECTION_NAME = "mm_rag_collection"
if not client.collection_exists(COLLECTION_NAME):
client.create_collection(
collection_name=COLLECTION_NAME,
# Creating one vector space having same dimension for text and image.
vectors_config={
"dense": models.VectorParams(size=len(img_embeddings[0]["vector"]), distance=models.Distance.COSINE)
}
    )
Populating Our Vector Database
Now we upload all our embeddings to Qdrant. We're doing this in two separate operations:
- Image embeddings: Each image gets stored with its file path and vector representation
- Text embeddings: Each text chunk gets stored with its content and vector
Each point in our database will have:
- A unique ID
- The vector embedding
- Payload data (either image path or text content)
This unified storage allows us to search across both text and images in a single query, which is the key innovation of multimodal RAG.
# Upload image
client.upload_points(
collection_name=COLLECTION_NAME,
points=[
models.PointStruct(
            id=str(uuid4()),
vector={
"dense": emb.pop("vector"),
},
payload=emb
)
for emb in img_embeddings
]
)
# Upload text
client.upload_points(
collection_name=COLLECTION_NAME,
points=[
models.PointStruct(
            id=str(uuid4()),
vector={
"dense": emb.pop("vector"),
},
payload=emb
)
for emb in txt_embeddings
]
)
Testing Our Multimodal Search
Let's test our system! We're searching for "Transformer architecture" by encoding this query text into a vector and finding the most similar content in our database.
This demonstrates the core functionality: when a user asks about transformer architecture, our system can retrieve the most relevant information, whether it's text explaining the concept or diagrams showing the architecture.
The result will show us what's most similar to our query in the vector space.
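Because text and images share one vector space, the query itself could just as well be an image. A hedged variant, reusing one of the extracted figures as the query (names here are illustrative):

img_query = emb_model.encode(Image.open("data/images/1706.03762v7.pdf-0003-00.png"))
img_points = client.query_points(
    collection_name=COLLECTION_NAME,
    query=img_query,
    using="dense",
    limit=1,
).points

For now, let's run the text version: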
query = emb_model.encode("Transformer architecture")
points = client.query_points(
collection_name=COLLECTION_NAME,
query=query,
using="dense",
limit=1
).points
len(points)
print(points[0].payload["content"])
Setting Up Our AI Chat Model
Now that we have our retrieval system working, we need an AI model that can understand user questions and use our retrieval tool to get relevant context. We're using Ollama, which allows us to run large language models locally.
Ollama is like having your own AI server: it downloads and runs open chat models entirely on your own hardware. We're installing it and pulling the Qwen3.5 9B model, an open-weight chat model that performs well on technical questions.
This local setup gives us privacy (no data sent to external APIs) and potentially better performance for our specific use case.
!sudo apt update
!apt-get install zstd pciutils -y
!curl -fsSL https://ollama.com/install.sh | sh
import threading
import subprocess
import time
def run_ollama_serve():
subprocess.Popen(["ollama", "serve"])
thread = threading.Thread(target=run_ollama_serve)
thread.start()
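# Hedged sanity check (run after the sleep below): Ollama's HTTP API exposes a
# version endpoint you can poll to confirm the server is up.
# !curl -s http://127.0.0.1:11434/api/version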
time.sleep(5)
# !ollama pull deepseek-r1:32b
# !ollama pull deepseek-r1:8b
!ollama pull qwen3.5:9b
!pip install -q langchain-ollama
Creating Our Retrieval Tool
Now we need to create a tool that our AI agent can use to search through our multimodal content. This retrieve_multimodal_context function takes a user query, encodes it into a vector, and retrieves the most relevant text and image content.
The function returns structured data that the AI can use to provide informed answers. When the AI model doesn't know something about transformers, it can call this tool to get accurate, up-to-date information from our document.
This is the bridge between our vector search system and the conversational AI.
# Retrieval tool
def retrieve_multimodal_context(query, limit=3):
"""Fetch context if required"""
try:
points = client.query_points(
collection_name=COLLECTION_NAME,
query=emb_model.encode(query),
using="dense",
limit=limit
).points
print(f"Fetched: {len(points)}")
if points:
payloads = [item.payload for item in points]
return payloads
return f"No context available for query: {query}"
except Exception as e:
return f"Error: {e}"print(retrieve_multimodal_context("transformer"))Building Our Intelligent Agent
Here's where everything comes together! We're creating an AI agent using LangChain that has access to our multimodal retrieval tool. The agent is instructed that when users ask about transformers, it MUST use the tool to retrieve context.
This agent architecture allows the AI to:
- Understand user questions
- Recognize when it needs external knowledge
- Call our retrieval function to get relevant information
- Provide informed answers based on the actual document content
The system prompt ensures the agent knows when and how to use the retrieval capability.
from os import getenv
from langchain.agents import create_agent
from langchain.chat_models import init_chat_model
llm = init_chat_model(
# model="ollama:deepseek-r1:8b",
model="ollama:qwen3.5:9b",
base_url="http://127.0.0.1:11434",
)
llm.invoke("hi")
agent = create_agent(
model=llm,
tools=[retrieve_multimodal_context],
system_prompt="You are an helpful assistant. If you asked questions about transformers, you must use the tool to retrieve context."
)
agent.invoke(
{
"messages": [
{
"role": "user",
"content": "What do you know about transformer architecure",
}
]
}
)
Making It Interactive: Streaming Responses
The final piece of our multimodal RAG system is the streaming response functionality. This allows users to see the AI's response in real-time as it's generated, creating a more engaging conversational experience.
The stream_response function handles the complex streaming logic:
- It monitors the agent's thought process (tool calls)
- It captures the actual text responses
- It provides real-time output to the user
When you ask about transformer architecture, you'll see the AI "think" (potentially calling the retrieval tool) and then provide a streaming response based on the retrieved context.
This creates a seamless experience where the AI appears to have deep knowledge of the document while actually retrieving information on-demand.
def stream_response(messages):
"""Stream AI response"""
response = ""
for chunk, _ in agent.stream(
messages,
stream_mode="messages",
config={"configurable": {"thread_id": "sample-123"}},
):
if not chunk.content_blocks:
continue
# Message chunk
last_message = chunk.content_blocks[0]
# print(f"Chunk: {last_message}")
if last_message.get("type") == "tool_call_chunk":
print("Reasoning", end="\r")
        # When the agent starts writing, the last message will have type: text
if last_message.get("type") == "text":
chunk_content = last_message.get("text")
if chunk_content and "tool_call_output" not in chunk_content:
response += chunk_content
print(chunk_content, end="")
    # print(response)
Demo: Testing Our Complete System
Now let's put it all together! We're creating a sample conversation and streaming the response to demonstrate our complete multimodal RAG system in action.
When you run this, you'll see:
- The AI processing your question about transformer architecture
- Potentially calling the retrieval tool to get relevant context
- Streaming back a response that's informed by the actual document content
This demonstrates the full power of multimodal RAG - an AI that can search across both text and images to provide comprehensive, accurate answers about complex topics.
messages = {
    "messages": [
        {
            "role": "user",
            "content": "What do you know about transformer architecture",
        }
    ]
}
stream_response(messages)
What We've Built
Congratulations! You've just created a sophisticated multimodal RAG system that can:
- Extract content from PDFs while preserving both text and images
- Generate embeddings for both modalities in a unified vector space
- Search efficiently across text and images simultaneously
- Provide intelligent answers using a local AI model with real-time retrieval
This system could be extended to handle multiple documents, different file types, or even web content. The possibilities are endless!
Next Steps
To make this production-ready, you might want to:
- Add persistent vector storage (replace in-memory Qdrant)
- Implement document preprocessing pipelines
- Add support for more file formats
- Create a web interface for easier interaction
- Add conversation memory for multi-turn dialogues (see the sketch below)
The foundation you've built here is solid and can scale to handle much more complex use cases.
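For instance, conversation memory mostly amounts to attaching a checkpointer so that the thread_id already passed in stream_response actually persists state across turns. A minimal sketch, assuming langgraph's InMemorySaver and that create_agent accepts a checkpointer argument (verify both against your installed versions):

from langgraph.checkpoint.memory import InMemorySaver

agent = create_agent(
    model=llm,
    tools=[retrieve_multimodal_context],
    system_prompt="You are a helpful assistant. If you are asked questions about transformers, you must use the tool to retrieve context.",
    checkpointer=InMemorySaver(),  # hypothetical extension: per-thread memory
)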