Multimodal RAG: Building a Smart Document Assistant - 100% Local
Welcome to this comprehensive guide on building a multimodal Retrieval-Augmented Generation (RAG) system! In this tutorial, we'll create an AI assistant that can understand and retrieve information from both text and images in documents. Think of it as giving your AI the ability to "see" and "read" documents simultaneously.
What We're Building
Imagine you have a research paper with complex diagrams, mathematical equations, and detailed explanations. Traditional text-based RAG systems can only search through the written content, but our multimodal system can search across both the text and the visual elements. When you ask about "transformer architecture," it might retrieve both the textual explanation and the relevant architectural diagram.
Tech Stack Overview
We're using some powerful tools to make this happen:
- uv for efficient Python project management
- LangChain to orchestrate our AI workflow
- Qdrant DB for lightning-fast vector similarity search
- BGE-VL for encoding both text and images into searchable vectors
- Ollama for serving large language models locally
Let's dive in and build this step by step!
Setting Up Our Environment
Before we can start building our multimodal RAG system, we need to install all the necessary libraries. This is like gathering all the ingredients before cooking a complex meal. Each library has a specific role:
- langchain and langchain-huggingface: The orchestration framework that will coordinate our AI workflow
- sentence-transformers: For creating text embeddings
- pymupdf4llm: A specialized tool for extracting text and images from PDFs
- langchain-nvidia-ai-endpoints and langchain-openai: For connecting to different AI model providers
- qdrant-client: Our vector database client for storing and searching embeddings
- pymupdf: The core PDF processing library
- langchain-text-splitters: For breaking down long documents into manageable chunks
Let's install everything we need:
%pip install langchain \
langchain-huggingface \
sentence-transformers \
pymupdf4llm \
langchain-nvidia-ai-endpoints \
langchain-openai \
qdrant-client \
pymupdf \
langchain-text-splitters
Importing Our Tools
Now that we have all our libraries installed, let's import the specific tools we'll need for this project. We're bringing in:
- os: For file system operations
- pymupdf.layout: Activates the layout analysis in PyMuPDF for better PDF processing
- pymupdf4llm: Our PDF-to-markdown converter that can extract both text and images
- PIL.Image: For working with images in Python
These imports are like assembling our toolkit before starting the actual work.
import os
import pymupdf.layout # activate PyMuPDF-Layout in pymupdf
import pymupdf4llm
from PIL import Image
Getting Our Sample Document
For this tutorial, we'll use the famous "Attention is All You Need" paper - the groundbreaking work that introduced the Transformer architecture that powers most modern AI systems. This paper is perfect for our multimodal RAG because it contains:
- Complex mathematical equations
- Architectural diagrams
- Detailed explanations of transformer mechanisms
- Visual representations of attention mechanisms
We're downloading it directly from arXiv using wget and saving it with a more readable filename.
!wget https://arxiv.org/pdf/1706.03762.pdf -O attention_is_all_you_need.pdf
Checking Our Download and Setting Up Directories
Let's verify that our PDF downloaded successfully and then create the necessary directory structure for our project. We'll need:
- A data directory to store our processed content
- A data/images subdirectory to hold the extracted images from the PDF
This organization will keep our workspace clean and make it easy to manage our multimodal data.
!ls
!mkdir "data"
!mkdir "data/images"
!ls
Processing the PDF: Extracting Text and Images
This is where the magic starts! We're using pymupdf4llm to convert our PDF into markdown format while simultaneously extracting all the images.
The to_markdown() function will:
- Read through each page of the PDF
- Extract the text content and convert it to clean markdown
- Identify and save all images to our data/images/ directory
- Maintain the document structure and formatting
After processing, we'll have:
- md_text: the complete text content in markdown format
- images: a list of file paths to all extracted images
This multimodal extraction is crucial because it preserves both the textual information and the visual context that might be essential for understanding complex topics like transformer architectures.
filepath = "attention_is_all_you_need.pdf"
image_save_path = "data/images/"
md_text = pymupdf4llm.to_markdown(filepath, image_path=image_save_path, write_images=True)
images = [os.path.join(image_save_path, image_path) for image_path in os.listdir(image_save_path)]
Inspecting Our Extracted Content
Let's take a look at what we've extracted from the PDF. These display cells will show us:
- The markdown text content - this should contain all the readable text from the paper
- The list of image file paths - these are the figures, diagrams, and charts extracted from the document
- A directory listing to confirm our images were saved properly
This is a good checkpoint to verify that our PDF processing worked correctly before moving on to the embedding phase.
md_text
images
!ls "data/images/"
Setting Up Our Multimodal Embedding Model
Now we need to bring in the star of our show: the BGE-VL (Visual-Language) model from BAAI. This is a specialized embedding model that can encode both text and images into the same vector space, making it perfect for multimodal search.
Before we can download and use the model, we need to authenticate with HuggingFace, which hosts these models. We'll check if we have an HF_TOKEN environment variable set, and if not, we'll authenticate manually.
The BGE-VL model is particularly powerful because it understands the relationship between text and images, so when you search for "transformer architecture," it can retrieve both the textual description and the architectural diagram that best matches your query.
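As a side note, if you'd rather authenticate from Python than with the CLI used in the next cell, huggingface_hub (pulled in by sentence-transformers) provides a login helper. A minimal sketch, assuming your token is stored in the HF_TOKEN environment variable:

import os
from huggingface_hub import login  # hedged alternative to the `hf auth login` CLI
login(token=os.getenv("HF_TOKEN"))  # assumes HF_TOKEN is set in the environment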
from sentence_transformers import SentenceTransformer
os.getenv("HF_TOKEN")
!hf auth login --token $HF_TOKEN
!hf download BAAI/BGE-VL-base --local-dir ./BGE-VL-base
!ls
emb_model = SentenceTransformer(model_name_or_path="./BGE-VL-base")
Loading and Testing Our Embedding Model
With authentication complete, the cell above downloaded and loaded the BGE-VL model. This model will be our bridge between the textual and visual worlds.
After loading the model, we'll run some quick tests to make sure everything is working:
- Encode a simple text string ("hello") to verify text embedding
- Encode one of our extracted images to verify image embedding
These tests will confirm that our model can handle both modalities and produce consistent vector representations. The shape of these vectors will tell us the dimensionality of our embedding space.
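Beyond checking shapes, you can sanity-check the shared vector space by comparing a text vector against an image vector directly. This is a minimal sketch, not part of the original notebook; it assumes the model is loaded and reuses the extracted figure path from the next cell:

import numpy as np
# Cosine similarity between a text query and an extracted figure.
t = emb_model.encode("transformer architecture diagram")
v = emb_model.encode("data/images/1706.03762v7.pdf-0003-00.png")
print(np.dot(t, v) / (np.linalg.norm(t) * np.linalg.norm(v)))

A score noticeably higher than the one you'd get for an unrelated caption suggests the two modalities really do land in one comparable space.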
emb_model.encode("hello").shapeemb_model.encode("data/images/1706.03762v7.pdf-0003-00.png").shapePreparing Our Text for Embedding
Before we can embed our text, we need to break it down into manageable chunks. The entire "Attention is All You Need" paper is quite long, and embedding it as one massive string wouldn't be very useful for search.
We're using LangChain's RecursiveCharacterTextSplitter which intelligently splits the text at natural boundaries (paragraphs, sentences) while keeping related content together. This ensures that when we search for information, we get coherent, meaningful chunks rather than random text fragments.
Each chunk will become a searchable document in our vector database.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
documents = [Document(page_content=md_text, metadata={"filename": filepath})]
text_splitter = RecursiveCharacterTextSplitter()
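# Note: the splitter above uses library defaults. If you want explicit control,
# you can pass sizes yourself; the numbers below are illustrative assumptions,
# not values from this tutorial:
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)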
documents = text_splitter.split_documents(documents)
print(f"Total documents: {len(documents)}")import tqdm
from uuid import uuid4Generating Text Embeddings
Now comes the computationally intensive part: converting all our text chunks into vector embeddings. This process transforms human-readable text into mathematical vectors that capture semantic meaning.
For each document chunk, we're creating a dictionary containing:
- A unique ID (using UUID)
- The original text content
- The vector embedding produced by our BGE-VL model
We're using tqdm to show a progress bar since this can take a while depending on the document length and computational resources. Each embedding operation requires running the text through the transformer model, which is why we want to do this once and store the results.
# Embedding text
txt_embeddings = []
for doc in tqdm.tqdm(documents, desc="Generating embeddings for documents..."):
try:
txt_embeddings.append(
{
"id": str(uuid4()),
"content": doc.page_content,
"vector": emb_model.encode(doc.page_content),
}
)
except Exception as e:
print(f"Error: {e}")
txt_embeddings[0]
txt_embeddings[0]["vector"].shape
Generating Image Embeddings
Just like we did with text, we now need to create embeddings for all the images we extracted from the PDF. This is where the multimodal aspect really shines - we're treating images with the same importance as text.
For each image file:
- We load it using PIL (Python Imaging Library)
- Pass it through our BGE-VL model to get a vector representation
- Store the image path and its embedding
The beauty of BGE-VL is that it produces embeddings in the same vector space for both text and images, meaning we can search across both modalities simultaneously. A search for "attention mechanism" could return both the textual explanation and the visual diagram showing how attention works.
# Embedding images
img_embeddings = []
for img_path in tqdm.tqdm(images, desc="Generating embeddings for images..."):
try:
img_embeddings.append(
{
"image_path": img_path,
"vector": emb_model.encode(Image.open(img_path)),
}
)
except Exception as e:
print(f"Error: {e}")
print(f"Number of images embedded: {len(img_embeddings)}")
img_embeddings[0]["vector"]img_embeddings[0]["vector"].shapeimg_embeddings[0].keys()Setting Up Our Vector Database
With all our embeddings generated, we need a way to store and search through them efficiently. This is where Qdrant comes in - it's a high-performance vector database designed specifically for similarity search.
We're creating an in-memory instance for this demo (using :memory:), but in production you'd typically use a persistent Qdrant server. The collection will store both our text and image embeddings in a single vector space, allowing us to search across all content types simultaneously.
The collection is configured with:
- A single dense vector space
- Cosine similarity as our distance metric (good for normalized embeddings)
- The same dimensionality as our BGE-VL embeddings
from qdrant_client import QdrantClient, models
# docker run -p 6333:6333 qdrant/qdrant
client = QdrantClient(":memory:")COLLECTION_NAME = "mm_rag_collection"
if not client.collection_exists(COLLECTION_NAME):
client.create_collection(
collection_name=COLLECTION_NAME,
# Creating one vector space having same dimension for text and image.
vectors_config={
"dense": models.VectorParams(size=len(img_embeddings[0]["vector"]), distance=models.Distance.COSINE)
}
    )
Populating Our Vector Database
Now we upload all our embeddings to Qdrant. We're doing this in two separate operations:
- Image embeddings: Each image gets stored with its file path and vector representation
- Text embeddings: Each text chunk gets stored with its content and vector
Each point in our database will have:
- A unique ID
- The vector embedding
- Payload data (either image path or text content)
This unified storage allows us to search across both text and images in a single query, which is the key innovation of multimodal RAG.
# Upload image
client.upload_points(
collection_name=COLLECTION_NAME,
points=[
models.PointStruct(
            id=str(uuid4()),
vector={
"dense": emb.pop("vector"),
},
payload=emb
)
for emb in img_embeddings
]
)
# Upload text
client.upload_points(
collection_name=COLLECTION_NAME,
points=[
models.PointStruct(
            id=str(uuid4()),
vector={
"dense": emb.pop("vector"),
},
payload=emb
)
for emb in txt_embeddings
]
)
Testing Our Multimodal Search
Let's test our system! We're searching for "Transformer architecture" by encoding this query text into a vector and finding the most similar content in our database.
This demonstrates the core functionality: when a user asks about transformer architecture, our system can retrieve the most relevant information, whether it's text explaining the concept or diagrams showing the architecture.
The result will show us what's most similar to our query in the vector space.
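Because text and images share one vector space, the query itself could just as well be an image. A hedged variant, reusing one of the extracted figures as the query (names here are illustrative):

img_query = emb_model.encode(Image.open("data/images/1706.03762v7.pdf-0003-00.png"))
img_points = client.query_points(
    collection_name=COLLECTION_NAME,
    query=img_query,
    using="dense",
    limit=1,
).points

For now, let's run the text version: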
query = emb_model.encode("Transformer architecture")
points = client.query_points(
collection_name=COLLECTION_NAME,
query=query,
using="dense",
limit=1
).points
len(points)
print(points[0].payload["content"])
Setting Up Our AI Chat Model
Now that we have our retrieval system working, we need an AI model that can understand user questions and use our retrieval tool to get relevant context. We're using Ollama, which allows us to run large language models locally.
Ollama is like having your own AI server: it downloads and runs open chat models entirely on your own hardware. We're installing it and pulling the Qwen3.5 9B model, an open-weight chat model that performs well on technical questions.
This local setup gives us privacy (no data sent to external APIs) and potentially better performance for our specific use case.
!sudo apt update
!apt-get install zstd pciutils -y
!curl -fsSL https://ollama.com/install.sh | sh
import threading
import subprocess
import time
def run_ollama_serve():
subprocess.Popen(["ollama", "serve"])
thread = threading.Thread(target=run_ollama_serve)
thread.start()
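# Hedged sanity check (run after the sleep below): Ollama's HTTP API exposes a
# version endpoint you can poll to confirm the server is up.
# !curl -s http://127.0.0.1:11434/api/version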
time.sleep(5)
# !ollama pull deepseek-r1:32b
# !ollama pull deepseek-r1:8b
!ollama pull qwen3.5:9b
!pip install -q langchain-ollama
Creating Our Retrieval Tool
Now we need to create a tool that our AI agent can use to search through our multimodal content. This retrieve_multimodal_context function takes a user query, encodes it into a vector, and retrieves the most relevant text and image content.
The function returns structured data that the AI can use to provide informed answers. When the AI model doesn't know something about transformers, it can call this tool to get accurate, up-to-date information from our document.
This is the bridge between our vector search system and the conversational AI.
# Retrieval tool
def retrieve_multimodal_context(query, limit=3):
"""Fetch context if required"""
try:
points = client.query_points(
collection_name=COLLECTION_NAME,
query=emb_model.encode(query),
using="dense",
limit=limit
).points
print(f"Fetched: {len(points)}")
if points:
payloads = [item.payload for item in points]
return payloads
return f"No context available for query: {query}"
except Exception as e:
return f"Error: {e}"print(retrieve_multimodal_context("transformer"))Building Our Intelligent Agent
Here's where everything comes together! We're creating an AI agent using LangChain that has access to our multimodal retrieval tool. The agent is instructed that when users ask about transformers, it MUST use the tool to retrieve context.
This agent architecture allows the AI to:
- Understand user questions
- Recognize when it needs external knowledge
- Call our retrieval function to get relevant information
- Provide informed answers based on the actual document content
The system prompt ensures the agent knows when and how to use the retrieval capability.
from os import getenv
from langchain.agents import create_agent
from langchain.chat_models import init_chat_model
llm = init_chat_model(
# model="ollama:deepseek-r1:8b",
model="ollama:qwen3.5:9b",
base_url="http://127.0.0.1:11434",
)
llm.invoke("hi")
agent = create_agent(
model=llm,
tools=[retrieve_multimodal_context],
system_prompt="You are an helpful assistant. If you asked questions about transformers, you must use the tool to retrieve context."
)
agent.invoke(
{
"messages": [
{
"role": "user",
"content": "What do you know about transformer architecure",
}
]
}
)
Making It Interactive: Streaming Responses
The final piece of our multimodal RAG system is the streaming response functionality. This allows users to see the AI's response in real-time as it's generated, creating a more engaging conversational experience.
The stream_response function handles the complex streaming logic:
- It monitors the agent's thought process (tool calls)
- It captures the actual text responses
- It provides real-time output to the user
When you ask about transformer architecture, you'll see the AI "think" (potentially calling the retrieval tool) and then provide a streaming response based on the retrieved context.
This creates a seamless experience where the AI appears to have deep knowledge of the document while actually retrieving information on-demand.
def stream_response(messages):
"""Stream AI response"""
response = ""
for chunk, _ in agent.stream(
messages,
stream_mode="messages",
config={"configurable": {"thread_id": "sample-123"}},
):
if not chunk.content_blocks:
continue
# Message chunk
last_message = chunk.content_blocks[0]
# print(f"Chunk: {last_message}")
if last_message.get("type") == "tool_call_chunk":
print("Reasoning", end="\r")
        # When the agent starts writing, the last message will have type: text
if last_message.get("type") == "text":
chunk_content = last_message.get("text")
if chunk_content and "tool_call_output" not in chunk_content:
response += chunk_content
print(chunk_content, end="")
    # print(response)
Demo: Testing Our Complete System
Now let's put it all together! We're creating a sample conversation and streaming the response to demonstrate our complete multimodal RAG system in action.
When you run this, you'll see:
- The AI processing your question about transformer architecture
- Potentially calling the retrieval tool to get relevant context
- Streaming back a response that's informed by the actual document content
This demonstrates the full power of multimodal RAG - an AI that can search across both text and images to provide comprehensive, accurate answers about complex topics.
messages = {
    "messages": [
        {
            "role": "user",
            "content": "What do you know about transformer architecture",
        }
    ]
}
stream_response(messages)
What We've Built
Congratulations! You've just created a sophisticated multimodal RAG system that can:
- Extract content from PDFs while preserving both text and images
- Generate embeddings for both modalities in a unified vector space
- Search efficiently across text and images simultaneously
- Provide intelligent answers using a local AI model with real-time retrieval
This system could be extended to handle multiple documents, different file types, or even web content. The possibilities are endless!
Next Steps
To make this production-ready, you might want to:
- Add persistent vector storage (replace in-memory Qdrant)
- Implement document preprocessing pipelines
- Add support for more file formats
- Create a web interface for easier interaction
- Add conversation memory for multi-turn dialogues (see the sketch below)
The foundation you've built here is solid and can scale to handle much more complex use cases.
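For instance, conversation memory mostly amounts to attaching a checkpointer so that the thread_id already passed in stream_response actually persists state across turns. A minimal sketch, assuming langgraph's InMemorySaver and that create_agent accepts a checkpointer argument (verify both against your installed versions):

from langgraph.checkpoint.memory import InMemorySaver

agent = create_agent(
    model=llm,
    tools=[retrieve_multimodal_context],
    system_prompt="You are a helpful assistant. If you are asked questions about transformers, you must use the tool to retrieve context.",
    checkpointer=InMemorySaver(),  # hypothetical extension: per-thread memory
)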