Tutorial
14 min read

How to Build a RAG Knowledge Base from PDF Documents (Step-by-Step)

Complete walkthrough from raw PDFs to a working RAG system. Learn extraction, chunking, embedding, and retrieval for building production-grade knowledge bases.

BlazeDocs Team


rag, knowledge base, tutorial, pdf extraction, vector database, embeddings

Retrieval-augmented generation turns your PDF documents into an AI-powered knowledge base that answers questions with cited sources. But most RAG tutorials skip the hardest part: getting clean text out of PDFs in the first place. This guide walks through every step — from raw PDF files to a working RAG system — with practical code examples and the decisions that matter at each stage.

By the end, you'll have a complete mental model of the RAG pipeline and enough code to build a working prototype over a weekend. We'll use Python throughout, with BlazeDocs for PDF extraction and open-source tools for everything else.


RAG Architecture Overview

A RAG knowledge base has five components: document extraction, chunking, embedding, storage, and retrieval. Each step transforms your data from one form to another, and mistakes at early stages compound through the entire pipeline. Here's the flow:

PDFs → [Extraction] → Markdown → [Chunking] → Text Chunks 
→ [Embedding] → Vectors → [Storage] → Vector DB → [Retrieval] → Context → [LLM] → Answer

Most teams spend 80% of their optimization time on the last two steps (retrieval and LLM prompting) when the biggest accuracy gains come from the first two (extraction and chunking). Clean input data is the highest-leverage improvement you can make.
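To make the flow concrete, here is a minimal end-to-end sketch with stub components standing in for the real tools introduced below. Every name and the toy bag-of-words "embedding" are illustrative only; the point is how the five stages hand data to each other.

```python
# Stub pipeline: each function stands in for one stage of the diagram above.

VOCAB = ["refund", "enterprise", "pricing", "tier"]

def extract(pdf_name: str) -> str:
    # Stand-in for PDF -> Markdown extraction.
    return ("## Refunds\nEnterprise customers get a 30-day refund window.\n"
            "## Pricing\nThe Pro tier costs $49 per month.")

def chunk(markdown: str) -> list[str]:
    # Stand-in for heading-based chunking.
    return [s.strip() for s in markdown.split("## ") if s.strip()]

def embed(text: str) -> list[float]:
    # Toy bag-of-words "embedding"; a real system calls an embedding model here.
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def retrieve(query: str, store: list[tuple[list[float], str]], top_k: int = 1) -> list[str]:
    # Rank stored chunks by dot-product similarity to the query vector.
    q = embed(query)
    ranked = sorted(store, key=lambda item: -sum(a * b for a, b in zip(q, item[0])))
    return [text for _, text in ranked[:top_k]]

# Extraction -> chunks -> vectors, then answer a query against the store.
store = [(embed(c), c) for c in chunk(extract("policy.pdf"))]
print(retrieve("What is the refund policy?", store)[0])
```

The real pipeline swaps each stub for a production component, but the data flow stays exactly this shape.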


Step 1: Extract Text from PDFs (The Foundation)

PDF extraction is the most important step in your RAG pipeline because every downstream component depends on the quality of the extracted text. Bad extraction produces bad chunks, which produce bad embeddings, which produce bad retrieval results. No amount of prompt engineering fixes fundamentally broken source data.

Why Markdown Output Matters for RAG

You want your PDF extraction tool to output Markdown specifically (not plain text, not HTML) because Markdown headings give you natural chunk boundaries. A document with proper ## and ### markers can be split into semantic sections that each cover a complete topic. Plain text extraction loses these boundaries, forcing you to use arbitrary character-count splitting that frequently cuts sentences in half or merges unrelated paragraphs.

Converting PDFs with BlazeDocs

BlazeDocs converts PDFs to structured Markdown while preserving headings, tables, lists, and reading order. For building a RAG knowledge base, use the API to batch-convert your document library:

import requests
import os

BLAZEDOCS_API_KEY = os.environ["BLAZEDOCS_API_KEY"]

def convert_pdf_to_markdown(pdf_path: str) -> str:
    """Convert a single PDF to Markdown using BlazeDocs API."""
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://api.blazedocs.io/v1/convert",
            headers={"Authorization": f"Bearer {BLAZEDOCS_API_KEY}"},
            files={"file": f}
        )
    response.raise_for_status()
    return response.text

# Batch convert all PDFs in a directory
pdf_dir = "./source_pdfs"
output_dir = "./markdown_output"
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        markdown = convert_pdf_to_markdown(os.path.join(pdf_dir, filename))
        output_path = os.path.join(output_dir, filename.replace(".pdf", ".md"))
        with open(output_path, "w") as f:
            f.write(markdown)
        print(f"Converted: {filename}")

Step 2: Chunk Documents Intelligently

Chunking is the process of splitting documents into smaller pieces that fit into embedding model context windows and provide focused, retrievable units of information. The goal is chunks that are large enough to be self-contained but small enough to be specific. Target 200-500 tokens per chunk for most use cases.
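Exact token counts depend on the tokenizer, but the common rule of thumb of roughly 4 characters per token for English prose is close enough for sizing chunks. A sketch (a heuristic, not an exact count):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def within_target(chunk_text: str, min_tokens: int = 200, max_tokens: int = 500) -> bool:
    """Check a chunk against the 200-500 token target used in this guide."""
    return min_tokens <= estimate_tokens(chunk_text) <= max_tokens

sample = "word " * 300          # ~1,500 characters of filler text
print(estimate_tokens(sample))  # → 375, inside the target range
```

For billing-accurate counts, use the tokenizer that matches your embedding model instead of this heuristic.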

Heading-Based Chunking (Recommended)

Because your Markdown has proper heading structure from BlazeDocs, you can split on headings to create semantically coherent chunks. Each chunk covers a single topic or section:

import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict

def chunk_markdown_by_headings(markdown: str, source_file: str) -> list[Chunk]:
    """Split Markdown into chunks based on heading boundaries."""
    # Split on h2 and h3 headings
    sections = re.split(r'(?=^#{2,3} )', markdown, flags=re.MULTILINE)
    
    chunks = []
    for section in sections:
        section = section.strip()
        if not section or len(section) < 50:  # Skip tiny sections
            continue
        
        # Extract heading for metadata
        heading_match = re.match(r'^(#{2,3}) (.+)', section)
        heading = heading_match.group(2) if heading_match else "Introduction"
        
        chunks.append(Chunk(
            text=section,
            metadata={
                "source": source_file,
                "heading": heading,
                "char_count": len(section)
            }
        ))
    
    return chunks

Recursive Splitting for Long Sections

Some sections will be too long even after heading-based splitting. Use recursive character splitting as a fallback for sections that exceed your token limit:

def recursive_split(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Split long text into overlapping chunks, preferring paragraph boundaries."""
    if len(text) <= max_chars:
        return [text]
    
    # Try to split on paragraph boundaries first
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = ""
    
    for para in paragraphs:
        if len(current_chunk) + len(para) > max_chars and current_chunk:
            chunks.append(current_chunk.strip())
            # Keep overlap from end of previous chunk (~6 chars per word on average)
            words = current_chunk.split()
            overlap_text = " ".join(words[-(overlap // 6):])
            current_chunk = overlap_text + "\n\n" + para
        else:
            current_chunk += "\n\n" + para if current_chunk else para
    
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    
    # Fall back to a hard character split for any chunk that is still too long
    # (for example, a single paragraph with no blank lines in it).
    final_chunks = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            final_chunks.append(chunk)
        else:
            step = max(1, max_chars - overlap)
            final_chunks.extend(chunk[i:i + max_chars] for i in range(0, len(chunk), step))
    return final_chunks

Chunking Mistakes to Avoid

  • Fixed-size character splitting — Cutting at arbitrary positions breaks sentences and separates related information
  • No overlap — Without overlap, queries that span chunk boundaries miss relevant content
  • Ignoring tables — Tables should be kept as complete units, never split mid-row
  • Discarding metadata — Source filename, heading, and page number are critical for citation and debugging
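The first bullet is easy to see in a toy example: arbitrary 40-character cuts slice through words and sentences, while a simple sentence-boundary split keeps each statement whole.

```python
import re

text = "Refunds are available within 30 days. Contact support to start a claim."

# Fixed-size splitting: cuts fall wherever character 40 happens to land.
fixed = [text[i:i + 40] for i in range(0, len(text), 40)]

# Boundary-aware splitting: break after sentence-ending punctuation instead.
sentences = re.split(r"(?<=[.!?]) ", text)

print(fixed)      # the first piece ends mid-word ("...days. Co")
print(sentences)  # each piece is a complete sentence
```

The same principle scales up: heading and paragraph boundaries are just larger-grained versions of the sentence boundary used here.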

Step 3: Generate Embeddings

Embeddings convert text chunks into numerical vectors that capture semantic meaning, enabling similarity-based retrieval. When a user asks a question, you embed the question with the same model and find the chunks whose vectors are closest.
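"Closest" here usually means highest cosine similarity between vectors. The vector database computes this for you, but a pure-Python sketch makes the idea concrete (the 3-dimensional vectors are toys; real embeddings have 1,000+ dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the question should score higher against the relevant chunk.
question = [0.9, 0.1, 0.0]
relevant_chunk = [0.8, 0.2, 0.1]
unrelated_chunk = [0.0, 0.1, 0.9]

assert cosine_similarity(question, relevant_chunk) > cosine_similarity(question, unrelated_chunk)
```

Because it measures angle rather than magnitude, cosine similarity is insensitive to vector length, which is why most embedding providers recommend it as the default metric.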

from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks: list[Chunk], model: str = "text-embedding-3-small") -> list[dict]:
    """Generate embeddings for a list of chunks."""
    texts = [chunk.text for chunk in chunks]
    
    # Batch embed (API supports up to 2048 inputs)
    response = client.embeddings.create(input=texts, model=model)
    
    results = []
    for i, embedding_data in enumerate(response.data):
        results.append({
            "text": chunks[i].text,
            "embedding": embedding_data.embedding,
            "metadata": chunks[i].metadata
        })
    
    return results

Choosing an Embedding Model

| Model | Dimensions | Best For | Cost |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General purpose, good cost/quality | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | Highest accuracy needs | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | Multilingual documents | Free tier available |
| BGE-large (open source) | 1024 | Self-hosted, no API costs | Free (compute costs) |

For most teams starting out, text-embedding-3-small offers the best balance of quality and cost. You can always re-embed with a larger model later — the rest of the pipeline stays the same.


Step 4: Store in a Vector Database

A vector database stores your embeddings and enables fast similarity search across millions of vectors. For a RAG knowledge base, you need a database that supports metadata filtering (to search within specific document categories) and returns both the vector match score and the original text.

# Example using Pinecone
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("knowledge-base")

def store_embeddings(embedded_chunks: list[dict]):
    """Store embedded chunks in Pinecone."""
    vectors = []
    for i, chunk in enumerate(embedded_chunks):
        vectors.append({
            "id": f"{chunk['metadata']['source']}_{i}",
            "values": chunk["embedding"],
            "metadata": {
                "text": chunk["text"],
                "source": chunk["metadata"]["source"],
                "heading": chunk["metadata"]["heading"]
            }
        })
    
    # Upsert in batches of 100
    for batch_start in range(0, len(vectors), 100):
        batch = vectors[batch_start:batch_start + 100]
        index.upsert(vectors=batch)

# Alternative: pgvector for PostgreSQL users
# pip install pgvector
# CREATE EXTENSION vector;
# CREATE TABLE chunks (
#   id SERIAL PRIMARY KEY,
#   text TEXT,
#   embedding vector(1536),
#   metadata JSONB
# );

Step 5: Build the Retrieval Pipeline

The retrieval pipeline takes a user question, finds the most relevant chunks, and generates an answer using an LLM with those chunks as context. This is where all the previous steps come together.

def answer_question(question: str, top_k: int = 5) -> str:
    """Full RAG pipeline: question → retrieval → answer."""
    
    # 1. Embed the question
    q_embedding = client.embeddings.create(
        input=[question], 
        model="text-embedding-3-small"
    ).data[0].embedding
    
    # 2. Search for relevant chunks
    results = index.query(
        vector=q_embedding,
        top_k=top_k,
        include_metadata=True
    )
    
    # 3. Assemble context from retrieved chunks
    context_parts = []
    for match in results.matches:
        source = match.metadata["source"]
        heading = match.metadata["heading"]
        text = match.metadata["text"]
        context_parts.append(f"[Source: {source} > {heading}]\n{text}")
    
    context = "\n\n---\n\n".join(context_parts)
    
    # 4. Generate answer with LLM
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": 
                "Answer the user's question based on the provided context. "
                "Cite sources using [Source: filename] format. "
                "If the context doesn't contain the answer, say so."},
            {"role": "user", "content": 
                f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    
    return response.choices[0].message.content

# Usage
answer = answer_question("What is our refund policy for enterprise customers?")
print(answer)

Common RAG Pitfalls and How to Avoid Them

Pitfall 1: Garbage In, Garbage Out

The single most common RAG failure is poor PDF extraction. Teams use basic text extraction (PyPDF2, pdfplumber) that mangles tables, drops headings, and scrambles multi-column layouts. The embeddings faithfully represent this garbage, and retrieval faithfully returns it. Fix extraction first — everything else is secondary.

Pitfall 2: Chunks Too Small or Too Large

Chunks under 100 tokens lack context — the embedding can't capture meaning from a sentence fragment. Chunks over 1000 tokens are too broad — they match many queries but dilute the relevant information with surrounding noise. Aim for 200-500 tokens with clear topic focus.

Pitfall 3: No Metadata Filtering

Without metadata, every query searches your entire knowledge base. When a user asks about "2026 pricing," they get chunks from 2024 pricing documents, blog posts mentioning prices, and employee handbook salary information. Store and filter by document type, date, category, and department.
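Most vector databases accept a metadata filter alongside the query vector. A pure-Python stand-in shows the effect (the field names and documents here are illustrative):

```python
def filter_candidates(chunks: list[dict], **required) -> list[dict]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in required.items())
    ]

chunks = [
    {"text": "2026 enterprise pricing tiers...", "metadata": {"doc_type": "pricing", "year": 2026}},
    {"text": "2024 pricing (archived)...",       "metadata": {"doc_type": "pricing", "year": 2024}},
    {"text": "Salary bands by level...",         "metadata": {"doc_type": "handbook", "year": 2026}},
]

hits = filter_candidates(chunks, doc_type="pricing", year=2026)
print(len(hits))  # → 1: only the current pricing document survives the filter
```

With Pinecone, the equivalent is passing a `filter` argument to `index.query(...)` (e.g. `filter={"year": {"$eq": 2026}}`); other vector databases expose the same idea under different syntax.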

Pitfall 4: Not Evaluating Retrieval Quality

Most teams evaluate RAG by reading final answers. But if retrieval returns the wrong chunks, even a perfect LLM generates wrong answers. Evaluate retrieval separately: for each test question, check whether the correct source chunk appears in the top-k results.
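Retrieval quality can be scored with a simple hit rate (recall@k) over a hand-labeled test set. A sketch, where `search` stands in for whatever retrieval function you built in Step 5 and each test case names the chunk ID that should be retrieved:

```python
def recall_at_k(test_cases: list[dict], search, k: int = 5) -> float:
    """Fraction of questions whose labeled correct chunk appears in the top-k results."""
    hits = 0
    for case in test_cases:
        retrieved_ids = search(case["question"], top_k=k)
        if case["expected_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(test_cases)

# Toy search function standing in for a real vector query.
def fake_search(question: str, top_k: int) -> list[str]:
    return ["refund_policy_3", "pricing_1"][:top_k]

cases = [
    {"question": "What is the refund window?", "expected_chunk_id": "refund_policy_3"},
    {"question": "How much is the Pro plan?", "expected_chunk_id": "pricing_2"},
]
print(recall_at_k(cases, fake_search))  # → 0.5
```

Even 20-30 labeled question/chunk pairs are enough to catch retrieval regressions before they show up as wrong answers.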


Frequently Asked Questions

How many PDFs do I need to build a useful RAG knowledge base?

You can build a useful RAG system with as few as 10-20 documents. Quality matters more than quantity. A well-extracted, properly chunked set of 20 key documents outperforms a poorly processed collection of 1,000 PDFs. Start small, validate quality, then scale.

What's the best PDF extraction tool for RAG?

For RAG specifically, you need a tool that outputs structured Markdown with preserved headings, tables, and reading order. BlazeDocs is designed for this use case — its Markdown output provides natural chunk boundaries and preserves the semantic structure that makes retrieval accurate. Tools that output plain text lose the structure needed for intelligent chunking.

How much does it cost to run a RAG system?

For a 500-document knowledge base: PDF conversion with BlazeDocs runs a few dollars total. Embedding with text-embedding-3-small costs under $1 for the entire corpus. Vector database hosting starts at $0-25/month depending on the provider. The ongoing cost is primarily LLM inference for answering queries — roughly $0.01-0.05 per question with GPT-4o.
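The embedding figure checks out with back-of-envelope arithmetic (the per-document token count is an assumed average for illustration):

```python
# Back-of-envelope embedding cost for the 500-document example.
docs = 500
avg_tokens_per_doc = 5_000   # assumption: roughly 10-15 pages of prose per document
price_per_million = 0.02     # text-embedding-3-small, USD per 1M tokens

total_tokens = docs * avg_tokens_per_doc
cost = total_tokens / 1_000_000 * price_per_million
print(f"${cost:.2f}")  # → $0.05
```

Even at 10x that document size, one-time embedding stays under a dollar, which is why query-time LLM inference dominates the ongoing cost.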

Should I use LangChain or build my own RAG pipeline?

For learning and prototyping, build your own — the code is surprisingly simple (as shown above). For production systems with complex requirements (hybrid search, reranking, guardrails), frameworks like LangChain or LlamaIndex save time. The extraction and chunking steps are the same either way.


Start Building Your RAG Knowledge Base

The path from PDFs to a working RAG system is shorter than most tutorials make it seem. The hardest part — clean PDF extraction — is solved by using the right tool upfront. Sign up for BlazeDocs to convert your PDF library to clean Markdown, then follow the steps above to build a RAG pipeline that actually works.

Remember: optimize extraction and chunking before you touch anything else. Clean input data is the single biggest lever for RAG accuracy.
