Tutorial
14 min read

How to Build a RAG Knowledge Base from PDF Documents (Step-by-Step)

Complete walkthrough from raw PDFs to a working RAG system. Learn extraction, chunking, embedding, and retrieval for building production-grade knowledge bases.

BlazeDocs Team


rag, knowledge base, tutorial, pdf extraction, vector database, embeddings

Retrieval-augmented generation turns your PDF documents into an AI-powered knowledge base that answers questions with cited sources. But most RAG tutorials skip the hardest part: getting clean text out of PDFs in the first place. This guide walks through every step — from raw PDF files to a working RAG system — with practical code examples and the decisions that matter at each stage.

By the end, you'll have a complete mental model of the RAG pipeline and enough code to build a working prototype over a weekend. We'll use Python throughout, with BlazeDocs for PDF extraction and open-source tools for everything else.


RAG Architecture Overview

A RAG knowledge base has five components: document extraction, chunking, embedding, storage, and retrieval. Each step transforms your data from one form to another, and mistakes at early stages compound through the entire pipeline. Here's the flow:

PDFs → [Extraction] → Markdown → [Chunking] → Text Chunks 
→ [Embedding] → Vectors → [Storage] → Vector DB → [Retrieval] → Context → [LLM] → Answer

Most teams spend 80% of their optimization time on the last two steps (retrieval and LLM prompting) when the biggest accuracy gains come from the first two (extraction and chunking). Clean input data is the highest-leverage improvement you can make.
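To make the flow concrete, here is a minimal end-to-end sketch with stub components standing in for the real tools introduced below. Every name and the toy bag-of-words "embedding" are illustrative only; the point is how the five stages hand data to each other.

```python
# Stub pipeline: each function stands in for one stage of the diagram above.

VOCAB = ["refund", "enterprise", "pricing", "tier"]

def extract(pdf_name: str) -> str:
    # Stand-in for PDF -> Markdown extraction.
    return ("## Refunds\nEnterprise customers get a 30-day refund window.\n"
            "## Pricing\nThe Pro tier costs $49 per month.")

def chunk(markdown: str) -> list[str]:
    # Stand-in for heading-based chunking.
    return [s.strip() for s in markdown.split("## ") if s.strip()]

def embed(text: str) -> list[float]:
    # Toy bag-of-words "embedding"; a real system calls an embedding model here.
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def retrieve(query: str, store: list[tuple[list[float], str]], top_k: int = 1) -> list[str]:
    # Rank stored chunks by dot-product similarity to the query vector.
    q = embed(query)
    ranked = sorted(store, key=lambda item: -sum(a * b for a, b in zip(q, item[0])))
    return [text for _, text in ranked[:top_k]]

# Extraction -> chunks -> vectors, then answer a query against the store.
store = [(embed(c), c) for c in chunk(extract("policy.pdf"))]
print(retrieve("What is the refund policy?", store)[0])
```

The real pipeline swaps each stub for a production component, but the data flow stays exactly this shape.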


Step 1: Extract Text from PDFs (The Foundation)

PDF extraction is the most important step in your RAG pipeline because every downstream component depends on the quality of the extracted text. Bad extraction produces bad chunks, which produce bad embeddings, which produce bad retrieval results. No amount of prompt engineering fixes fundamentally broken source data.

Why Markdown Output Matters for RAG

You want your PDF extraction tool to output Markdown specifically (not plain text, not HTML) because Markdown headings give you natural chunk boundaries. A document with proper ## and ### markers can be split into semantic sections that each cover a complete topic. Plain text extraction loses these boundaries, forcing you to use arbitrary character-count splitting that frequently cuts sentences in half or merges unrelated paragraphs.

Converting PDFs with BlazeDocs

BlazeDocs converts PDFs to structured Markdown while preserving headings, tables, lists, and reading order. For building a RAG knowledge base, use the API to batch-convert your document library:

import requests
import os

BLAZEDOCS_API_KEY = os.environ["BLAZEDOCS_API_KEY"]

def convert_pdf_to_markdown(pdf_path: str) -> str:
    """Convert a single PDF to Markdown using BlazeDocs API."""
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://api.blazedocs.io/v1/convert",
            headers={"Authorization": f"Bearer {BLAZEDOCS_API_KEY}"},
            files={"file": f}
        )
    response.raise_for_status()
    return response.text

# Batch convert all PDFs in a directory
pdf_dir = "./source_pdfs"
output_dir = "./markdown_output"
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        markdown = convert_pdf_to_markdown(os.path.join(pdf_dir, filename))
        output_path = os.path.join(output_dir, filename.replace(".pdf", ".md"))
        with open(output_path, "w") as f:
            f.write(markdown)
        print(f"Converted: {filename}")

Step 2: Chunk Documents Intelligently

Chunking is the process of splitting documents into smaller pieces that fit into embedding model context windows and provide focused, retrievable units of information. The goal is chunks that are large enough to be self-contained but small enough to be specific. Target 200-500 tokens per chunk for most use cases.
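Exact token counts depend on the tokenizer, but the common rule of thumb of roughly 4 characters per token for English prose is close enough for sizing chunks. A sketch (a heuristic, not an exact count):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def within_target(chunk_text: str, min_tokens: int = 200, max_tokens: int = 500) -> bool:
    """Check a chunk against the 200-500 token target used in this guide."""
    return min_tokens <= estimate_tokens(chunk_text) <= max_tokens

sample = "word " * 300          # ~1,500 characters of filler text
print(estimate_tokens(sample))  # → 375, inside the target range
```

For billing-accurate counts, use the tokenizer that matches your embedding model instead of this heuristic.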

Heading-Based Chunking (Recommended)

Because your Markdown has proper heading structure from BlazeDocs, you can split on headings to create semantically coherent chunks. Each chunk covers a single topic or section:

import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict

def chunk_markdown_by_headings(markdown: str, source_file: str) -> list[Chunk]:
    """Split Markdown into chunks based on heading boundaries."""
    # Split on h2 and h3 headings
    sections = re.split(r'(?=^#{2,3} )', markdown, flags=re.MULTILINE)
    
    chunks = []
    for section in sections:
        section = section.strip()
        if not section or len(section) < 50:  # Skip tiny sections
            continue
        
        # Extract heading for metadata
        heading_match = re.match(r'^(#{2,3}) (.+)', section)
        heading = heading_match.group(2) if heading_match else "Introduction"
        
        chunks.append(Chunk(
            text=section,
            metadata={
                "source": source_file,
                "heading": heading,
                "char_count": len(section)
            }
        ))
    
    return chunks

Recursive Splitting for Long Sections

Some sections will be too long even after heading-based splitting. Use recursive character splitting as a fallback for sections that exceed your token limit:

def recursive_split(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Split long text into overlapping chunks, preferring paragraph boundaries."""
    if len(text) <= max_chars:
        return [text]
    
    # Try to split on paragraph boundaries first
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = ""
    
    for para in paragraphs:
        if len(current_chunk) + len(para) > max_chars and current_chunk:
            chunks.append(current_chunk.strip())
            # Keep overlap from end of previous chunk (~6 chars per word on average)
            words = current_chunk.split()
            overlap_text = " ".join(words[-(overlap // 6):])
            current_chunk = overlap_text + "\n\n" + para
        else:
            current_chunk += "\n\n" + para if current_chunk else para
    
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    
    # Fall back to a hard character split for any chunk that is still too long
    # (for example, a single paragraph with no blank lines in it).
    final_chunks = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            final_chunks.append(chunk)
        else:
            step = max(1, max_chars - overlap)
            final_chunks.extend(chunk[i:i + max_chars] for i in range(0, len(chunk), step))
    return final_chunks

Chunking Mistakes to Avoid

  • Fixed-size character splitting — Cutting at arbitrary positions breaks sentences and separates related information
  • No overlap — Without overlap, queries that span chunk boundaries miss relevant content
  • Ignoring tables — Tables should be kept as complete units, never split mid-row
  • Discarding metadata — Source filename, heading, and page number are critical for citation and debugging
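The first bullet is easy to see in a toy example: arbitrary 40-character cuts slice through words and sentences, while a simple sentence-boundary split keeps each statement whole.

```python
import re

text = "Refunds are available within 30 days. Contact support to start a claim."

# Fixed-size splitting: cuts fall wherever character 40 happens to land.
fixed = [text[i:i + 40] for i in range(0, len(text), 40)]

# Boundary-aware splitting: break after sentence-ending punctuation instead.
sentences = re.split(r"(?<=[.!?]) ", text)

print(fixed)      # the first piece ends mid-word ("...days. Co")
print(sentences)  # each piece is a complete sentence
```

The same principle scales up: heading and paragraph boundaries are just larger-grained versions of the sentence boundary used here.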

Step 3: Generate Embeddings

Embeddings convert text chunks into numerical vectors that capture semantic meaning, enabling similarity-based retrieval. When a user asks a question, you embed the question with the same model and find the chunks whose vectors are closest.
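"Closest" here usually means highest cosine similarity between vectors. The vector database computes this for you, but a pure-Python sketch makes the idea concrete (the 3-dimensional vectors are toys; real embeddings have 1,000+ dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the question should score higher against the relevant chunk.
question = [0.9, 0.1, 0.0]
relevant_chunk = [0.8, 0.2, 0.1]
unrelated_chunk = [0.0, 0.1, 0.9]

assert cosine_similarity(question, relevant_chunk) > cosine_similarity(question, unrelated_chunk)
```

Because it measures angle rather than magnitude, cosine similarity is insensitive to vector length, which is why most embedding providers recommend it as the default metric.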

from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks: list[Chunk], model: str = "text-embedding-3-small") -> list[dict]:
    """Generate embeddings for a list of chunks."""
    texts = [chunk.text for chunk in chunks]
    
    # Batch embed (API supports up to 2048 inputs)
    response = client.embeddings.create(input=texts, model=model)
    
    results = []
    for i, embedding_data in enumerate(response.data):
        results.append({
            "text": chunks[i].text,
            "embedding": embedding_data.embedding,
            "metadata": chunks[i].metadata
        })
    
    return results

Choosing an Embedding Model

| Model | Dimensions | Best For | Cost |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General purpose, good cost/quality | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | Highest accuracy needs | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | Multilingual documents | Free tier available |
| BGE-large (open source) | 1024 | Self-hosted, no API costs | Free (compute costs) |

For most teams starting out, text-embedding-3-small offers the best balance of quality and cost. You can always re-embed with a larger model later — the rest of the pipeline stays the same.


Step 4: Store in a Vector Database

A vector database stores your embeddings and enables fast similarity search across millions of vectors. For a RAG knowledge base, you need a database that supports metadata filtering (to search within specific document categories) and returns both the vector match score and the original text.

# Example using Pinecone
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("knowledge-base")

def store_embeddings(embedded_chunks: list[dict]):
    """Store embedded chunks in Pinecone."""
    vectors = []
    for i, chunk in enumerate(embedded_chunks):
        vectors.append({
            "id": f"{chunk['metadata']['source']}_{i}",
            "values": chunk["embedding"],
            "metadata": {
                "text": chunk["text"],
                "source": chunk["metadata"]["source"],
                "heading": chunk["metadata"]["heading"]
            }
        })
    
    # Upsert in batches of 100
    for batch_start in range(0, len(vectors), 100):
        batch = vectors[batch_start:batch_start + 100]
        index.upsert(vectors=batch)

# Alternative: pgvector for PostgreSQL users
# pip install pgvector
# CREATE EXTENSION vector;
# CREATE TABLE chunks (
#   id SERIAL PRIMARY KEY,
#   text TEXT,
#   embedding vector(1536),
#   metadata JSONB
# );

Step 5: Build the Retrieval Pipeline

The retrieval pipeline takes a user question, finds the most relevant chunks, and generates an answer using an LLM with those chunks as context. This is where all the previous steps come together.

def answer_question(question: str, top_k: int = 5) -> str:
    """Full RAG pipeline: question → retrieval → answer."""
    
    # 1. Embed the question
    q_embedding = client.embeddings.create(
        input=[question], 
        model="text-embedding-3-small"
    ).data[0].embedding
    
    # 2. Search for relevant chunks
    results = index.query(
        vector=q_embedding,
        top_k=top_k,
        include_metadata=True
    )
    
    # 3. Assemble context from retrieved chunks
    context_parts = []
    for match in results.matches:
        source = match.metadata["source"]
        heading = match.metadata["heading"]
        text = match.metadata["text"]
        context_parts.append(f"[Source: {source} > {heading}]\n{text}")
    
    context = "\n\n---\n\n".join(context_parts)
    
    # 4. Generate answer with LLM
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": 
                "Answer the user's question based on the provided context. "
                "Cite sources using [Source: filename] format. "
                "If the context doesn't contain the answer, say so."},
            {"role": "user", "content": 
                f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    
    return response.choices[0].message.content

# Usage
answer = answer_question("What is our refund policy for enterprise customers?")
print(answer)

Common RAG Pitfalls and How to Avoid Them

Pitfall 1: Garbage In, Garbage Out

The single most common RAG failure is poor PDF extraction. Teams use basic text extraction (PyPDF2, pdfplumber) that mangles tables, drops headings, and scrambles multi-column layouts. The embeddings faithfully represent this garbage, and retrieval faithfully returns it. Fix extraction first — everything else is secondary.

Pitfall 2: Chunks Too Small or Too Large

Chunks under 100 tokens lack context — the embedding can't capture meaning from a sentence fragment. Chunks over 1000 tokens are too broad — they match many queries but dilute the relevant information with surrounding noise. Aim for 200-500 tokens with clear topic focus.

Pitfall 3: No Metadata Filtering

Without metadata, every query searches your entire knowledge base. When a user asks about "2026 pricing," they get chunks from 2024 pricing documents, blog posts mentioning prices, and employee handbook salary information. Store and filter by document type, date, category, and department.
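Most vector databases accept a metadata filter alongside the query vector. A pure-Python stand-in shows the effect (the field names and documents here are illustrative):

```python
def filter_candidates(chunks: list[dict], **required) -> list[dict]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in required.items())
    ]

chunks = [
    {"text": "2026 enterprise pricing tiers...", "metadata": {"doc_type": "pricing", "year": 2026}},
    {"text": "2024 pricing (archived)...",       "metadata": {"doc_type": "pricing", "year": 2024}},
    {"text": "Salary bands by level...",         "metadata": {"doc_type": "handbook", "year": 2026}},
]

hits = filter_candidates(chunks, doc_type="pricing", year=2026)
print(len(hits))  # → 1: only the current pricing document survives the filter
```

With Pinecone, the equivalent is passing a `filter` argument to `index.query(...)` (e.g. `filter={"year": {"$eq": 2026}}`); other vector databases expose the same idea under different syntax.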

Pitfall 4: Not Evaluating Retrieval Quality

Most teams evaluate RAG by reading final answers. But if retrieval returns the wrong chunks, even a perfect LLM generates wrong answers. Evaluate retrieval separately: for each test question, check whether the correct source chunk appears in the top-k results.
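Retrieval quality can be scored with a simple hit rate (recall@k) over a hand-labeled test set. A sketch, where `search` stands in for whatever retrieval function you built in Step 5 and each test case names the chunk ID that should be retrieved:

```python
def recall_at_k(test_cases: list[dict], search, k: int = 5) -> float:
    """Fraction of questions whose labeled correct chunk appears in the top-k results."""
    hits = 0
    for case in test_cases:
        retrieved_ids = search(case["question"], top_k=k)
        if case["expected_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(test_cases)

# Toy search function standing in for a real vector query.
def fake_search(question: str, top_k: int) -> list[str]:
    return ["refund_policy_3", "pricing_1"][:top_k]

cases = [
    {"question": "What is the refund window?", "expected_chunk_id": "refund_policy_3"},
    {"question": "How much is the Pro plan?", "expected_chunk_id": "pricing_2"},
]
print(recall_at_k(cases, fake_search))  # → 0.5
```

Even 20-30 labeled question/chunk pairs are enough to catch retrieval regressions before they show up as wrong answers.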


Frequently Asked Questions

How many PDFs do I need to build a useful RAG knowledge base?

You can build a useful RAG system with as few as 10-20 documents. Quality matters more than quantity. A well-extracted, properly chunked set of 20 key documents outperforms a poorly processed collection of 1,000 PDFs. Start small, validate quality, then scale.

What's the best PDF extraction tool for RAG?

For RAG specifically, you need a tool that outputs structured Markdown with preserved headings, tables, and reading order. BlazeDocs is designed for this use case — its Markdown output provides natural chunk boundaries and preserves the semantic structure that makes retrieval accurate. Tools that output plain text lose the structure needed for intelligent chunking.

How much does it cost to run a RAG system?

For a 500-document knowledge base: PDF conversion with BlazeDocs runs a few dollars total. Embedding with text-embedding-3-small costs under $1 for the entire corpus. Vector database hosting starts at $0-25/month depending on the provider. The ongoing cost is primarily LLM inference for answering queries — roughly $0.01-0.05 per question with GPT-4o.
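The embedding figure checks out with back-of-envelope arithmetic (the per-document token count is an assumed average for illustration):

```python
# Back-of-envelope embedding cost for the 500-document example.
docs = 500
avg_tokens_per_doc = 5_000   # assumption: roughly 10-15 pages of prose per document
price_per_million = 0.02     # text-embedding-3-small, USD per 1M tokens

total_tokens = docs * avg_tokens_per_doc
cost = total_tokens / 1_000_000 * price_per_million
print(f"${cost:.2f}")  # → $0.05
```

Even at 10x that document size, one-time embedding stays under a dollar, which is why query-time LLM inference dominates the ongoing cost.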

Should I use LangChain or build my own RAG pipeline?

For learning and prototyping, build your own — the code is surprisingly simple (as shown above). For production systems with complex requirements (hybrid search, reranking, guardrails), frameworks like LangChain or LlamaIndex save time. The extraction and chunking steps are the same either way.


Start Building Your RAG Knowledge Base

The path from PDFs to a working RAG system is shorter than most tutorials make it seem. The hardest part — clean PDF extraction — is solved by using the right tool upfront. Sign up for BlazeDocs to convert your PDF library to clean Markdown, then follow the steps above to build a RAG pipeline that actually works.

Remember: optimize extraction and chunking before you touch anything else. Clean input data is the single biggest lever for RAG accuracy.
