PDF to Markdown API: Convert Documents Programmatically

If you're building a retrieval-augmented generation (RAG) pipeline, a document processing workflow, or any system that ingests unstructured data, you've almost certainly hit the same wall: PDFs are notoriously difficult to parse. Tables get mangled, headers vanish, and the semantic structure of a document is lost entirely when you try to extract raw text. A purpose-built PDF to Markdown API solves this by preserving document structure—headings, lists, tables, and code blocks—in a format that large language models and vector databases can actually work with.

In this guide, we'll walk through the BlazeDocs conversion API end-to-end. You'll learn how to convert PDF to Markdown programmatically using Python, Node.js, and cURL, then see how to plug the output straight into a RAG pipeline. Whether you're processing five pages a month or five thousand, the API is designed to scale with you.

The BlazeDocs API Endpoint

BlazeDocs exposes a single, powerful endpoint for document conversion. Send a PDF in, receive clean Markdown out. It really is that straightforward.

Endpoint

POST https://blazedocs.io/api/v1/convert

Authentication: Include your API key in the Authorization header as a Bearer token. You can generate a key from your BlazeDocs dashboard after signing up.

Content type: multipart/form-data. Attach your PDF file under the file field.

Response: A JSON object containing the converted Markdown string, page count, and metadata. Full schema details are available in the API documentation.
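As a rough sketch of how a client might validate that response before using it, the helper below checks the two fields this guide's examples read (markdown and pages) and passes everything else through untouched. The names ConversionResult and parse_conversion_response are hypothetical, and the sketch deliberately makes no assumptions about how the metadata is keyed; consult the API documentation for the actual schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ConversionResult:
    markdown: str
    pages: int
    # Every other key in the payload, kept as-is (hypothetical container).
    extra: dict[str, Any] = field(default_factory=dict)


def parse_conversion_response(payload: dict[str, Any]) -> ConversionResult:
    """Validate the two fields used in this guide and keep the rest untouched."""
    markdown = payload.get("markdown")
    pages = payload.get("pages")
    if not isinstance(markdown, str):
        raise ValueError("response has no 'markdown' string")
    if not isinstance(pages, int):
        raise ValueError("response has no integer 'pages' count")
    # Remaining keys (metadata and so on) are passed through without interpretation.
    extra = {k: v for k, v in payload.items() if k not in ("markdown", "pages")}
    return ConversionResult(markdown=markdown, pages=pages, extra=extra)
```

Failing fast on a malformed payload keeps a bad response from silently writing an empty Markdown file further down the pipeline.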

Under the hood, BlazeDocs leverages Mistral AI's OCR engine to extract text with exceptional accuracy—even from scanned documents, complex multi-column layouts, and image-heavy PDFs. The result is Markdown that faithfully mirrors the original document's hierarchy.

Python Example: PDF to Markdown

The most common way to convert PDF to Markdown in Python is with the requests library. Here's a minimal working example:

import requests

API_KEY = "your_api_key_here"
PDF_PATH = "document.pdf"

url = "https://blazedocs.io/api/v1/convert"
headers = {
    "Authorization": f"Bearer {API_KEY}"
}

with open(PDF_PATH, "rb") as f:
    files = {"file": (PDF_PATH, f, "application/pdf")}
    response = requests.post(url, headers=headers, files=files)

if response.status_code == 200:
    data = response.json()
    markdown = data["markdown"]
    print(f"Converted {data['pages']} pages successfully.")
    
    # Save to file
    with open("output.md", "w", encoding="utf-8") as out:
        out.write(markdown)
else:
    print(f"Error: {response.status_code} - {response.text}")

That's all it takes. The response JSON includes a markdown field with the full converted text, a pages count, and additional metadata such as detected language and document title. For batch processing, simply loop over a directory of PDFs and call the endpoint for each file.
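The batch loop mentioned above could be sketched as follows. This is a minimal outline, not an official client: convert_directory is a hypothetical helper, and the convert argument stands in for whatever function you write around the single-file request shown earlier. Failures are collected rather than raised so that one bad PDF does not abort the whole run.

```python
from pathlib import Path
from typing import Callable


def convert_directory(
    src_dir: Path,
    out_dir: Path,
    convert: Callable[[Path], str],
) -> dict[str, str]:
    """Convert every *.pdf in src_dir and write <name>.md files into out_dir.

    Returns a mapping of file name -> error message for the files that failed.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    failures: dict[str, str] = {}
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        try:
            markdown = convert(pdf_path)
        except Exception as exc:  # record the failure and move on to the next file
            failures[pdf_path.name] = str(exc)
            continue
        target = out_dir / (pdf_path.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
    return failures
```

Here convert would wrap the requests call from the example above and return the markdown field. On a rate-limited tier, you would likely want to pace the loop or retry on 429 responses.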

Node.js / TypeScript Example

If you're working in a Node.js or TypeScript environment, you can use the native fetch API (available in Node 18+) along with FormData:

import fs from "node:fs";
import path from "node:path";

const API_KEY = "your_api_key_here";
const PDF_PATH = "document.pdf";

async function convertPdfToMarkdown(filePath: string): Promise<string> {
  const file = new Blob([fs.readFileSync(filePath)], {
    type: "application/pdf",
  });

  const formData = new FormData();
  formData.append("file", file, path.basename(filePath));

  const response = await fetch("https://blazedocs.io/api/v1/convert", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
    },
    body: formData,
  });

  if (!response.ok) {
    throw new Error(`Conversion failed: ${response.status} ${response.statusText}`);
  }

  const data = await response.json();
  console.log(`Converted ${data.pages} pages successfully.`);
  return data.markdown;
}

// Usage
const markdown = await convertPdfToMarkdown(PDF_PATH);
fs.writeFileSync("output.md", markdown);

cURL Example

For quick testing or shell scripts, cURL works perfectly:

curl -X POST https://blazedocs.io/api/v1/convert \
  -H "Authorization: Bearer your_api_key_here" \
  -F "file=@document.pdf" \
  -o response.json

The -F flag sends the file as multipart form data, and -o saves the JSON response to a file. You can then extract the Markdown with jq -r '.markdown' response.json; the -r flag prints the raw string instead of a JSON-quoted one with escaped newlines.

Building a RAG Pipeline with the API

The real power of a PDF to Markdown API shows when you integrate it into a retrieval-augmented generation pipeline. Because the output preserves semantic structure (headings become # markers at the matching level, tables become pipe-delimited grids, lists retain their nesting), you get better chunking and retrieval quality than with raw text extraction.

Here's a Python example that converts a PDF, splits the resulting Markdown into chunks, generates embeddings, and prepares the records for upsert into a vector database:

import requests
from langchain_text_splitters import MarkdownHeaderTextSplitter  # pip install langchain-text-splitters
from openai import OpenAI

# Step 1: Convert PDF to Markdown via BlazeDocs
API_KEY = "your_blazedocs_api_key"

with open("research_paper.pdf", "rb") as f:
    response = requests.post(
        "https://blazedocs.io/api/v1/convert",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("research_paper.pdf", f, "application/pdf")},
    )

response.raise_for_status()  # stop here if the conversion request failed
markdown = response.json()["markdown"]

# Step 2: Split by Markdown headers for semantic chunking
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "title"),
        ("##", "section"),
        ("###", "subsection"),
    ]
)
chunks = splitter.split_text(markdown)

# Step 3: Generate embeddings
client = OpenAI()
documents = []

for i, chunk in enumerate(chunks):
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunk.page_content,
    )
    documents.append({
        "id": f"chunk_{i}",
        "text": chunk.page_content,
        "metadata": chunk.metadata,
        "embedding": embedding.data[0].embedding,
    })

# Step 4: Upsert into your vector database
# e.g., Pinecone, Weaviate, Qdrant, ChromaDB
print(f"Processed {len(documents)} chunks, ready for upsert.")
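Step 4 depends on the vector database you choose, so it is left open above. Before wiring one in, you can sanity-check retrieval over the documents list in memory. The sketch below is an assumption-laden stand-in, not a substitute for a real vector store: cosine_similarity and top_k are hypothetical helpers, and the query embedding is assumed to come from the same embedding model as the chunks.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding: list[float], documents: list[dict], k: int = 3) -> list[dict]:
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

A linear scan like this is fine for a few hundred chunks. Beyond that, an approximate nearest-neighbour index in a dedicated vector database is the usual choice.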

💡 Why Markdown Beats Plain Text for RAG

When you chunk raw text, you lose structural context. A heading like "Methodology" becomes indistinguishable from body copy. With Markdown, the MarkdownHeaderTextSplitter can split on section boundaries and attach each chunk's headings as metadata, preserving the relationship between headings and their content. This tends to improve retrieval accuracy and the relevance of LLM responses.
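To make the idea concrete without any third-party dependency, here is a small standard-library sketch of header-aware splitting. It is not how MarkdownHeaderTextSplitter is implemented; split_by_headings is a hypothetical function that only illustrates the principle of tagging each chunk with the heading path above it. It treats any line starting with # as a heading, so it would misread # comments inside fenced code blocks.

```python
import re

# One to six '#' characters, whitespace, then the heading title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, each tagged with the heading path above it."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its full heading trail, for example ["Paper", "Methodology", "Setup"], which can be stored as metadata or prepended to the chunk text before embedding.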

Handling Tables, Images, and Headers

One of the trickiest aspects of PDF conversion is preserving complex content types. BlazeDocs handles these gracefully:

Tables are converted to proper Markdown pipe tables with aligned columns. Even nested tables and cells with multi-line content are supported. The Mistral AI OCR engine detects table boundaries with high accuracy, so you won't see columns bleeding into one another.
Images are extracted and referenced as Markdown image tags. If the image contains text (e.g., a diagram with labels), OCR is applied to extract that text as well, and it's included as an alt-text description.
Headers and hierarchy are mapped to the appropriate Markdown heading levels (#, ##, ###, etc.) based on font size and weight analysis. This means your converted document retains its original outline structure.
Lists—both ordered and unordered—are detected and converted with proper nesting. Even lists that span page breaks are stitched back together correctly.
Code blocks in technical documents are identified and wrapped in fenced code blocks with language hints where possible.
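Because tables arrive as pipe tables, downstream code can turn them back into structured rows. The sketch below is a deliberately simple parser under stated assumptions: parse_pipe_table is a hypothetical helper, it expects a header row followed by a delimiter row, and it does not handle escaped pipes inside cells.

```python
def parse_pipe_table(table: str) -> list[dict[str, str]]:
    """Parse a simple Markdown pipe table into a list of row dicts."""

    def cells(line: str) -> list[str]:
        # Remove the outer pipes, then split the remainder on the inner ones.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    lines = [line for line in table.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("a pipe table needs a header row and a delimiter row")
    header = cells(lines[0])
    delimiter = cells(lines[1])
    # The second row must be the delimiter row, e.g. | --- | ---: |
    if not all(cell and set(cell) <= set("-:") and "-" in cell for cell in delimiter):
        raise ValueError("second row is not a Markdown table delimiter")
    rows = []
    for line in lines[2:]:
        values = cells(line)
        # Pad short rows so every header still maps to a value.
        values += [""] * (len(header) - len(values))
        rows.append(dict(zip(header, values)))
    return rows
```

Rows come back as plain dictionaries keyed by column header, which makes them easy to load into a DataFrame or serialise as JSON alongside the text chunks.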

For a comprehensive overview of supported elements and edge cases, visit the API documentation.

Rate Limits and Pricing Tiers

BlazeDocs offers flexible pricing to suit everything from side projects to enterprise document processing pipelines. Each plan includes API access with generous rate limits:

Free — 5 pages per month. Perfect for testing the API and evaluating conversion quality before committing. No credit card required.
Starter ($9.99/month) — 100 pages per month. Ideal for individual developers and small projects. Includes priority support via email.
Pro ($29.99/month) — 500 pages per month. Built for teams and production workloads. Includes higher rate limits and webhook notifications for async processing.
Business ($99.99/month) — Unlimited pages. Designed for organisations processing documents at scale. Includes dedicated support, custom SLAs, and the highest rate limits available.

📊 What Counts as a Page?

A "page" is a single page within a PDF document. A 10-page PDF counts as 10 pages against your monthly allowance. If a conversion fails due to a server error, the pages are not deducted from your quota.
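For budgeting, the tier arithmetic can be sketched in a few lines. The prices and allowances below are copied from the list above; cheapest_plan is a hypothetical helper, and it considers page allowance only, not per-minute rate limits or support levels.

```python
from typing import Optional

# (name, monthly price in USD, monthly page allowance; None means unlimited)
PLANS: list[tuple[str, float, Optional[int]]] = [
    ("Free", 0.00, 5),
    ("Starter", 9.99, 100),
    ("Pro", 29.99, 500),
    ("Business", 99.99, None),
]


def cheapest_plan(pages_per_month: int) -> str:
    """Return the lowest-priced plan whose page allowance covers the volume."""
    if pages_per_month < 0:
        raise ValueError("pages_per_month must be non-negative")
    for name, _price, allowance in sorted(PLANS, key=lambda plan: plan[1]):
        if allowance is None or pages_per_month <= allowance:
            return name
    # Unreachable while at least one plan is unlimited.
    raise RuntimeError("no plan covers this volume")
```

For instance, twelve 10-page PDFs a month is 120 pages, which is over the Starter allowance of 100 and so lands on Pro.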

Rate limits are applied per minute and vary by tier. The Free plan allows 10 requests per minute, whilst the Business plan supports up to 200 requests per minute. If you exceed your rate limit, the API returns a 429 Too Many Requests response with a Retry-After header indicating when you can send your next request.
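A client can honour that header with a small retry wrapper. The following is a minimal sketch, assuming Retry-After carries a number of seconds; if the header is missing or holds an HTTP date, the code falls back to exponential backoff. send_with_retry is a hypothetical helper, and send is any zero-argument callable returning an object with status_code and headers attributes, which a requests.Response provides.

```python
import time
from typing import Any, Callable


def send_with_retry(
    send: Callable[[], Any],
    max_attempts: int = 5,
    default_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it returns a non-429 response or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        # Return on success, on any non-429 error, or once attempts are exhausted.
        if response.status_code != 429 or attempt == max_attempts:
            return response
        retry_after = response.headers.get("Retry-After", "")
        try:
            wait = float(retry_after)
        except ValueError:
            # Header missing or not numeric: fall back to exponential backoff.
            wait = default_wait * 2 ** (attempt - 1)
        sleep(max(wait, 0.0))
    return response
```

Usage might look like send_with_retry(lambda: requests.post(url, headers=headers, files={"file": ("document.pdf", pdf_bytes, "application/pdf")})), with the PDF read into pdf_bytes beforehand so that each attempt sends the full file rather than an exhausted file handle.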

Get Started Today

Whether you're building a RAG pipeline, automating document workflows, or simply need a reliable way to convert PDF to Markdown programmatically, BlazeDocs gives you a production-ready API that handles the complexity so you don't have to. Start with the free tier—no credit card required—and upgrade as your needs grow.
