Technical Guide
14 min read

Why PDFs Break LLMs: The Technical Explanation

A deep technical explanation of why the PDF format is fundamentally incompatible with LLMs: the PostScript rendering model, coordinate-based layout, and missing semantic structure.

BlazeDocs Team


Tags: pdf, llm, technical, postscript, ai, document format

Every developer building AI systems hits the same wall: PDFs don't work well with language models. Tables come out scrambled. Headings disappear. Multi-column layouts produce interleaved gibberish. But why? The answer is that PDF is a page description language designed for printers, not a document format designed for reading. The mismatch between what a PDF actually contains and what an LLM needs is fundamental, not incidental.

This post is the technical explanation. We'll look at what's actually inside a PDF file, why extracting structured text from it is an unsolved (and possibly unsolvable) problem, and what this means for anyone building AI systems that need to process documents.

The Short Answer

PDFs break LLMs because the PDF format contains drawing instructions, not document structure. A PDF tells a renderer "draw the letter A at coordinates (72, 680)" — it does not say "this is a heading" or "this is a table cell." Extracting semantic structure from drawing coordinates is a fundamentally lossy process, and that lost structure is exactly what LLMs need to understand documents correctly.


The PostScript Heritage: PDFs Are Print Instructions

PDF stands for Portable Document Format, but a more accurate name would be Portable Display Format. The format descends from Adobe's PostScript, a page description language developed in the early 1980s to drive laser printers. PostScript's job was to tell a printer exactly where to place ink on paper. PDF, introduced in 1993, adapted this model for screens.

At its core, a PDF file is a sequence of drawing operations. Here's a simplified example of what a PDF content stream actually looks like:

BT                          % Begin text object
/F1 24 Tf                   % Set font to F1 at 24pt
72 720 Td                   % Move to coordinates (72, 720)
(Quarterly Report) Tj       % Draw the string "Quarterly Report"
/F2 12 Tf                   % Switch to font F2 at 12pt
0 -40 Td                    % Move down 40pt (Td is relative) to (72, 680)
(Revenue increased by 15%) Tj  % Draw this string
ET                          % End text object

Notice what's present: font names, sizes, coordinates, and raw strings. Notice what's completely absent: any indication that "Quarterly Report" is a heading, that it's the document title, or that it's semantically different from the body text below it. The only clue is the font size (24pt vs 12pt), and inferring document structure from font metrics is unreliable at best.

This is the foundational problem. A PDF contains no semantic layer. Every structural element — headings, paragraphs, lists, tables — exists only as visual appearance, not as labeled document components.


Coordinate-Based Layout: No Reading Order Guarantee

When you read a document, you follow a natural order: top to bottom, left to right (in English). You intuitively handle multi-column layouts by reading down one column before moving to the next. You know that a sidebar is separate from the main text.

A PDF has no concept of reading order. Text elements are positioned by absolute coordinates on the page. A PDF creator can draw text in any order — the body text first, then the header, then a footnote, then back to body text. The visual result looks correct because each piece is placed at the right coordinates, but the underlying content stream is jumbled.

Consider a two-column academic paper. The PDF might contain the text of both columns interleaved — a line from column one, then a line from column two, then back to column one. Or it might contain all of column one followed by all of column two. Or some completely different ordering. The PDF specification does not require any particular text ordering as long as the visual rendering is correct.

When a text extraction tool tries to recover the reading order, it has to sort text elements by their coordinates and guess where line breaks, column breaks, and paragraph boundaries occur. This heuristic (sketched in code after the list below) works for simple single-column documents but fails routinely on:

  • Multi-column layouts: Text from different columns gets interleaved
  • Sidebars and callout boxes: Mixed into the main text flow
  • Headers and footers: Inserted at unpredictable positions
  • Footnotes: Merged with body text or lost entirely
  • Rotated text: Coordinate-based sorting fails completely
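
Here's a minimal sketch of that coordinate-sorting heuristic in Python. The fragment tuples are invented stand-ins for the (x, y, text) runs a parser such as pdfminer emits; real extractors layer on many more special cases, but the core logic is this simple, and this fragile:

LINE_TOLERANCE = 3  # points; fragments within this vertical distance share a line

def recover_reading_order(fragments):
    """Sort text fragments top-to-bottom, then left-to-right."""
    # PDF y-coordinates grow upward, so sort by descending y first.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    lines, current, last_y = [], [], None
    for x, y, text in ordered:
        if last_y is not None and abs(y - last_y) > LINE_TOLERANCE:
            lines.append(" ".join(current))
            current = []
        current.append(text)
        last_y = y
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)

# On a two-column page this interleaves the columns line by line,
# producing exactly the gibberish described above:
fragments = [
    (72, 700, "Column one, line one."), (300, 700, "Column two, line one."),
    (72, 688, "Column one, line two."), (300, 688, "Column two, line two."),
]
print(recover_reading_order(fragments))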

Why Tables Are the Hardest Problem

Tables in PDFs are not tables. They are collections of text fragments positioned near horizontal and vertical lines. There is no "table" object in the PDF specification. There are no cells, rows, or columns. There are just characters drawn at coordinates, optionally near some drawn lines.

To reconstruct a table from a PDF, an extraction tool must:

  1. Detect that a table exists — usually by finding a grid of lines or a consistent alignment pattern
  2. Identify the boundaries of each cell by mapping line segments to a grid structure
  3. Associate text fragments with cells by checking which text coordinates fall within which cell boundaries
  4. Determine reading order within cells — multi-line cell content needs to be read top-to-bottom
  5. Handle merged cells where a cell spans multiple rows or columns
  6. Distinguish header rows from data rows (often impossible without visual cues)

Each of these steps involves heuristics that can fail. A table without visible gridlines (common in modern design) makes steps 1 and 2 much harder. Merged cells break the grid assumption. Multi-line cell content can be misassigned to adjacent cells. And header detection is essentially guesswork.
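
As an illustration of step 3, here's a simplified Python sketch that assigns fragments to cells by coordinate containment. Both inputs are hypothetical; in a real extractor the cell rectangles would come from the line detection in steps 1 and 2:

def assign_to_cells(fragments, cells):
    """Assign each (x, y, text) fragment to the cell rectangle containing it."""
    table = {cell: [] for cell in cells}
    for x, y, text in fragments:
        for cell in cells:
            x0, y0, x1, y1 = cell
            if x0 <= x <= x1 and y0 <= y <= y1:
                table[cell].append(text)
                break  # first containing cell wins
        # Fragments that match no rectangle are silently dropped; merged cells
        # and borderless tables break the clean grid this code assumes.
    return table

cells = [(72, 600, 200, 620), (200, 600, 328, 620)]  # one row, two columns (hypothetical)
fragments = [(80, 610, "Q1"), (210, 610, "$1.2M"), (350, 610, "stray note")]
print(assign_to_cells(fragments, cells))  # the stray fragment simply vanishes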

The Table Extraction Problem in Numbers

Academic benchmarks show that even state-of-the-art table extraction achieves 85-90% cell-level accuracy on well-formatted tables. For complex tables with merged cells and no gridlines, accuracy drops to 60-70%. This means that roughly 1 in 3 complex table cells will be wrong — enough to make financial data, scientific results, or legal references unreliable.


The Font Encoding Nightmare

Even extracting basic text from a PDF is harder than it should be, thanks to font encoding. A PDF can use custom font encodings where character codes don't map to standard Unicode values. The letter "A" might be stored as character code 42 in a custom encoding. Without the correct mapping table (called a ToUnicode CMap), extraction tools produce garbage characters.

Many PDF creators embed subset fonts — only the characters actually used in the document. These subsets often use non-standard encodings for efficiency. The font's internal name might be "ABCDEF+TimesNewRoman" where the prefix is randomly generated. If the ToUnicode mapping is missing or incomplete (which happens more often than you'd expect), text extraction produces output like "Wkh txlfn eurzq ira" instead of "The quick brown fox."

Ligatures add another layer of complexity. When "fi" is rendered as a single ligature glyph, some PDFs map it back to "fi" correctly, while others produce a single unrecognized character. The same issue affects "fl", "ff", "ffi", and other common ligatures.
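
A toy Python example makes the failure mode concrete. The shift-by-three encoding here is invented for illustration; real subset fonts assign arbitrary, font-specific glyph codes:

def glyph_code(ch):
    # Hypothetical subset-font encoding: letters shifted by 3, wrapping.
    if ch.isalpha():
        base = ord("A") if ch.isupper() else ord("a")
        return base + (ord(ch) - base + 3) % 26
    return ord(ch)

stored = [glyph_code(c) for c in "The quick brown fox"]  # what the PDF stores

# With the ToUnicode CMap present, each glyph code maps back to Unicode:
to_unicode = {glyph_code(chr(c)): chr(c) for c in range(32, 127)}
print("".join(to_unicode[c] for c in stored))  # The quick brown fox

# Without it, the extractor can only pass the raw codes through:
print("".join(chr(c) for c in stored))         # Wkh txlfn eurzq ira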


Tagged PDFs: The Solution Nobody Uses

The PDF specification actually includes a feature called "Tagged PDF" (the foundation of the PDF/UA accessibility standard) that adds a semantic structure tree to the document. Tags like <H1>, <Table>, <P> can be attached to content, creating exactly the kind of semantic layer that would solve most extraction problems.

In practice, tagged PDFs are extremely rare. Creating properly tagged PDFs requires deliberate effort from the document creator, and most PDF generation tools either don't support tagging or produce incorrect tags. Studies have found that fewer than 5% of PDFs in the wild are properly tagged, and even among those, many have incorrect or incomplete tag structures.

The tagged PDF specification has existed since PDF 1.4, released in 2001. More than two decades of negligible adoption make it clear that tagging is not going to solve this problem. For practical purposes, you cannot rely on PDF tags for text extraction.
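
If you want to check whether a particular PDF is tagged before deciding how to process it, here's a minimal sketch using pypdf. A tagged document sets /Marked true in the catalog's /MarkInfo dictionary and carries a /StructTreeRoot entry:

from pypdf import PdfReader

def is_tagged(path):
    """Return True if the PDF declares a tag structure tree."""
    catalog = PdfReader(path).trailer["/Root"]
    mark_info = catalog.get("/MarkInfo")
    marked = bool(mark_info.get_object().get("/Marked")) if mark_info else False
    return marked and "/StructTreeRoot" in catalog

print(is_tagged("report.pdf"))  # False for the vast majority of real-world PDFs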


Scanned PDFs: Images Pretending to Be Documents

Everything above assumes the PDF contains actual text content. Scanned PDFs contain no text at all — only images of pages. Every character must be recognized by an OCR (Optical Character Recognition) system, adding another entire layer of potential errors.

OCR technology has improved dramatically, but it's still imperfect. Common failure modes include:

  • Similar characters: "l" vs "1" vs "I", "O" vs "0", "rn" vs "m"
  • Low-quality scans: Faded text, skewed pages, shadows from book spines
  • Handwritten annotations: Typically unrecognizable or misread
  • Non-Latin scripts: Lower accuracy for complex writing systems
  • Mixed content: Documents with both printed text and handwriting

For scanned PDFs, you're stacking two lossy processes: OCR to get text, then heuristic extraction to get structure. Errors compound at each stage.
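
As a sketch of the first stage, here's the common open-source pipeline built on pdf2image and pytesseract (both are pip-installable, and both need their system dependencies, Poppler and Tesseract). Note that the output is flat text: every structure-recovery problem described above still applies afterward:

from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("scanned.pdf", dpi=300)  # render each page to an image
text = "\n\n".join(pytesseract.image_to_string(page) for page in pages)
print(text)  # flat OCR text: no headings, no tables, plus any misread characters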


What LLMs Actually Need (And Why PDF Can't Provide It)

Large language models process text as sequences of tokens. They understand structure through inline markers — Markdown headings, HTML tags, or similar lightweight syntax. An LLM needs structured, linear text. A PDF provides unstructured, coordinate-based drawing commands. The gap between these two representations is where all PDF-to-AI problems originate.

Specifically, LLMs need:

  • Linear reading order: Text in the sequence a human would read it — PDFs provide positioned fragments in arbitrary order
  • Explicit structure markers: Headings labeled as headings, list items as list items — PDFs provide only visual differences in font and position
  • Table structure: Rows, columns, and cells explicitly delineated — PDFs provide text near lines (maybe)
  • Semantic relationships: Footnotes linked to references, captions linked to figures — PDFs provide no such links

This is why converting PDFs to Markdown before LLM processing produces dramatically better results. Markdown provides exactly the structured, linear representation that LLMs are designed to process. The conversion step is where the hard work of inferring structure from visual layout happens, and it's best done by specialized tools rather than by the LLM itself.
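
As an illustration, the "Quarterly Report" page from the content stream example might convert to Markdown like this (the table and its values are invented for illustration):

# Quarterly Report

Revenue increased by 15%

| Quarter | Revenue |
|---------|---------|
| Q1      | $1.2M   |
| Q2      | $1.4M   |

Every structural fact an LLM needs (this is the title; this is a table with a header row) is now explicit in the token stream.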


The Practical Solution: Convert Before You Process

Given that the PDF format is fundamentally mismatched with LLM requirements, the solution is straightforward: don't feed PDFs to LLMs. Convert them to Markdown first.

Modern AI-powered conversion tools like BlazeDocs use vision models that "see" the rendered page the way a human does, rather than trying to parse the PDF's content stream directly. This approach sidesteps many of the technical problems described above:

  • Reading order is inferred from visual layout, not content stream order
  • Table structure is recognized from the visual grid, not from coordinate heuristics
  • Headings are identified by visual prominence, not just font metrics
  • Font encoding issues are bypassed when OCR reads the rendered characters

The output is clean Markdown with explicit structure that LLMs can process natively. It's not perfect — no extraction from a lossy format can be — but it's dramatically better than what any LLM can do with raw PDF content.
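
For developers who want to prototype the pattern, here's a minimal sketch using pdf2image and the OpenAI Python SDK. It illustrates the render-then-vision approach in general, not BlazeDocs' actual implementation; a production converter adds prompt design, multi-page stitching, and output validation on top:

import base64, io
from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI()  # assumes OPENAI_API_KEY is set
page = convert_from_path("report.pdf", dpi=200)[0]  # render page 1 to an image
buf = io.BytesIO()
page.save(buf, format="PNG")
data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Convert this page to Markdown. Preserve headings, lists, and tables."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
)
print(response.choices[0].message.content)  # Markdown, ready for downstream LLMs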


Frequently Asked Questions

Why can't AI read PDFs properly?

AI can't read PDFs properly because the PDF format contains drawing instructions (coordinates, fonts, glyphs) rather than document structure (headings, paragraphs, tables). Extracting semantic structure from visual layout is a fundamentally lossy process. The information LLMs need to understand a document — reading order, heading hierarchy, table cell relationships — simply doesn't exist in the PDF file.

Why does ChatGPT mess up tables from my PDF?

Tables in PDFs aren't actually tables — they're text fragments positioned near drawn lines. ChatGPT's PDF parser must reconstruct the table grid from coordinates, which fails on merged cells, borderless tables, and multi-line cell content. Converting the PDF to Markdown first (using a tool like BlazeDocs) produces proper Markdown tables that ChatGPT can read accurately.

Will AI eventually be able to read PDFs natively?

Vision-language models are improving rapidly at "reading" rendered PDF pages as images. However, this approach has inherent resolution and accuracy limitations for dense documents. The most reliable pipeline will likely remain: render the PDF visually, use AI vision to extract structure, and output as Markdown or structured text for LLM consumption. This is essentially what tools like BlazeDocs already do.

Are all PDFs equally problematic for AI?

No. Simple, single-column, text-only PDFs extract reasonably well. The problems scale with document complexity. Financial reports with tables, academic papers with multi-column layouts, legal contracts with nested numbered lists, and scanned documents are all significantly harder. If your PDF has tables, columns, or was created by scanning, you should always convert to Markdown before AI processing.


Stop Fighting the Format

The PDF format is not going to change. It's a 30-year-old standard with trillions of existing documents. But you don't have to accept its limitations in your AI workflow. Convert your PDFs to Markdown and give your LLMs the structured input they're designed to process.

Try BlazeDocs free and see how clean, structured Markdown transforms your AI document processing. The technical problem is real — but the solution is straightforward.

If you're building agent workflows specifically, read our guide to PDF to Markdown for AI agents for a more implementation-focused version.
