Breaking Through the Document Recognition Bottleneck: How Traditional OCR and Multimodal LLMs Work Better Together
0. Introduction: A Multi-Billion Dollar Problem
The global OCR market is projected to reach $46.09 billion by 2033 (a CAGR of 13.06%). Meanwhile, the Intelligent Document Processing (IDP) market is expanding even faster, at a 33.1% CAGR, and is expected to reach $12.35 billion by 2030.
Behind these numbers lies a stark technology gap: market demand is exploding, but the core OCR technology still falls far short of "good enough."
In the most critical document digitization use case — converting PDFs into beautifully readable EPUB ebooks — Optical Character Recognition (OCR) remains the foundational technology. Yet whether we look at established open-source engines like Tesseract, accuracy-focused tools like PaddleOCR, or commercial cloud-based OCR services, they all share one fatal flaw: a complete lack of semantic understanding of text content.
Without the ability to infer characters from context, traditional OCR struggles to achieve reading-grade accuracy in real-world applications. The most painful failure modes include:
- Image and background confusion: Complex backgrounds, watermarked text, and decorative fonts produce torrents of garbage characters.
- Fragmented chart recognition: Text within data charts, flowcharts, and diagrams — which carry strong logical relationships — gets shattered into isolated, meaningless character fragments. Merged table cells are especially catastrophic.
- Poor handling of vertical and complex layouts: Vertically typeset texts, multi-column layouts, and formula-dense academic papers routinely defeat traditional feature-extraction-based models.
These are not edge cases. They are the everyday reality of PDF documents in the wild.
1. The Numbers Don't Lie: Where Traditional OCR Falls Short
To quantify these shortcomings, let's examine publicly available benchmark data:
1.1 Character Error Rate (CER) Comparison
| Model | Processing Speed | Character Error Rate (CER) | Best For |
|---|---|---|---|
| Tesseract 5 (LSTM) | 8.2 fps (CPU) | 18% | Simple printed text |
| PaddleOCR | 12.7 fps (GPU) | 10% | General documents |
| EasyOCR | Slower | ~9% | Multilingual scenarios |
What does an 18% character error rate mean in practice? On a page with 300 characters, an average of 54 characters will be wrong. For EPUB conversion intended for comfortable reading, this level of accuracy is completely unacceptable.
Even PaddleOCR, the best performer in this group, still produces roughly 30 incorrect characters per page at its 10% CER — nearly one error per line. In a comparative study on Gujarati text recognition, PaddleOCR achieved an F1-Score of 0.938 versus Tesseract's 0.797, a significant gap that highlights just how much variance exists across engines.
1.2 Complex Layouts: Where Things Fall Apart
These numbers were measured under ideal conditions. With real-world complex documents, accuracy degrades dramatically:
- Multi-column layouts: Tesseract frequently fails to separate dual-column text into the correct reading order, interleaving left and right columns into an incomprehensible word soup.
- Tables: Merged cells, multi-row headers, and cross-column spans cause complete structural collapse.
- Mathematical formulas: Dense mathematical symbols, subscripts, and superscripts render traditional OCR virtually useless.
- Vertical text: Aside from specialized engines like Kraken, mainstream OCR tools suffer catastrophic accuracy drops when facing vertically typeset or right-to-left text.
A financial report with complex tables, a math textbook dense with equations, a classical text in vertical typesetting — for traditional OCR, each of these is effectively unreadable.
2. The Other Path: Multimodal LLMs — Powerful but Imperfect
With the advent of large language models, multimodal LLMs (such as GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet) have demonstrated remarkable "vision-to-language" capabilities.
2.1 Semantic Understanding: The MLLM Advantage
In the OmniDocBench benchmark (accepted at CVPR 2025), researchers evaluated models across 9 document types, 4 layout structures, and 3 languages using 1,355 PDF pages. Key findings:
- Gemini 2.0 Flash leads in OCR and Visual Question Answering (VQA) tasks, achieving 43.4% higher accuracy than Mistral OCR, a model specifically trained for OCR.
- GPT-4o and Qwen 2.5 VL reach approximately 75% accuracy on comprehensive document understanding benchmarks.
- Claude excels particularly in formula recognition tasks.
These models possess exceptional semantic comprehension. When encountering blurred or ambiguous characters, they can — like a human reader — infer the correct character from surrounding context. An ink-smudged character that traditional OCR misreads as gibberish can be correctly identified by an LLM that understands the full sentence's meaning.
2.2 But LLMs Have Their Own Achilles' Heel
Sounds like the perfect replacement? In our extensive testing, multimodal LLMs revealed fundamental structural limitations:
1. Bounding Box "Hallucinations"
When asked to parse a complex PDF autonomously, LLMs struggle to produce accurate physical coordinates (bounding boxes) for document elements. When we need to precisely preserve original formatting and accurately crop illustrations, LLMs often hallucinate coordinates — "believing" an image should be at a certain position when the actual location differs by dozens of pixels. This leads to cropped images that cut off important content or include irrelevant text regions.
2. Long-Document Consistency Challenges
For documents exceeding several dozen pages, LLMs processing page-by-page tend to lose global context: chapter numbers may become incorrect, cross-references may break, and footnotes may get confused with body text.
3. Cost and Speed Constraints
At current API pricing, using a flagship LLM to process a 300-page PDF can cost several dollars just for the OCR step alone. Adding multi-round verification further escalates costs.
The takeaway: Relying solely on traditional OCR (skeleton without soul) or purely on multimodal LLMs (soul without skeleton) cannot achieve the best document conversion quality.
3. Our Solution: A "Structure + Semantics" Fusion Architecture
Since both approaches have complementary strengths, why not combine them? In our next-generation PDF conversion engine, we designed a multi-layer collaborative hybrid recognition architecture that organically fuses the precise positioning of structural OCR with the semantic understanding of large language models.
3.1 Architecture Overview
The processing pipeline consists of three core stages:
┌──────────────────────────────────────────────────────────┐
│ Stage 1: Structural Analysis │
│ PDF → Page Rendering (200 DPI) → Structural OCR Engine │
│ → Physical Coordinate Extraction │
│ ● Paragraph boundary anchoring │
│ ● Table region detection │
│ ● Image position marking │
└───────────────────────┬──────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Stage 2: Semantic Recognition │
│ Structured Regions + Page Images → Multimodal LLM │
│ → Semantically-aware Text Output │
│ ● Context-aware character recognition │
│ ● Formula-to-LaTeX conversion │
│ ● Chart content reconstruction │
└───────────────────────┬──────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Stage 3: Intelligent Merging │
│ Structural Coords + Semantic Text → Fusion Algorithm │
│ → High-fidelity Markdown → EPUB │
│ ● Coordinate-text alignment mapping │
│ ● Cross-page paragraph joining │
│ ● Table of contents normalization │
└──────────────────────────────────────────────────────────┘
3.2 Stage 1: Structural OCR — "Finding Positions, Defining Structure"
We employ high-precision structural OCR engines (including specialized layout analysis models like GLM-OCR) to anchor physical elements on each page:
- Paragraph boundaries: Precise rectangular bounding boxes for each text block, using a 0-1000 normalized coordinate system.
- Image regions: A three-layer detection mechanism — PyMuPDF vector graphics detection, embedded raster image extraction, and OCR engine layout analysis — cross-validates image positions for maximum accuracy.
- Table structure: Row and column boundary identification provides the structural skeleton for subsequent semantic parsing.
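Because the bounding boxes live in a 0-1000 normalized space, converting them back to pixels is a simple linear scaling. A minimal sketch (the function name `denormalize_bbox` is ours, for illustration only):

```python
def denormalize_bbox(bbox, page_width, page_height):
    """Convert a 0-1000 normalized box (x0, y0, x1, y1) into
    pixel coordinates for a rendered page of the given size."""
    x0, y0, x1, y1 = bbox
    return (
        x0 / 1000 * page_width,
        y0 / 1000 * page_height,
        x1 / 1000 * page_width,
        y1 / 1000 * page_height,
    )
```

The advantage of the normalized space is that the same box remains valid whether the page was rendered at 200 DPI or 300 DPI.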
A key innovation is our multi-layer cross-validation mechanism: when GLM-OCR's detected image regions differ from the primary model's detections, the system matches them using an IoU (Intersection over Union) threshold of ≥ 0.15. If the fine-grained model detects less than 50% of the primary model's detected area, the system takes their union rather than a simple replacement — ensuring that large image regions are never lost.
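The matching rule above can be sketched in a few lines; the function names here are illustrative, not from our actual codebase:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_detection(primary, fine, iou_threshold=0.15, area_ratio=0.5):
    """Cross-validate a fine-grained box against the primary detection.
    Boxes that overlap enough (IoU >= 0.15) but cover less than 50% of
    the primary area are unioned rather than replacing the primary box,
    so large image regions are never silently shrunk."""
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    if iou(primary, fine) < iou_threshold:
        return primary  # no credible match: trust the primary detection
    if area(fine) < area_ratio * area(primary):
        return (min(primary[0], fine[0]), min(primary[1], fine[1]),
                max(primary[2], fine[2]), max(primary[3], fine[3]))
    return fine  # strong, comparably sized match: accept the refinement
```

For example, a fine-grained box covering only the top-left quarter of a primary detection overlaps it with IoU 0.25 but fails the 50% area check, so the merged result keeps the full primary extent.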
3.3 Stage 2: Multimodal LLM — "Understanding Content, Correcting Errors"
We feed the structurally-marked coordinates along with page images to the multimodal LLM (Gemini Flash), leveraging its semantic reasoning to handle critical tasks:
- Context-aware text recognition: The LLM doesn't just "see" individual characters — it understands the semantics of entire paragraphs and pages. A blurry character is no longer an isolated pixel cluster but a linguistic unit with full context.
- Mathematical formula conversion: Complex mathematical symbols are automatically converted to standard LaTeX format (`$...$` for inline, `$$...$$` for display), while simple fractions (1/4, 2/3) are kept as plain text to avoid over-formatting.
- Intelligent chart reconstruction: For data charts and flowcharts, the LLM can understand the logical meaning and generate structured descriptive text.
- Automatic language detection: The system analyzes character distribution across 5 randomly sampled pages to automatically detect the document's language (supporting 10+ languages including Chinese, Japanese, Korean, Russian, Arabic, and more), ensuring the LLM recognizes text in the correct language rather than translating it.
To control costs, we optimize image transmission aggressively: page rendering at 200 DPI (instead of the typical 300 DPI), JPEG encoding at 85% quality, and maximum dimension capped at 2,048 pixels. This combination reduces network transfer per conversion by approximately 90% with negligible impact on recognition accuracy.
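The transmission step can be sketched with Pillow; the helper name and choice of LANCZOS resampling are assumptions for illustration:

```python
from io import BytesIO

from PIL import Image

MAX_DIM = 2048       # longest side sent to the LLM, in pixels
JPEG_QUALITY = 85    # encoding quality for the transmitted page

def encode_page_for_llm(img):
    """Downscale a rendered page so its longest side is <= 2048 px,
    then JPEG-encode it at 85% quality before API transmission."""
    scale = MAX_DIM / max(img.size)
    if scale < 1:
        img = img.resize(
            (round(img.width * scale), round(img.height * scale)),
            Image.LANCZOS,
        )
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=JPEG_QUALITY)
    return buf.getvalue()
```

A 200 DPI A4 page renders at roughly 1700 × 2200 pixels, so only the longer side gets clamped; the JPEG re-encoding does most of the byte savings.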
3.4 Stage 3: Intelligent Merging — The Decisive Algorithm
When structural OCR's detection boxes conflict with the LLM's reading flow, which one wins? This is the most challenging engineering problem in the entire pipeline. We developed a proprietary multi-stage result merging algorithm with the following core logic:
Batch processing: Pages are merged in batches of 40. The first batch carries full table-of-contents information (automatically extracted by scanning the first 40 pages) as a structural reference; subsequent batches carry a 500-character summary of the previous batch, the list of already-processed chapters, and style guidelines — ensuring consistency across the entire document.
Cross-page paragraph joining: When a paragraph spans two pages, the algorithm automatically detects and joins it into a complete paragraph, performing only concatenation without rewriting any of the original text.
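One simple heuristic for detecting such splits (our production rules are more involved; this sketch is illustrative): join only when the previous page's last block lacks sentence-final punctuation and the next page's first block starts in lowercase.

```python
import re

# Sentence-final punctuation, covering Latin and CJK full stops.
SENTENCE_END = re.compile(r'[.!?。！？」』"\')\]]$')

def join_cross_page(prev_last, next_first):
    """Return the joined paragraph if the page break appears to have
    split one paragraph in two, else None. Pure concatenation: the
    original text is never rewritten."""
    if SENTENCE_END.search(prev_last.rstrip()):
        return None  # previous paragraph looks complete
    if next_first[:1].islower():
        return prev_last.rstrip() + " " + next_first.lstrip()
    return None
```

A dangling clause like "trained on a" followed by "large corpus of documents." triggers the join; a block ending in a period does not.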
Heading level normalization: OCR-output heading levels are frequently inaccurate (H1/H3 confusion is the norm). The merging algorithm uses the table of contents as the authoritative source and enforces correct heading levels.
Footnote dual-classification: The system intelligently distinguishes two types of footnotes — academic footnotes with content (converted to [^N] format) and superscript-only reference markers (preserved as <sup>N</sup> format) — preventing the generation of empty footnote references.
Checkpoint recovery: After each batch completes, results are automatically persisted. If the process encounters API timeouts or service anomalies, the system can resume from the last successful batch without starting over.
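A stripped-down version of the checkpointing loop, assuming a JSON state file and a `merge_batch` callable (both names are illustrative):

```python
import json
from pathlib import Path

def run_batches(pages, merge_batch, checkpoint="merge_state.json", batch_size=40):
    """Process pages in fixed-size batches, persisting the state after
    each one so an interrupted run resumes from the last completed
    batch instead of starting over."""
    path = Path(checkpoint)
    state = json.loads(path.read_text()) if path.exists() else {"done": 0, "results": []}
    batches = [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]
    for i in range(state["done"], len(batches)):
        state["results"].append(merge_batch(batches[i]))  # may raise on API timeout
        state["done"] = i + 1
        path.write_text(json.dumps(state))  # checkpoint after each batch
    return state["results"]
```

If the process dies mid-run, re-invoking `run_batches` with the same checkpoint path skips every batch already recorded in the state file.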
4. Engineering in Practice: Lessons from the Optimization Journey
Getting from architectural design to production-ready took us through extensive trial and error. Here are the key engineering decisions and the reasoning behind them:
4.1 Model Selection: No Silver Bullet
We systematically benchmarked all major open-source OCR models and proprietary multimodal LLMs, building a diverse test corpus spanning multiple languages and layout densities. Our final "golden combination":
- Page-level OCR: Gemini Flash (fast response, strong semantic understanding, moderate cost)
- Fine-grained layout analysis: GLM-OCR (high coordinate precision for image regions)
- Multi-page merging and deep understanding: Gemini Pro (supports up to 65,536 output tokens, suitable for long-document batch merging)
We also built a multi-layer fallback chain: when the primary model's API is unavailable, the system automatically switches to backup providers, ensuring high availability.
4.2 Three-Tier Image Cropping Strategy
Precise image cropping is crucial for EPUB quality. We implemented a three-tier priority strategy:
| Priority | Method | Use Case | Advantage |
|---|---|---|---|
| 1 | Direct extraction (xref) | Embedded raster images in PDF | Zero-loss, original resolution |
| 2 | Region rendering (Clip) | Native PDF vector regions | High fidelity, vector support |
| 3 | Coordinate cropping (Bbox) | LLM-detected regions | Flexible fallback |
Cropped images undergo additional optimization: maximum width of 1,200 pixels, JPEG quality at 80%, and total file size capped at 1MB. Oversized images are iteratively scaled to 80% until they meet constraints.
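The iterative scaling loop can be sketched as follows (assuming Pillow; the 10-pixel floor is an illustrative safety stop, not a production constant):

```python
from io import BytesIO

from PIL import Image

MAX_WIDTH, QUALITY, MAX_BYTES = 1200, 80, 1_000_000

def optimize_figure(img):
    """Cap width at 1200 px, JPEG-encode at quality 80, then keep
    rescaling to 80% of the current size until the encoded file
    fits under 1 MB."""
    if img.width > MAX_WIDTH:
        ratio = MAX_WIDTH / img.width
        img = img.resize((MAX_WIDTH, round(img.height * ratio)), Image.LANCZOS)
    while True:
        buf = BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=QUALITY)
        if buf.tell() <= MAX_BYTES or min(img.size) < 10:
            return buf.getvalue()  # fits, or too small to shrink further
        img = img.resize((round(img.width * 0.8), round(img.height * 0.8)))
```

For typical book illustrations the loop runs zero or one times; only very dense full-page scans need repeated passes.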
4.3 Anomaly Detection and Auto-Recovery
OCR occasionally produces abnormally short output on certain pages (e.g., "silent pages" with only a few characters recognized). We implemented automatic anomaly detection: when single-page OCR output falls below 50 characters, the system triggers an automatic retry, takes the longer of the two results, and merges token usage statistics from both attempts.
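A minimal sketch of that retry logic, with `ocr_call` standing in for the real API wrapper:

```python
MIN_CHARS = 50  # below this, a page is treated as a "silent page"

def ocr_with_retry(page_image, ocr_call):
    """Retry a page whose OCR output looks abnormally short, keep
    the longer of the two results, and sum token usage from both
    attempts."""
    text, usage = ocr_call(page_image)
    if len(text) >= MIN_CHARS:
        return text, usage
    retry_text, retry_usage = ocr_call(page_image)
    best = retry_text if len(retry_text) > len(text) else text
    return best, usage + retry_usage
```

Taking the longer output is a deliberate bias: on silent pages, extra recognized text is almost always a recovery, not noise.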
4.4 Concurrency and Performance
To balance speed against cost, we carefully tuned concurrency parameters:
- Page-level concurrency: Up to 5 pages processed simultaneously (avoiding API rate limits)
- GLM-OCR concurrency control: 1 concurrent request per API key (semaphore-based), with exponential backoff retry (2s to 30s)
- API timeouts: 2 minutes for standard operations, 10 minutes for long operations (e.g., large batch merging)
- Maximum token output: 16,384 tokens for page-level OCR, 65,536 tokens for batch merging
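Combined, the per-key semaphore and the exponential backoff policy look roughly like this in asyncio (a sketch, not our production client):

```python
import asyncio
import random

async def call_with_backoff(request, sem, retries=5, base=2.0, cap=30.0):
    """Serialize requests per API key with a semaphore, and retry
    failures with exponential backoff from `base` seconds up to a
    `cap`-second ceiling, plus jitter to avoid thundering herds."""
    async with sem:
        for attempt in range(retries):
            try:
                return await request()
            except Exception:
                if attempt == retries - 1:
                    raise  # out of retries: surface the error
                delay = min(cap, base * 2 ** attempt)
                await asyncio.sleep(delay + random.uniform(0, 1))
```

With the defaults this waits roughly 2 s, 4 s, 8 s, 16 s (capped at 30 s) between attempts; a `Semaphore(1)` per API key enforces the single-concurrent-request rule.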
5. Real-World Results: Before and After
After continuous optimization, our hybrid architecture delivers significant improvements across challenging document types:
| Document Type | Traditional OCR | Pure LLM | Our Hybrid Approach |
|---|---|---|---|
| Two-column academic papers | Left/right columns interleaved, reading order broken | Text accurate but image positions shifted | Correct reading order + precise image cropping |
| Math textbooks | Formulas almost entirely garbled | Formulas correct but inline/display confusion | Correct LaTeX conversion + consistent formatting |
| Vertically typeset classical texts | Reading direction completely wrong | Text correct but original layout lost | Correct recognition + structure preservation |
| Financial reports | Complex tables completely collapsed | Table content accurate but coordinates shifted | Precise table structure + correct data |
| Scanned legacy documents | Abundant garbled text and recognition errors | Semantic inference corrected most errors | Very low error rate + complete text-image correspondence |
6. Conclusion: Unifying Skeleton and Soul
Traditional OCR gives us the document's "skeleton" — precise coordinates, rigorous structure, and reliable physical positioning. Multimodal LLMs give the document a living "soul" — deep semantic understanding, intelligent inference of ambiguous information, and structured reconstruction of complex content.
In this era of rapidly evolving AI, our experience proves that the path to delivering the best product lies not in blindly worshipping any single technology, but in engineering the right combination of complementary approaches.
The next time you encounter a complex, image-rich PDF and want to transform it into an EPUB ebook that flows beautifully on your Kindle — behind the scenes, it's this fusion engine of traditional OCR and multimodal LLMs working together to make it happen.
Data sources:
- OCR Accuracy Comparison 2025: Benchmark Analysis
- OmniDocBench: CVPR 2025 Document Parsing Benchmark
- OmniAI OCR Benchmark
- Optical Character Recognition Market Statistics 2033
- Intelligent Document Processing Market Size Report 2030
- PaddleOCR vs Tesseract: Comparative Performance Analysis
- 9 Biggest OCR Limitations And How To Overcome Them
- Comparative Analysis of AI OCR Models