The Hardest Part of PDF to EPUB Nobody Talks About: Footnotes and Endnotes
0. Where Did All the Notes Go?
Pick up any academic book, history monograph, or annotated translation. 15–30% of the actual content is in the footnotes and endnotes. A typical non-fiction book has 200–500 annotations; heavy academic stuff goes over 1,000.
Now throw that PDF into Calibre or any online converter and make an EPUB. What happened to the notes?
Gone. Or worse — they turn into random text fragments scattered through your ebook, with the reference anchors broken and no way to jump between the superscript and the explanation.
This isn't some edge case. It's the default behavior of basically every PDF-to-EPUB tool out there.
The reason is straightforward: PDF is page-based, EPUB is flow-based. In a PDF, a footnote sits at the bottom of page 47. In EPUB, there's no "page 47" — content reflows based on screen size and font settings. The spatial link between a superscript "¹" and the note at the page bottom simply doesn't exist in EPUB's DOM.
You can't fix this with just text extraction. You need to actually understand what annotations are, how they're organized across the whole book, and then rebuild them in a completely different format.
We built a 6-stage pipeline to do exactly that. Here's the full breakdown.
1. First Things First: Notes Come in Many Flavors
Before you can process annotations, you need to figure out what you're dealing with. After looking at thousands of real PDFs across languages and genres, we found two independent classification axes:
1.1 Where Notes Live (5 Types)
| Style | What it means | Typical books |
|---|---|---|
| footnote | Definitions at the bottom of the same page | Most humanities textbooks |
| endnote_book | All notes in a dedicated chapter at the end of the book | Academic monographs, translations |
| endnote_chapter | Notes at the end of each chapter | Some social science texts |
| mixed | Both footnotes and endnotes in the same book | Translator notes + author endnotes |
| none | No annotations | Novels, kids' books |
1.2 What Reference Marks Look Like (4 Types)
| Format | Looks like | Unicode Range |
|---|---|---|
| superscript_number | ¹ ² ³ ⁴ ⁵ | U+00B9, U+00B2, U+00B3, U+2074–U+2079 |
| circled_number | ① ② ③ ④ ⑤ | U+2460–U+2473 |
| bracket | [1] [2] [3] | ASCII square brackets |
| symbol | * † ‡ § | Dagger/section marks |
Why bother classifying? Because different combos need completely different processing. A footnote with circled_number marks needs same-page definition recovery. An endnote_book with superscript_number marks needs pre-extraction of the notes chapter plus cross-chapter link generation. One-size-fits-all is guaranteed to fail.
2. Stage 1: Detection During OCR (Zero Extra Cost)
First challenge: detecting annotation info on each page. Traditional OCR just spits out plain text — it has no idea what's a footnote reference vs. body text.
Our approach: piggyback on the existing OCR call. We're already sending each page image to a multimodal LLM (Gemini Flash) for OCR, so we just add a few lines to the prompt asking it to append a structured metadata block:
<!--NOTES_META
has_note_refs: true/false
has_note_defs: true/false
ref_format: superscript_number/circled_number/bracket/symbol/none
def_count: 3
is_notes_section: false
-->
Five fields, each grabbing a specific signal:
- has_note_refs: Any annotation reference marks in the body text?
- has_note_defs: Any annotation definitions on this page?
- ref_format: What format are the reference marks?
- def_count: How many definitions on this page?
- is_notes_section: Is the whole page part of a dedicated notes chapter?
Why This Works
The LLM already "sees" the full page image and understands its layout. Asking it to classify annotations is just a few extra lines in the prompt — no extra API call, no extra image upload, and the extra output is ~50 tokens vs. ~2,000 for the OCR content itself. Marginal cost is effectively zero.
The metadata is wrapped in an HTML comment, so it's invisible to downstream Markdown processing. Parsing is a one-liner:
_META_PATTERN = re.compile(r'<!--NOTES_META\s*\n(.*?)-->', re.DOTALL)
After parsing, strip the block from the OCR text and you've got clean content for the next stages.
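The parse-and-strip step can be sketched in a few lines (`parse_notes_meta` is a hypothetical helper name; the field names come from the block above):

```python
import re

_META_PATTERN = re.compile(r'<!--NOTES_META\s*\n(.*?)-->', re.DOTALL)

def parse_notes_meta(ocr_text):
    """Extract the NOTES_META block as a dict and return (meta, clean_text)."""
    match = _META_PATTERN.search(ocr_text)
    if not match:
        return None, ocr_text
    meta = {}
    for line in match.group(1).strip().splitlines():
        key, _, value = line.partition(':')
        meta[key.strip()] = value.strip()
    # Strip the comment block so downstream Markdown processing never sees it
    clean = _META_PATTERN.sub('', ocr_text).rstrip()
    return meta, clean
```

Values come back as strings (`'true'`, `'3'`); coercing them to bool/int is a one-off detail left out here.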
3. Stage 2: Aggregate and Vote
Per-page metadata is noisy — the LLM will occasionally get a page wrong. The aggregation layer uses statistical heuristics to make a robust book-level call.
3.1 The Decision Tree
It's basically a decision tree with empirically-tuned thresholds:
Input: page_metas (per-page metadata), total_pages
1. ref_pages = pages where has_note_refs = true
2. def_pages = pages where has_note_defs = true
3. notes_section_pages = pages where is_notes_section = true
4. No ref_pages → "none", done
5. If notes_section_pages exist:
a. threshold = total_pages × 0.7
b. end_pages = notes_section_pages where page_num ≥ threshold
c. |end_pages| ≥ |notes_section_pages| × 0.7 → "endnote_book"
d. Otherwise → "endnote_chapter"
6. No def_pages → "unknown" (refs exist but no definitions found)
7. co_occur = ref_page_nums ∩ def_page_nums
|co_occur| ≥ |ref_page_nums| × 0.5 → "footnote"
8. None of the above → "mixed"
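The decision tree above translates almost line-for-line into Python. This sketch assumes each per-page record is a plain dict with the five metadata fields plus a `page_num`:

```python
def classify_note_style(page_metas, total_pages):
    """Book-level note-style vote over per-page metadata dicts."""
    ref_pages = [m for m in page_metas if m.get('has_note_refs')]
    def_pages = [m for m in page_metas if m.get('has_note_defs')]
    section_pages = [m for m in page_metas if m.get('is_notes_section')]

    if not ref_pages:
        return 'none'

    if section_pages:
        # 70% position threshold: only pages in the back of the book count
        threshold = total_pages * 0.7
        end_pages = [m for m in section_pages if m['page_num'] >= threshold]
        # 70% concentration: majority of notes-section pages must be back there
        if len(end_pages) >= len(section_pages) * 0.7:
            return 'endnote_book'
        return 'endnote_chapter'

    if not def_pages:
        return 'unknown'

    # 50% co-occurrence: refs and defs on the same page means footnotes
    ref_nums = {m['page_num'] for m in ref_pages}
    def_nums = {m['page_num'] for m in def_pages}
    if len(ref_nums & def_nums) >= len(ref_nums) * 0.5:
        return 'footnote'
    return 'mixed'
```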
3.2 Why These Thresholds
70% position (step 5a): In a 300-page book, notes section pages need to be after page 210 to count as "back of book." This avoids misclassifying mid-book glossaries or bibliographies.
70% concentration (step 5c): At least 70% of detected notes-section pages must fall in that back-of-book zone. Tolerates a few pages being misidentified by OCR — majority vote wins.
50% co-occurrence (step 7): If half or more of the pages with annotation references also have definitions on the same page, it's footnotes. Robust against the LLM occasionally missing a definition or reference.
3.3 Reference Format: Majority Vote
Simple Counter.most_common():
from collections import Counter

ref_formats = [m.ref_format for m in ref_pages if m.ref_format != 'none']
format_counter = Counter(ref_formats)
primary_ref_format = format_counter.most_common(1)[0][0]
Most frequent format wins. Handles occasional per-page misidentification.
4. Stage 3: Endnote Pre-Extraction (Regex First, AI Fallback)
For endnote_book documents, the notes chapter needs to be extracted and structured before the main text merge — so those pages can be excluded from merge input (no duplication) and the definitions are ready for reconciliation later.
4.1 The Two-Stage Strategy
Regex parsing (fast, deterministic, free)
↓ unreliable?
AI parsing (Gemini Flash, structured JSON)
Why not just use AI? Because regex is free, deterministic, and fast. For well-formatted notes chapters — which is most books — regex is all you need. Save the API call for the messy ones.
4.2 Regex: 4 Heading Patterns, 4 Number Formats
The parser recognizes chapter sub-headings within the notes section:
sub_heading_re = re.compile(
r'(?:^#{1,4}\s+(.+?)$)' # ### Chapter Name
r'|(?:^\*\*(.+?)\*\*\s*$)' # **Chapter Name**
r'|(?:^(第[一二三四五六七八九十百千\d]+[章节篇部]'
r'(?:\s+.+?)?)$)' # 第X章 Title (CJK)
r'|(?:^((?:Chapter|Part)\s+\d+.*)$)', # Chapter N...
re.MULTILINE | re.IGNORECASE
)
And 4 note numbering formats:
note_pattern = re.compile(
    r'(?:^|\n)\s*(\d+)\s*[.、）)\s]\s*(.+?)(?=\n\s*\d+\s*[.、）)\s]|\n\n|\Z)',
    re.DOTALL
)
Matches 1. content, 1、content (Chinese enumeration mark), 1）content (full-width paren), and 1 content (space-delimited).
4.3 Safety Valve: Less Than 3 Notes = Bail
If the regex extracts fewer than 3 notes, it returns None and triggers the AI fallback. This prevents numbered lists in body text from being misidentified as notes.
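Putting the pattern and the bail-out together, the regex tier looks roughly like this (`parse_notes_regex` is a hypothetical name; the real parser also tracks chapter sub-headings):

```python
import re

note_pattern = re.compile(
    r'(?:^|\n)\s*(\d+)\s*[.、）)\s]\s*(.+?)(?=\n\s*\d+\s*[.、）)\s]|\n\n|\Z)',
    re.DOTALL
)

def parse_notes_regex(text):
    """Return {note_number: content}, or None to trigger the AI fallback."""
    notes = {m.group(1): m.group(2).strip() for m in note_pattern.finditer(text)}
    if len(notes) < 3:
        # Probably a numbered list in body text, not a real notes section
        return None
    return notes
```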
4.4 AI Fallback: Structured JSON
When regex can't handle it, we send the notes-section text to Gemini Flash and ask for structured JSON:
{
"chapters": {
"Chapter 1": {
"1": "Note content for reference 1",
"2": "Note content for reference 2"
},
"Chapter 2": {
"1": "Note content for reference 1"
}
}
}
The prompt includes the book's table of contents for reference, so the LLM can match note sub-headings to actual chapter names. Text input is capped at 100,000 characters to stay within context limits.
4.5 Fuzzy Chapter Matching (3 Tiers)
Chapter titles in the notes section almost never exactly match the main text. The EndnoteDatabase handles this with 3-tier matching:
1. Exact match: "Chapter 1: The Beginning" == "Chapter 1: The Beginning"
2. Normalized: strip whitespace/punctuation, then compare
3. Substring: "Chapter 1" ⊂ "Chapter 1: The Beginning" (min length ≥ 2)
Handles things like "Introduction" in the main text vs. "Intro" in the notes.
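The three tiers can be sketched as follows. This is a simplified stand-in for the EndnoteDatabase lookup; the exact normalization rules here (stripping all non-word characters) are an assumption:

```python
import re

def _normalize(title):
    # Drop whitespace and punctuation (ASCII and CJK) for loose comparison
    return re.sub(r'[\W_]+', '', title).lower()

def match_chapter(query, known_titles):
    """3-tier chapter title match: exact -> normalized -> substring."""
    # Tier 1: exact match
    if query in known_titles:
        return query
    # Tier 2: compare after stripping whitespace/punctuation
    nq = _normalize(query)
    for title in known_titles:
        if nq and _normalize(title) == nq:
            return title
    # Tier 3: substring either way, requiring at least 2 characters
    for title in known_titles:
        nt = _normalize(title)
        if min(len(nq), len(nt)) >= 2 and (nq in nt or nt in nq):
            return title
    return None
```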
5. Stage 4: Tell the AI Merge How to Handle Notes
During the Markdown merge phase (where per-page OCR gets consolidated into one coherent doc), we inject annotation-specific instructions into the AI prompt based on the detected note_style.
5.1 The Key Decision: Dual Notation
Two different markup formats, one for each note type:
| Note Type | Markup | Why |
|---|---|---|
| Footnote (definition on same page) | [^N] + [^N]: definition | Standard Markdown footnote — definition stays with reference |
| Endnote (definition elsewhere) | <sup>N</sup> | Raw HTML placeholder — reconciled after merge |
5.2 Different Prompts for Different Styles
endnote_book / endnote_chapter:
- Superscript numbers (¹ ² ³) → use <sup>N</sup>, do NOT convert to [^N]
- Circled numbers (①②③) with same-page definitions → use [^N] with definitions at batch end
- Don't try to find or make up content for endnote references
footnote:
- Convert reference marks to [^N]
- Put definitions at batch end as [^N]: content
- Keep numbering unique and continuous
mixed:
- Page-bottom definition exists → [^N]
- No page-bottom definition → <sup>N</sup>
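One straightforward way to wire this up is a mapping from note_style to prompt fragments. The instruction text here is illustrative, not our exact prompt wording:

```python
NOTE_INSTRUCTIONS = {
    'endnote_book': (
        "Superscript numbers are endnote references: keep them as <sup>N</sup>, "
        "do NOT convert them to [^N]. Circled numbers with same-page definitions "
        "become [^N] with definitions at batch end. Never invent content for "
        "endnote references."
    ),
    'footnote': (
        "Convert reference marks to [^N] and put definitions at batch end as "
        "'[^N]: content'. Keep numbering unique and continuous."
    ),
    'mixed': (
        "If a page-bottom definition exists, use [^N]; otherwise keep the "
        "reference as <sup>N</sup>."
    ),
}

def build_merge_prompt(base_prompt, note_style):
    """Append style-specific annotation rules to the AI merge prompt."""
    # endnote_chapter shares the endnote_book rules
    key = 'endnote_book' if note_style == 'endnote_chapter' else note_style
    extra = NOTE_INSTRUCTIONS.get(key)
    return base_prompt if extra is None else base_prompt + "\n\n" + extra
```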
The whole point of the dual system is to prevent a nasty failure mode: the AI hallucinating footnote definitions for endnote references — making up content that simply doesn't exist on the current page.
6. Stage 5: Post-Merge Reconciliation
After merging produces a complete Markdown document, two reconciliation passes recover and link the annotations.
6.1 Recovering Dropped Footnotes
The AI merge sometimes drops footnote definitions ([^N]: content) while keeping the references ([^N]). Recovery is straightforward:
1. Scan merged markdown for [^N] references → referenced_ids
2. Scan merged markdown for [^N]: definitions → defined_ids
3. orphaned = referenced_ids - defined_ids
4. For each orphaned ID:
a. Search raw OCR texts for the matching [^N]: definition
b. Also look for circled-number format (①②③)
c. Found it? Append to merged markdown
The raw OCR output (before AI merging) is the ground truth. The AI may have reorganized or dropped things, but the original OCR text has them verbatim.
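The recovery scan boils down to a set difference plus a search over the raw pages. A minimal sketch (`recover_orphaned_footnotes` is a hypothetical name):

```python
import re

def recover_orphaned_footnotes(merged_md, raw_pages):
    """Re-append footnote definitions the AI merge dropped.

    raw_pages: list of per-page OCR strings (the pre-merge ground truth).
    """
    # References like [^3] but not definitions like [^3]:
    referenced = set(re.findall(r'\[\^(\d+)\](?!:)', merged_md))
    defined = set(re.findall(r'^\[\^(\d+)\]:', merged_md, re.MULTILINE))
    recovered = []
    for note_id in sorted(referenced - defined, key=int):
        pattern = re.compile(r'^\[\^%s\]:\s*(.+)$' % note_id, re.MULTILINE)
        for page in raw_pages:
            m = pattern.search(page)
            if m:
                recovered.append('[^%s]: %s' % (note_id, m.group(1).strip()))
                break
    if recovered:
        merged_md = merged_md.rstrip() + '\n\n' + '\n'.join(recovered) + '\n'
    return merged_md
```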
6.2 Endnote Reconciliation: The Cross-Chapter Problem
Endnotes are trickier. The reference (<sup>3</sup> in Chapter 2) and the definition (note #3 in the notes chapter) are in different parts of the document. After EPUB chapter splitting, they'll be in separate XHTML files. Standard Markdown footnotes need reference and definition in the same block — no cross-chapter support.
Our solution: HTML data attributes as a cross-chapter linking protocol.
Step 1: For each <sup>N</sup>, look up its definition in EndnoteDatabase
        (by chapter + note number)
Step 2: Replace the reference with:
        <sup data-en-id="42" data-en-num="3">3</sup>
Step 3: Collect all definitions into a ## Notes section:
        <div data-endef-id="42">3. Definition text here</div>
Step 4: Use a global auto-increment ID (en_counter) to prevent collisions
        when different chapters reuse the same note numbers
Why data attributes? Because markdown.markdown() preserves raw HTML. These tags pass through Markdown rendering untouched, then get converted to real clickable links during EPUB generation.
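A minimal sketch of that replacement, with the database reduced to a plain dict per chapter (the real EndnoteDatabase adds fuzzy chapter matching on top):

```python
import re
from itertools import count

def link_endnotes(chapter_md, notes_for_chapter, en_counter):
    """Replace <sup>N</sup> with data-attribute tags; return (md, definitions).

    notes_for_chapter: {note_number: definition_text} for this chapter.
    en_counter: an itertools.count shared across chapters (globally unique IDs).
    """
    defs = []

    def _sub(m):
        num = m.group(1)
        if num not in notes_for_chapter:
            return m.group(0)  # leave unmatched refs for the global fallback tier
        en_id = next(en_counter)
        defs.append('<div data-endef-id="%d">%s. %s</div>'
                    % (en_id, num, notes_for_chapter[num]))
        return '<sup data-en-id="%d" data-en-num="%s">%s</sup>' % (en_id, num, num)

    return re.sub(r'<sup>(\d+)</sup>', _sub, chapter_md), defs
```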
6.3 Two-Tier Matching
Tier 1: Chapter-aware
- Extract chapter structure from merged markdown
- For each chapter, look up matching notes in EndnoteDatabase
- Fuzzy chapter name matching (exact → normalized → substring)
Tier 2: Global fallback
- <sup>N</sup> references not matched by Tier 1
- Flatten all EndnoteDatabase chapters into one pool
- Match by note number only, ignore chapter
Tier 2 is the safety net for when OCR produces slightly different chapter names than the notes section.
6.4 Tracking Match Rates
Every reconciliation run logs diagnostics:
pre_sup_count = len(re.findall(r'<sup>(\d+)</sup>', markdown))
# ... after reconciliation ...
post_sup_count = len(re.findall(r'<sup>(\d+)</sup>', result))
if pre_sup_count:  # guard: no references means nothing to match
    match_rate = (pre_sup_count - post_sup_count) / pre_sup_count * 100
    print(f"Match rate: {match_rate:.1f}%")
In production, well-structured books hit 85–95% match rates. The misses are mostly OCR errors in note numbers or chapter names that are just too different.
6.5 Cleaning Up Leaked Notes Text
Even though endnote pages are excluded before merging, the AI sometimes still pulls in some notes content. After reconciliation, we clean it up with a position-based safety check:
# Only remove if the notes heading appears after 30% (endnote_book)
# or 50% (other styles) of the document
position_threshold = 0.3 if notes_meta.note_style == 'endnote_book' else 0.5
if heading_match.start() < len(markdown) * position_threshold:
return markdown # Don't touch it — probably a real chapter
This prevents nuking a chapter called "Notes on Method" that shows up early in a social science book.
6.6 Footnotes in Headings
One more gotcha: footnote references inside chapter headings. After EPUB splitting, ## Introduction[^1] creates a cross-chapter footnote that will break.
Fix: convert heading footnotes to inline italic text:
Before: ## Introduction[^1]
[^1]: Written in 1985
After: ## Introduction
*Written in 1985*
Simple, but it works.
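That conversion is a small regex pass. A sketch, assuming each definition sits on its own line somewhere in the document:

```python
import re

def inline_heading_footnotes(md):
    """Convert '## Title[^1]' plus '[^1]: text' into '## Title' plus '*text*'."""
    heading_re = re.compile(r'^(#{1,6}[ \t]+.*?)\[\^(\d+)\][ \t]*$', re.MULTILINE)
    for m in list(heading_re.finditer(md)):
        note_id = m.group(2)
        def_re = re.compile(r'\n?^\[\^%s\]:[ \t]*(.+)$' % note_id, re.MULTILINE)
        d = def_re.search(md)
        replacement = m.group(1).rstrip()
        if d:
            # Drop the definition line and inline its text under the heading
            md = def_re.sub('', md, count=1)
            replacement += '\n\n*%s*' % d.group(1).strip()
        md = md.replace(m.group(0), replacement, 1)
    return md
```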
7. Stage 6: EPUB Output with Clickable Links
Last step: turn the processed Markdown into EPUB-standard interactive footnotes and endnotes.
7.1 Footnotes: Python-Markdown Does the Work
Standard footnotes ([^N] / [^N]: definition) go through the Python-Markdown footnotes extension, which outputs EPUB-compatible HTML:
<!-- Reference in body -->
<sup class="footnote-ref"><a href="#fn-1">1</a></sup>
<!-- Definition at chapter end -->
<div class="footnote">
<ol>
<li id="fn-1"><p>Definition text <a class="footnote-backref" href="#fnref-1">↩</a></p></li>
</ol>
</div>
Tap the superscript, jump to the note. Tap ↩, jump back. Bidirectional.
7.2 Endnotes: Custom endnotes.xhtml
Endnotes can't use the footnotes extension because references and definitions are in different chapter files. The EPUB generator builds it from scratch:
- Scan all chapter HTML for data-en-id attributes → build a {en_id: chapter_filename} map
- Extract definitions from the notes chapter's data-endef-id divs
- Convert references to cross-file links:
  <sup class="endnote-ref"><a id="enref-42" href="endnotes.xhtml#endef-42">[3]</a></sup>
- Generate endnotes.xhtml with back-links:
  <div class="endnote-item" id="endef-42">
    <p class="endnote-text">
      <span class="endnote-num">[3]</span> Definition text
      <a class="endnote-backref" href="chapter_02.xhtml#enref-42">↩</a>
    </p>
  </div>
- Add endnotes.xhtml to the EPUB spine and table of contents
7.3 CSS + Dark Mode
Both note types get dedicated styling tuned for e-readers:
/* Footnote ref — small, stays out of the way */
.footnote-ref {
font-size: 0.75em;
vertical-align: super;
line-height: 0;
}
/* Footnote section — separated from body */
div.footnote {
margin-top: 2em;
padding-top: 1em;
border-top: 1px solid #ccc;
font-size: 0.9em;
}
/* Endnote ref link */
.endnote-ref a {
color: #0066cc;
font-size: 0.75em;
vertical-align: super;
}
Dark mode handled:
@media (prefers-color-scheme: dark) {
div.footnote { border-top-color: #555; }
.endnote-ref a,
.endnotes-section .endnote-num,
a.endnote-backref { color: #6db3f2; }
}
8. Lessons Learned
8.1 Regex First + AI Fallback > Pure AI
We use regex as the primary endnote parser, AI as fallback. Not because regex is "better" — but because the cost-reliability tradeoff is massively in regex's favor for structured data:
| Approach | Cost per book | Determinism | Latency |
|---|---|---|---|
| Regex only | $0.00 | 100% deterministic | < 10ms |
| AI only | ~$0.02 | Non-deterministic | ~3 seconds |
| Regex → AI fallback | $0.00–0.02 | Deterministic when possible | 10ms–3s |
In practice, regex handles ~70% of books just fine. Most of the time it's both faster and more reliable.
8.2 Position Threshold Prevents Nuking Real Chapters
Deleting a notes section from the wrong spot can destroy real content. The position threshold (30–50% of document length) is a simple safety check. A heading called "Notes" on page 15 of a 300-page book is almost certainly a real chapter. The same heading on page 280 is almost certainly endnotes.
8.3 Fixing Broken AI-Generated HTML
The AI merge occasionally spits out mangled HTML — half-truncated tags, dangling attributes:
</sup>-id="44" data-en-num="7">7</sup>
A targeted regex cleans just these broken patterns without touching anything valid:
pattern = r'</(\w+)>-?[\w-]+="[^"]*"(?:\s+[\w-]+="[^"]*")*>[^<]*</\1>'
Only matches structurally damaged tags (closing tag + orphaned attributes). Won't touch normal HTML.
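Wrapped as a helper and applied to the example above (a sketch):

```python
import re

# Closing tag + orphaned attributes + stray text + matching closing tag
BROKEN_TAG_RE = re.compile(
    r'</(\w+)>-?[\w-]+="[^"]*"(?:\s+[\w-]+="[^"]*")*>[^<]*</\1>'
)

def strip_broken_tags(html):
    """Remove structurally damaged tag fragments the AI merge sometimes emits."""
    return BROKEN_TAG_RE.sub('', html)
```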
8.4 Checkpoint Everything
The whole pipeline supports checkpointing. Notes metadata and the endnote database get persisted to cloud storage right after extraction. If the merge step dies halfway through a 500-page book, it picks up from the last successful batch. No re-running OCR, no re-extracting notes.
9. The Full Pipeline
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: OCR + Detection │
│ Page image → Gemini Flash → Markdown + NOTES_META │
│ Extra cost: $0 (piggybacks on OCR) │
└──────────────────────────┬──────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Stage 2: Aggregate & Classify │
│ Per-page metadata → Statistical voting → NotesMeta │
│ Output: note_style, primary_ref_format, section_pages │
└──────────────────────────┬──────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Stage 3: Endnote Pre-Extraction (endnote_book only) │
│ Notes pages → Regex → AI fallback → EndnoteDatabase │
│ Safety: < 3 notes extracted = switch to AI │
└──────────────────────────┬──────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Stage 4: Merge with Note Instructions │
│ Inject rules into AI merge prompt per note_style │
│ Footnotes → [^N], Endnotes → <sup>N</sup> │
└──────────────────────────┬──────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Stage 5: Post-Merge Reconciliation │
│ Recover dropped footnotes + Match endnotes + Fix headings │
│ Cross-chapter links via data-en-id / data-endef-id │
└──────────────────────────┬──────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Stage 6: EPUB Generation │
│ Footnotes → Python-Markdown extension │
│ Endnotes → Custom endnotes.xhtml + bidirectional links │
│ CSS with dark mode │
└─────────────────────────────────────────────────────────────┘
10. Wrapping Up
Footnotes and endnotes are where most PDF-to-EPUB tools quietly fall apart — and where the gap between "extracted text" and "readable ebook" is widest. The problem spans detection, parsing, cross-chapter linking, and rendering, each with its own bag of tricks.
Our 6-stage pipeline turns what used to be manual editing work into a fully automated process. The core ideas:
- Detect early, classify statistically. Per-page LLM detection + book-level voting, at zero marginal cost.
- Different notes, different paths. No one-size-fits-all.
- Regex first, AI fallback. Use deterministic parsing when you can, pay for AI only when you have to.
- Data attributes bridge the gap. A simple HTML protocol connects Markdown processing to EPUB's multi-file reality.
- Reconcile, don't prevent. AI merging will drop things. Build recovery using raw OCR as ground truth.
End result: readers get footnotes and endnotes that actually work — tap a superscript, read the note, tap back. Like a physical book, but better.
How It's Holding Up in Production
We've been running this pipeline in production and monitoring results in real time. Based on internal testing and live user conversions, the system handles the majority of books well — especially well-structured academic texts and translations where note formatting follows common conventions.
That said, it's not perfect. We still see cases where things go wrong: unusual note layouts that the classifier misjudges, OCR errors that throw off the matching, chapter names so different between the main text and notes section that even fuzzy matching can't bridge the gap. Edge cases like notes embedded in tables, multi-level nested notes, or books that switch between note styles mid-document also trip up the pipeline occasionally.
We're actively collecting these failure cases from production logs and user reports, categorizing them, and feeding fixes back into the pipeline. It's an ongoing process — each round of fixes tends to surface a new class of edge cases — but the trajectory is clearly improving.
Want to try it? Convert your PDF at PDF2EPUB.ai — footnotes and endnotes handled automatically.