A PDF that changes based on how it's read

Original link: https://sgaud.com/texts/pdf

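PDF is inherently visual: it stores coordinates rather than structural information, which forces large language models (LLMs) and text extraction tools to guess at the document's original formatting. This leads to common extraction errors such as broken sentences and lost hierarchy.

To address this, the author uses a long-standing but underused feature of the PDF specification: replacement text for marked content. By embedding structured Markdown in this hidden layer, the document becomes "adaptive".

When a human opens the PDF, they see a standard, properly formatted document; when a machine or LLM extracts the text, it gets clean, structured Markdown rather than raw visual data. The approach needs no new file extension and no second copy to maintain; the same file yields different output depending on who reads it. Testing confirmed that mainstream extractors and LLMs read the embedded structure correctly, which significantly improves the information density and reliability of AI workflows without increasing the token count or changing what human readers see.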

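A Hacker News discussion examined "Adaptive PDFs", a project that aims to optimize PDF content for extraction by large language models (LLMs) rather than for human reading. Although the creator says the PDF "changes based on who is reading it", commenters pointed out that the document's visual content stays the same, while its underlying structure provides machine-readable output (Markdown) for AI agents.

The thread highlighted several key points:

* **The "agent era":** Users argued that optimizing documents for AI agents is becoming as important as the shift from desktop to mobile was.
* **Security risks:** Critics warned that the technique could be weaponized. By embedding hidden instructions in a PDF, malicious actors could mount prompt injection attacks against automated systems (for example, tricking an LLM that is processing an electricity bill into performing unauthorized actions).
* **Efficiency concerns:** Some argued that the approach merely works around the long-standing inefficiency of converting structured data into PDF and then laboriously extracting it again.

The project is available at [github.com/iminoaru/adaptivepdf](https://github.com/iminoaru/adaptivepdf). It sparked a broader debate over whether the same technique could eventually be used to create "human-only" documents that resist AI scraping.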

Original article

PDF is a visual format. It stores instructions for where to draw glyphs on a page. The spec does support Tagged PDF, a structure tree that marks headings, paragraphs, and lists. Some domains rely on it, such as government accessibility mandates and enterprise publishing pipelines. But most PDFs you actually encounter are untagged. LaTeX, Chrome's print-to-PDF, and most export tools don't produce tags. So what you get is coordinates and font sizes. Text extractors read the draw commands left to right, top to bottom, and hope for the best.
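To make the "left to right, top to bottom" strategy concrete, here is a minimal sketch of a naive extractor. The `TextRun` type, the coordinates, and the font sizes are hypothetical stand-ins for real draw commands, and the y axis is measured from the top of the page for simplicity; actual extractors are considerably more involved.

```python
from typing import NamedTuple


class TextRun(NamedTuple):
    """One hypothetical draw command: a string placed at page coordinates."""
    x: float
    y: float  # distance from the top of the page (a simplification)
    font_size: float
    text: str


def naive_extract(runs: list[TextRun]) -> str:
    """Order runs top to bottom, then left to right, one run per output line."""
    ordered = sorted(runs, key=lambda r: (r.y, r.x))
    # Font size is the only hint that a run is a heading, and it is dropped here.
    return "\n".join(r.text for r in ordered)


# Hypothetical page: a heading, then a sentence the layout engine wrapped mid-word.
page = [
    TextRun(72, 132, 11, "edule. Three critical services were"),
    TextRun(72, 100, 16, "Overview"),
    TextRun(72, 118, 11, "Cloud migration completed ahead of sch"),
]

print(naive_extract(page))
```

The output is three plain lines: nothing marks "Overview" as a heading, and the word split across the wrap stays split.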

This didn't matter when humans were the only readers. But now most PDFs end up in an LLM. We upload them to ChatGPT, ask Claude to summarize them, pipe them through parsers. And every single one of these tools is fighting the same problem: reconstructing structure from a format that never carried it. An LLM sees "Project Alpha\nLed a team of 5 engineers\nto deliver the" and has to guess where the heading ends and the sentence continues. Sometimes it gets it right. Often it doesn't.

I wanted to make a PDF where humans see the formatted document but machines extract clean markdown. Same file, no new extension. Just a .pdf.

How It Works

There is a property in the PDF spec (since PDF 1.4, 2001) that lets you define replacement text for marked content. Renderers ignore it; they draw whatever the content stream says. But text extractors that support it return the replacement instead of the visual text. In my testing, PyMuPDF and Poppler both honored it. Support varies across tools and versions, but the major open source extractors handle it.

It was designed for things like ligatures and characters that don't naturally map to Unicode. A visual glyph "fi" should extract as two characters, "f" and "i". It never got adopted for anything larger.

We use it at the document level. We attach replacement text to the content stream via marked-content sequences, so extractors that support the property return structured markdown instead of raw visual text. The PDF renders identically: one file, two completely different outputs depending on who's reading it.
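As a rough illustration of what such a marked-content sequence can look like, the sketch below wraps some drawing operators in `BDC` ... `EMC` with a replacement-text entry. It assumes the property in question is `/ActualText` attached to a `/Span` tag, which the text above does not name explicitly, and the helper names are hypothetical. It only builds the content-stream text and is not the author's tool.

```python
def pdf_text_string(text: str) -> str:
    """Encode text as a PDF hex string in UTF-16BE with a leading byte-order mark."""
    return "<" + ("\ufeff" + text).encode("utf-16-be").hex().upper() + ">"


def wrap_with_replacement(content_stream: str, replacement: str) -> str:
    """Wrap drawing operators in a marked-content sequence carrying /ActualText.

    Assumption: /Span with /ActualText is the property being used.
    """
    props = f"<< /ActualText {pdf_text_string(replacement)} >>"
    return f"/Span {props} BDC\n{content_stream}\nEMC"


# Hypothetical drawing operators for a heading; a renderer paints these unchanged.
visual = "BT /F1 16 Tf 72 720 Td (Overview) Tj ET"
print(wrap_with_replacement(visual, "## Overview"))
```

The drawing operators are left untouched, which is why the page looks the same; only an extractor that honors the replacement entry would return "## Overview" in place of the visual text.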

What Extractors Actually See

Same PDF, same visual appearance. Here's what PyMuPDF extracts from each.

Normal PDF:

Quarterly Infrastructure Report
Overview
Cloud migration completed ahead of sch
edule. Three critical services were
moved to the new cluster.
Key Metrics
Uptime: 99.97%
Latency: 42ms avg (down from 68ms)
Cost: $12,400/mo (down 34%)
Action Items
Migrate remaining batch jobs by Q3
Set up automated failover for db-west
Review cost allocation per team

Smart PDF:

# Quarterly Infrastructure Report

## Overview

Cloud migration completed ahead of schedule. Three critical services were moved to the new cluster.

## Key Metrics

| Metric  | Value                     |
|---------|---------------------------|
| Uptime  | 99.97%                    |
| Latency | 42ms avg (down from 68ms) |
| Cost    | $12,400/mo (down 34%)     |

## Action Items

- Migrate remaining batch jobs by Q3
- Set up automated failover for db-west
- Review cost allocation per team

Both files look identical in Preview, Adobe, any PDF viewer. But the normal extraction has no hierarchy, broken line wraps mid-sentence, bullet points indistinguishable from paragraphs, and a table flattened into lines. The smart extraction has # headings, markdown tables, - bullets, and sentences that don't break mid-word. An LLM doesn't have to guess that "Key Metrics" is a section header or that those three lines are a list. It's explicit.
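To see what "explicit" means for a downstream consumer, here is a small sketch that counts the structural markers a program can read directly from extracted text. The function name and the shortened sample strings are illustrative only, and the regular expressions cover just the markdown constructs shown in this article.

```python
import re


def structure_summary(text: str) -> dict[str, int]:
    """Count markdown headings, bullets, and table rows (header row included)."""
    counts = {"headings": 0, "bullets": 0, "table_rows": 0}
    for line in text.splitlines():
        if re.match(r"#{1,6} ", line):
            counts["headings"] += 1
        elif re.match(r"[-*] ", line):
            counts["bullets"] += 1
        elif line.startswith("|") and not re.fullmatch(r"\|[-|: ]+\|", line):
            # Skip the |---|---| separator row; count header and data rows.
            counts["table_rows"] += 1
    return counts


# Shortened versions of the two extractions shown earlier.
normal = (
    "Quarterly Infrastructure Report\nKey Metrics\nUptime: 99.97%\n"
    "Action Items\nMigrate remaining batch jobs by Q3\n"
)
smart = (
    "# Quarterly Infrastructure Report\n\n## Key Metrics\n\n"
    "| Metric  | Value  |\n|---------|--------|\n| Uptime  | 99.97% |\n\n"
    "## Action Items\n\n- Migrate remaining batch jobs by Q3\n"
)

print(structure_summary(normal))
print(structure_summary(smart))
```

The normal extraction yields zero for every category, so any structure has to be inferred; the smart extraction exposes headings, a table, and a bullet without guessing.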

Benchmarks

I converted several PDFs to smart PDFs using our tool, then extracted text from both versions separately with PyMuPDF's get_text() and https://www.pdf2go.com/. Both extractors returned markdown for the smart versions. Token counts were measured with tiktoken (cl100k_base). The benchmark script is in the repo.

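| Document       | Pages | Size Δ | Normal Tokens | Smart Tokens |
|----------------|-------|--------|---------------|--------------|
| Resume         | 1     | +15.7% | 650           | 668          |
| Textbook       | 417   | -8.5%  | 193,064       | 195,858      |
| Novel Chapter  | 38    | +4.7%  | 16,472        | 15,958       |
| Research paper | 18    | +2.5%  | 8,082         | 7,897        |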

Token counts are roughly the same. The advantage isn't fewer tokens. It's that the same tokens now carry structure. "## Overview" and "Overview" cost nearly the same, but one tells the machine what it's looking at. The information density per token goes up without the token count going up.
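The "roughly the same" claim can be checked against the benchmark figures with a few lines of arithmetic. This sketch only recomputes relative change from the token counts reported above; the function and variable names are my own.

```python
# (normal tokens, smart tokens), copied from the benchmark results above
token_counts = {
    "Resume": (650, 668),
    "Textbook": (193_064, 195_858),
    "Novel Chapter": (16_472, 15_958),
    "Research paper": (8_082, 7_897),
}


def relative_change(normal: int, smart: int) -> float:
    """Percentage change in token count going from the normal to the smart PDF."""
    return (smart - normal) / normal * 100


for name, (normal, smart) in token_counts.items():
    print(f"{name}: {relative_change(normal, smart):+.1f}%")
```

Every document stays under 4% in either direction, and two of the four smart versions come out slightly smaller than their normal counterparts.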

Size overhead is single-digit percent for most files. The textbook shrank because PyMuPDF's save with garbage=3 removes unused PDF objects. That's a general optimization, not specific to the technique.

I uploaded smart PDFs to both ChatGPT and Claude and asked them to copy-paste the exact raw text they see, character for character. Both returned markdown: #, ##, - bullets. This isn't fully conclusive on its own, since LLMs do structural inference and tools like Docling can produce markdown from normal PDFs via layout analysis. But the output matched our embedded layer exactly, including formatting choices no layout heuristic would reproduce identically.

An Adaptive Document

What you end up with is a document that adapts to its reader. A human opens it and sees the formatted PDF they're used to. Fonts, layout, spacing, everything normal. A machine reads it and gets clean markdown. Headings, lists, structure. One file, no separate versions, no conversion step. It just works depending on who's looking.

You don't manage this. You don't maintain two copies. The document itself decides what to present based on how it's being consumed.

I'm actively exploring this further and looking into developing an extension for Google Docs to streamline it. This was my very first iteration on the idea.
