When the audience is half-LLM

For most of the history of technical documentation, the reader was a human. Now they often aren’t, or not entirely. A meaningful fraction of the people who would have read docs.pytorch.org never see the page. They see whatever an LLM pulled out of it: a paragraph in a chat reply, a snippet behind a code completion, a chunk buried inside a longer answer. The page is still there. The reader is not.

This breaks a lot of conventional doc wisdom in ways that aren’t obvious until you start looking.

Chunks, not pages

The first thing it breaks is information architecture as we usually mean it. Hierarchy, sidebars, breadcrumbs: those are for people who navigate. Models don’t navigate; they retrieve. They grab the chunk that scored highest against the query and run with it. So the question is no longer “how do I help a reader find this page” but “is each chunk, severed from its page, still useful?” If your conceptual overview lives 800 words above the code sample the model retrieved, the code arrives without the concept, and the model fills the gap with whatever it remembers from pretraining. That gap is where hallucinations live.

[Figure: two panels. Human reader: sees the whole page. Model: sees one retrieved chunk.]
A page with surrounding context is not the same artifact as the single chunk a retriever returns.
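
One way to make a chunk survive being severed from its page is to carry the heading trail along with it. The sketch below is a minimal illustration, not any particular pipeline's chunker: it assumes Markdown-style `#` headings, and the names `Chunk` and `chunk_by_heading` are hypothetical.

```python
from dataclasses import dataclass

FENCE = "`" * 3  # marker that opens and closes a fenced code block


@dataclass
class Chunk:
    heading_path: list[str]  # headings above this chunk, outermost first
    body: str

    def render(self) -> str:
        """Text to index: the heading trail first, then the body."""
        trail = " > ".join(self.heading_path)
        return f"[{trail}]\n{self.body}" if trail else self.body


def chunk_by_heading(text: str) -> list[Chunk]:
    """Split on '#'-style headings, remembering the heading trail per chunk."""
    chunks: list[Chunk] = []
    trail: list[str] = []
    buffer: list[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(list(trail), body))
        buffer.clear()

    for line in text.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # '#' inside code is a comment, not a heading
        if line.startswith("#") and not in_fence:
            flush()
            level = len(line) - len(line.lstrip("#"))
            # Drop headings at this depth or deeper, then push the new one.
            del trail[level - 1:]
            trail.append(line.lstrip("#").strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

Indexing `chunk.render()` instead of the bare body means a retrieved code sample arrives with at least the names of the sections it sat under. It does not replace restating the concept in the chunk itself.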

Recall doesn’t measure what you think

The second thing it breaks is what success looks like. A paper I co-authored last year, “The 99% Success Paradox,” made this concrete. Standard retrievers can report >99% success rates on common benchmarks while selecting documents no better than chance. The dashboards look fine. The answers don’t.

[Figure: Same retriever, two metrics. Coverage at K=100: >99% (axis 0 to 100%). Selectivity (BOR): ≈ 0 (axis 0 to ~5 bits). On 20 Newsgroups, BM25 and SPLADE report >99% coverage at K=100 while chance-corrected selectivity sits near zero.]
The same retriever scores near-perfect by one metric and near-random by another. Which one you watch decides what you ship.

Recall is not the right scoreboard for docs that feed an LLM, and a lot of teams haven’t realized that yet. You need a chance-corrected metric, or an eval set built so that the final answer comes out wrong whenever the wrong chunks are retrieved. Otherwise you’re flying blind on the failure mode that matters most.
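
To see why a raw coverage number can flatter a retriever, compare it with what random selection would score on the same corpus. The sketch below is one simple chance correction, offered as an assumption: it is not necessarily the selectivity metric defined in the paper, the function names are hypothetical, and the corpus numbers are illustrative rather than measured.

```python
import math


def chance_coverage(n_docs: int, n_relevant: int, k: int) -> float:
    """Probability that k documents drawn at random (without replacement)
    include at least one relevant document."""
    if not 0 <= k <= n_docs:
        raise ValueError("k must be between 0 and n_docs")
    miss = 1.0  # probability that every draw so far was irrelevant
    for i in range(k):
        miss *= max(n_docs - n_relevant - i, 0) / (n_docs - i)
    return 1.0 - miss


def bits_over_chance(observed: float, chance: float) -> float:
    """Log-ratio of observed coverage to chance coverage, in bits.
    Near zero means the retriever is doing about as well as random draws."""
    return math.log2(observed / chance)


# Illustrative numbers only: 20,000 documents, 1,000 relevant to the query,
# top 100 retrieved, and a dashboard that reports 99.5% coverage.
chance = chance_coverage(n_docs=20_000, n_relevant=1_000, k=100)
gain = bits_over_chance(observed=0.995, chance=chance)
print(f"chance coverage: {chance:.4f}, gain over chance: {gain:.4f} bits")
```

With a relevant set this large and K this generous, random draws already clear 99% coverage, so a 99.5% dashboard number carries almost no information about whether the retriever is selective.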

Repetition is load-bearing now

The third thing it breaks is the style guide. “Don’t repeat yourself” was good advice for readers who could scroll back. It is bad advice for readers who only see the fragment they retrieved. A chunk that says “as described in the previous section” is useless out of context. A chunk that briefly restates the context is useful in any setting. Repetition is load-bearing now. So are version awareness, code blocks that include their imports, and prose that names the API instead of aliasing it as “the function.”
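
Some of these rules can be checked mechanically. The following is a rough lint sketch under stated assumptions: the phrase lists are illustrative and far from exhaustive, the name check is a heuristic that ignores scoping, and `lint_chunk` and `unbound_names` are hypothetical names rather than part of any existing tool.

```python
import ast
import builtins
import re

# Phrases that point at text the retrieved chunk will not contain.
DANGLING = re.compile(
    r"\b(as (described|shown|mentioned|noted|discussed) (above|below|earlier)"
    r"|in the (previous|preceding|next|following) section"
    r"|see (above|below))\b",
    re.IGNORECASE,
)
# Prose that aliases an API instead of naming it.
VAGUE_API = re.compile(r"\b(the|this) (function|method|class)\b", re.IGNORECASE)


def unbound_names(code: str) -> set[str]:
    """Names a Python snippet reads but never imports, assigns, or defines."""
    bound = set(dir(builtins))
    used: set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
        elif isinstance(node, ast.Name):
            (used if isinstance(node.ctx, ast.Load) else bound).add(node.id)
    return used - bound


def lint_chunk(prose: str, code: str = "") -> list[str]:
    """Return human-readable problems that make a chunk hard to use alone."""
    problems = [f"dangling reference: {m.group(0)!r}" for m in DANGLING.finditer(prose)]
    problems += [f"unnamed API: {m.group(0)!r}" for m in VAGUE_API.finditer(prose)]
    missing = unbound_names(code) if code else set()
    if missing:
        problems.append(f"names used but never imported or defined: {sorted(missing)}")
    return problems
```

Run against a chunk whose prose reads “As described in the previous section, call the function.” and whose code is `x = torch.zeros(3)`, this reports three problems: the dangling reference, the unnamed API, and the missing `torch` import.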

Drift matters more

If a model is going to ground its answer in your docs, the bar for accuracy moves up. A stale code sample that a human would notice and work around becomes an authoritative-sounding lie when it’s quoted inside a chatbot reply. The half-life of “good enough” doc content is shorter than it used to be.
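
One cheap guard against drift is to execute every code sample on a schedule and flag the ones that stop working. The sketch below assumes pages are stored as text with fenced `python` blocks and that the samples are trusted enough to run in a sandboxed CI job; `find_stale_samples` is a hypothetical helper, not an existing tool.

```python
import re

FENCE = "`" * 3  # marker that opens and closes a fenced code block
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def find_stale_samples(page_text: str) -> list[tuple[int, str]]:
    """Run each fenced Python sample in a fresh namespace.

    Returns (sample index, error summary) for every sample that fails to
    compile or raises. Only run this on trusted content: exec() executes
    arbitrary code.
    """
    failures: list[tuple[int, str]] = []
    for index, match in enumerate(CODE_BLOCK.finditer(page_text)):
        source = match.group(1)
        namespace = {"__name__": "__doc_sample__"}  # nothing leaks between samples
        try:
            exec(compile(source, f"<sample {index}>", "exec"), namespace)
        except Exception as error:  # includes SyntaxError raised by compile()
            failures.append((index, f"{type(error).__name__}: {error}"))
    return failures
```

Because each sample runs in its own namespace, a sample that leans on imports from an earlier block fails here too, which is the same property a retrieved chunk needs.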

What stays the same

None of this is a call to rewrite everything. Most of what makes docs good for humans makes them good for models too: precise language, working code, a clear answer to the question someone actually asked. What changes is the assumptions you can lean on. You can no longer assume the reader will see what’s around the thing they retrieved. You can no longer trust your old metrics. And you can no longer treat the page as the unit of quality when the chunk is what the model actually reads.

At PyTorch I treat the doc corpus as something an LLM is going to read at scale, every day, on behalf of developers I’ll never see. That changes what gets prioritized in audits, what gets added to the publishing pipeline, and which failure modes I worry about most. The job is still writing docs. The reader changed.