HTML is for browsers; Markdown is for brains. LLMs are trained heavily on GitHub repositories, StackOverflow, and technical documentation. This makes Markdown their “native” format. They “think” in Markdown.
## Token Efficiency
Markdown is less verbose than HTML.
- HTML heading: `<h1>Title</h1>` (14 characters)
- Markdown heading: `# Title` (7 characters)
- HTML list: `<ul><li>Item</li></ul>` (22 characters)
- Markdown list: `- Item` (6 characters)
Across a 2,000-word document, those extra markup characters compound into thousands of extra tokens. A clean Markdown file consumes fewer tokens than its HTML equivalent, allowing more content to fit into the context window.
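Exact token counts vary by tokenizer, but the gap is easy to measure. A minimal sketch, assuming OpenAI's tiktoken library with its cl100k_base encoding:

```python
# Measure token counts for equivalent HTML and Markdown snippets.
# cl100k_base is used as a representative encoding; counts differ
# across models and tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for html, md in [
    ("<h1>Title</h1>", "# Title"),
    ("<ul><li>Item</li></ul>", "- Item"),
]:
    print(f"HTML     {html!r}: {len(enc.encode(html))} tokens")
    print(f"Markdown {md!r}: {len(enc.encode(md))} tokens")
```

The angle brackets, tag names, and closing tags each tend to split into their own tokens, which is where the HTML overhead comes from.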
## Semantic Clarity
Headers (`#`, `##`) in Markdown provide a clearer hierarchy for the model’s attention mechanism than nested `<div>` tags with CSS classes.
When a model reads HTML, it has to infer an element’s importance from markup and class names designed for visual presentation. When it reads Markdown, the importance is explicit in the syntax itself.
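To make the asymmetry concrete, here is a small sketch using the markdownify package (one HTML-to-Markdown converter among several; the class name is invented for illustration). A semantic heading tag converts to an explicit `#` marker, while a styled `<div>` carrying the same text loses its rank entirely:

```python
# Convert two visually identical headings to Markdown.
# markdownify is an arbitrary converter choice; "title-xl" is a made-up class.
from markdownify import markdownify, ATX

semantic = "<h1>Quarterly Report</h1>"
styled = '<div class="title-xl">Quarterly Report</div>'

print(markdownify(semantic, heading_style=ATX).strip())
# -> # Quarterly Report   (hierarchy preserved)
print(markdownify(styled, heading_style=ATX).strip())
# -> Quarterly Report     (hierarchy gone: a div says nothing about rank)
```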
## “Markdown-First” Content Strategy
Moving your content strategy to “Markdown-first” ensures your pages are digested cleanly by ingestion pipelines. Even if you render to HTML for users, your source of truth should be Markdown.
When an agent requests your page, check whether it accepts `text/markdown`. If it does, send the raw `.md` file. You save bandwidth, they save compute, and your ranking improves. It is a win-win.
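A minimal sketch of that content negotiation, assuming Flask and the python-markdown package (both illustrative choices, as are the route and the `content/` directory):

```python
# Serve raw Markdown to agents that accept it; render HTML for everyone else.
# Flask, python-markdown, and the content/ layout are illustrative choices.
from pathlib import Path

import markdown
from flask import Flask, Response, abort, request

app = Flask(__name__)

@app.route("/docs/<page>")
def docs(page: str):
    path = Path("content") / f"{page}.md"  # the Markdown source of truth
    if not path.is_file():
        abort(404)
    source = path.read_text(encoding="utf-8")

    # Content negotiation: agents that ask for text/markdown get the raw file.
    if "text/markdown" in request.headers.get("Accept", ""):
        return Response(source, mimetype="text/markdown")

    # Browsers get the rendered HTML derived from the same source.
    return Response(markdown.markdown(source), mimetype="text/html")
```

Flask’s `request.accept_mimetypes` offers finer-grained negotiation; a substring check keeps the sketch short.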
## The “Raw Text” Preference
When assembling LLM training data, engineers often strip HTML using tools like BeautifulSoup’s `get_text()`.
This process destroys structure. Tables become garbled text. Headers become normal sentences.
Markdown survives this process. Markdown is text; it retains its structure even when treated as a raw string. This means training data derived from Markdown sources is higher fidelity than data derived from HTML sources. If you want the model to learn your structure, serve it in the format that survives the cleaning pipeline.
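To see the difference, a toy sketch of that cleaning step (the pricing snippet is invented for illustration):

```python
# What an HTML-stripping cleanup does to structure, versus the same
# content treated as a raw Markdown string.
from bs4 import BeautifulSoup

html = "<h1>Pricing</h1><ul><li>Basic: $10</li><li>Pro: $20</li></ul>"
print(BeautifulSoup(html, "html.parser").get_text())
# -> PricingBasic: $10Pro: $20   (heading and list markers are gone)

md_text = "# Pricing\n- Basic: $10\n- Pro: $20"
print(md_text)  # the raw string still carries the heading and the list
```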
## Glossary of Terms
- Agentic Web: The specialized layer of the internet optimized for autonomous agents rather than human browsers.
- RAG (Retrieval-Augmented Generation): The process where an LLM retrieves external data to ground its response.
- Vector Database: A database that stores data as high-dimensional vectors, enabling semantic search.
- Grounding: The act of connecting an AI’s generation to a verifiable source of truth to prevent hallucination.
- Zero-Shot: The ability of a model to perform a task without seeing any examples.
- Token: The basic unit of text for an LLM (roughly 0.75 words).
- Inference Cost: The computational expense required to generate a response.