When an AI bot scrapes your content for RAG (Retrieval-Augmented Generation), it doesn’t digest the whole page at once. It splits it into “chunks.” The quality of these chunks determines whether your content answers the user’s question or gets discarded.
Your HTML heading structure (H1 through H6) is the primary roadmap for this chunking process.
The Semantic Splitter
Most modern RAG pipelines (like LangChain or LlamaIndex) use “Recursive Character Text Splitters” or “Markdown Header Splitters.” They look for # or ## as natural break points to segment the text.
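The mechanics are easy to sketch in plain Python. This is not LangChain's actual implementation, just a minimal illustration of header-based splitting; the function name and chunk shape here are made up for the example:

```python
import re

def split_on_headers(markdown: str) -> list[dict]:
    """Split a Markdown document into chunks at H1/H2 headings,
    keeping each heading as metadata for its chunk."""
    chunks, current = [], {"heading": None, "lines": []}
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,2})\s+(.*)", line)
        if m:  # a new # or ## heading starts a new chunk
            if current["lines"]:
                chunks.append(current)
            current = {"heading": m.group(2), "lines": []}
        else:
            current["lines"].append(line)
    if current["lines"] or current["heading"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
            for c in chunks]

doc = "# Intro\nWhat RAG is.\n## Chunking\nHow splitters work."
for chunk in split_on_headers(doc):
    print(chunk["heading"], "->", chunk["text"])
```

Content under a clear `##` becomes a self-contained chunk with its heading attached; content under no heading at all becomes an anonymous blob the retriever may never surface.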
HTML is for browsers; Markdown is for brains.
LLMs are trained heavily on GitHub repositories, StackOverflow, and technical documentation. This makes Markdown their “native” format. They “think” in Markdown.
Token Efficiency
Markdown is less verbose than HTML.
- HTML: <h1>Title</h1> (14 characters).
- Markdown: # Title (7 characters).
- HTML List: <ul><li>Item</li></ul> (22 characters).
- Markdown List: - Item (6 characters).
Across a 2,000-word document, these small savings add up to thousands of tokens. A clean Markdown file consumes far fewer tokens than its HTML equivalent, allowing more content to fit into the context window.
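You can estimate the difference yourself. The sketch below uses a crude ~4-characters-per-token heuristic (an assumption for illustration; real counts depend on the tokenizer):

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Actual counts vary by tokenizer (this is an assumption).
    return max(1, round(len(text) / 4))

pairs = {
    "heading": ("<h1>Title</h1>", "# Title"),
    "list":    ("<ul><li>Item</li></ul>", "- Item"),
}
for name, (html, md) in pairs.items():
    print(f"{name}: HTML {len(html)} chars (~{approx_tokens(html)} tok), "
          f"Markdown {len(md)} chars (~{approx_tokens(md)} tok)")
```

Per element the gap is only a handful of characters, but markup-heavy pages repeat those elements hundreds of times.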
JavaScript-heavy sites have always been tricky for crawlers. For agents, the problem is compounded by cost: running a headless browser to render React/Vue apps is expensive and slow.
The Economics of Rendering
- HTML Fetch: $0.0001 / page.
- Headless Render: $0.005 / page (50x more expensive).
If you are an AI company crawling billions of pages, you will skip the expensive ones. This means if your content requires JS to render, you are likely being skipped by the long-tail of AI agents.
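Back-of-the-envelope, using the per-page figures above (illustrative prices from this article, not real vendor rates):

```python
# Cost of crawling 1 billion pages at the per-page prices above.
PAGES = 1_000_000_000
COST_HTML_FETCH = 0.0001   # $/page, raw HTML fetch
COST_HEADLESS   = 0.005    # $/page, headless render

print(f"Raw HTML:  ${PAGES * COST_HTML_FETCH:,.0f}")
print(f"Headless:  ${PAGES * COST_HEADLESS:,.0f}")
print(f"Ratio:     {COST_HEADLESS / COST_HTML_FETCH:.0f}x")
```

At a billion pages, the difference is $100,000 versus $5,000,000 per crawl pass. That is the gap that makes JS-only content easy to skip.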
Modern web development loves “Hydration.” A server sends a skeleton HTML page, and JavaScript “hydrates” it with interactivity and data. For AI agents, this is a nightmare.
The Cost of Rendering
Running a headless browser (like Puppeteer) to execute JavaScript and wait for hydration is computationally expensive. It allows for maybe 1 page fetch per second.
Fetching raw HTML allows for 100+ page fetches per second.
AI Agents are optimized for speed and token efficiency. If your content requires 5 seconds of JS execution to appear, the agent will likely timeout or skip you.
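The throughput gap compounds over a full crawl. A quick sketch using the figures above:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def pages_per_day(pages_per_second: float) -> int:
    """Pages a single crawler worker can cover in one day."""
    return int(pages_per_second * SECONDS_PER_DAY)

# Figures from the text: ~1 render/s headless vs 100+ raw fetches/s.
print(pages_per_day(1))    # headless rendering
print(pages_per_day(100))  # raw HTML fetching
```

One worker covers roughly 86,400 rendered pages a day versus 8.6 million raw fetches. A crawler allocating a fixed daily budget will drop the slow pages first.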
The metadata block at the top of a Markdown file, known as Frontmatter, is the most valuable real estate for MCP-SEO. It is structured data that sits before the content, framing the model’s understanding.
Beyond Title and Date
Most Hugo or Jekyll sites just use title and date. To optimize for retrieval, you should inject semantic richness here.
Recommended Fields
- summary: A dense 50-word abstract. Agents often read this first to decide if the full document is worth processing.
- keywords: Explicit vector keywords. “Neuroscience, synaptic plasticity.”
- entities: A list of named entities. ["Elon Musk", "Tesla", "SpaceX"].
- complexity: “Beginner” | “Advanced”. Helps the agent match the user’s expertise level.
Example Frontmatter
---
title: "The Physics of Black Holes"
summary: "A technical overview of event horizons and Hawking radiation."
complexity: "PhD"
entities:
- Stephen Hawking
- Albert Einstein
tags: ["Astrophysics", "Gravity"]
---
The Retriever’s Shortcut
Many RAG systems index the Frontmatter separately or weight it more heavily. By putting your core concepts in key-value pairs, you are essentially hand-feeding the indexer. You are saying, “This is exactly what this file is about.”
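A sketch of what that shortcut looks like on the retriever side. The parser below handles only flat key: value frontmatter (a real pipeline would use a YAML library), and the 3x boost is an arbitrary illustrative weight, not any particular system’s value:

```python
def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split a document into (metadata, body).
    Minimal: only flat `key: value` string fields are handled."""
    meta = {}
    if not text.startswith("---"):
        return meta, text
    header, _, body = text[3:].partition("\n---")
    for line in header.strip().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    return meta, body.strip()

def keyword_weights(meta: dict, body: str, boost: int = 3) -> dict:
    """Toy term-frequency index that counts frontmatter terms
    `boost` times more heavily than body terms."""
    weights: dict = {}
    for word in body.lower().split():
        weights[word] = weights.get(word, 0) + 1
    for value in meta.values():
        for word in value.lower().split():
            weights[word] = weights.get(word, 0) + boost
    return weights

doc = '---\ntitle: "Black Holes"\nsummary: "Event horizons explained."\n---\nBlack holes trap light.'
meta, body = parse_frontmatter(doc)
print(meta)
print(keyword_weights(meta, body))
```

Terms that appear in both the frontmatter and the body end up dominating the index, which is exactly the hand-feeding effect described above.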
The World Wide Web was built on HTML (HyperText Markup Language). The “HyperText” part was designed for non-linear human reading—clicking from link to link. The “Markup” was designed for browser rendering—painting pixels on a screen. Neither of these design goals is ideal for Artificial Intelligence.
When an LLM “reads” the web, HTML is noise. It is full of <div>, <span>, class="flex-col-12", and tracking scripts. To get to the actual information, the model must perform “DOM Distillation,” a messy and error-prone process. We are witnessing the birth of a new standard for Machine-Readable Content.
Cloaking—the practice of serving different content to search engine bots than to human users—has traditionally been considered one of the darkest “black hat” SEO tactics. Search engines like Google have historically penalized sites severely for showing optimized text to the crawler while displaying images or Flash to the user. However, as we transition into the era of Agentic AI, the definition of cloaking is undergoing a necessary evolution. We argue that “Agent Cloaking” is not only ethical but essential for the future of the web.
The ultimate form of “white hat cloaking” is Content Negotiation. It is the practice of serving different file formats based on the requestor’s capability.
If a request includes Accept: application/json, why serve HTML?
- Human Browser: Accept: text/html. Serve the webpage.
- AI Agent: Accept: application/json or text/markdown. Serve the data.
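The dispatch logic is only a few lines. This sketch ignores quality values (q=) and the finer points of HTTP content negotiation; the function name and return labels are illustrative:

```python
def negotiate(accept_header: str) -> str:
    """Pick a response format from the request's Accept header.
    A production server would parse q-values per RFC 9110;
    this sketch just checks for machine-readable types first."""
    accept = accept_header.lower()
    if "application/json" in accept:
        return "json"
    if "text/markdown" in accept:
        return "markdown"
    return "html"  # default: serve the webpage to browsers

print(negotiate("text/html,application/xhtml+xml"))  # browser request
print(negotiate("application/json"))                 # agent request
```

The same URL stays canonical for both audiences; only the representation changes, which is the distinction that keeps this on the white-hat side of cloaking.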
The “Headless SEO” Approach
This approach creates the most efficient path for agents to consume your content without navigating the DOM.
Instead of forcing the agent to: