In the rush to build “AI-Powered” search experiences, engineers have hit a wall. They built powerful vector databases. They fine-tuned state-of-the-art embedding models. They scraped millions of documents. And yet, their Retrieval-Augmented Generation (RAG) systems still hallucinate. They still retrieve the wrong paragraph. They still confidently state that “The refund policy is 30 days” when the page actually says “The refund policy is not 30 days.”
Why? Because they are feeding their sophisticated models “garbage in.” They are feeding them raw text stripped of its structural soul. They are feeding them flat strings instead of hierarchical knowledge.
The solution isn’t a larger context window. It isn’t a better embedding model. It is the oldest technology on the web: Semantic HTML.
The Problem with Arbitrary Chunking
When a RAG system ingests data, it must “chunk” it. An LLM cannot process an entire library at once (yet). So, engineers split documents into smaller pieces to be stored in a vector database.
The most common method is Fixed-Size Chunking. “Take every 512 tokens and make a chunk.”
This is catastrophic for meaning. Imagine cutting a book into strips of paper exactly 100 words long. You might cut a sentence in half. You might separate a question from its answer. You might sever a header from the paragraphs it governs.
Semantic Chunking is the antidote. It uses the DOM (Document Object Model) to define boundaries.
- Start a chunk at <h2>.
- End a chunk at the next <h2>.
- Keep <ul> lists together.
- Keep <table> rows together.
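To make the boundary rules concrete, here is a minimal sketch of a heading-bounded chunker in Python using BeautifulSoup. The function name and chunk format are illustrative assumptions (this is not OpenClaw’s actual code), and nested-block edge cases are glossed over.

```python
# Minimal sketch: split a page into chunks at <h2>/<h3> boundaries,
# keeping lists and tables whole inside the current chunk.
from bs4 import BeautifulSoup

def chunk_by_headings(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    # Prefer the semantic root if the page provides one.
    root = soup.find("article") or soup.find("main") or soup
    chunks, current = [], {"heading": None, "text": []}

    # Note: nested blocks (e.g. a <p> inside a <li>) are not deduplicated here.
    for el in root.find_all(["h2", "h3", "p", "ul", "table"]):
        if el.name in ("h2", "h3"):
            if current["text"]:                  # close the previous chunk
                chunks.append(current)
            current = {"heading": el.get_text(strip=True), "text": []}
        else:
            current["text"].append(el.get_text(" ", strip=True))

    if current["text"]:
        chunks.append(current)
    return chunks
```

Each chunk carries its governing heading as metadata, which is exactly the association that fixed-size chunking destroys.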
But this only works if the HTML has semantic tags. If your document is just a thousand nested <div>s, the chunker has no way to know where one idea ends and another begins.
DOM-Aware Chunking: A Technical Breakdown
Let’s look at how a DOM-Aware Chunker (like the one used in OpenClaw) parses a semantic page versus a non-semantic one.
| Semantic HTML | Chunking Strategy | Result |
|---|---|---|
| <article> | Root Entity | Identifies the main topic of the page. |
| <h2>Return Policy</h2> | Chunk Start / Heading | “Return Policy” becomes the metadata key for the chunk. |
| <p>You have 30 days...</p> | Chunk Body | The text is associated with the “Return Policy” key. |
| <div class="sidebar"> | Noise Filter | Skipped or down-weighted. |
| <h3>Exceptions</h3> | Sub-Chunk Start | “Exceptions” is linked as a child of “Return Policy”. |
In a non-semantic page, the chunker sees:
```html
<div><strong>Return Policy</strong></div>
<div>You have...</div>
```
It has to guess: is “Return Policy” a header, or just bold text? If it guesses wrong, it might attach the text to the previous section. Suddenly, your “Shipping Policy” chunk contains the text “You have 30 days to return,” leading the AI to hallucinate that shipping takes 30 days.
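When the markup carries no signal, a chunker can only fall back on heuristics. The snippet below is a hypothetical example of such a guess (not how any particular crawler actually behaves): promote a short, bold-only <div> to a pseudo-heading and hope for the best.

```python
# Hypothetical fallback heuristic for non-semantic markup:
# treat a short <div> whose only content is <strong>/<b> text as a heading.
from bs4 import BeautifulSoup

def looks_like_heading(div) -> bool:
    bold = div.find(["strong", "b"])
    text = div.get_text(strip=True)
    return (
        bold is not None
        and bold.get_text(strip=True) == text  # nothing but the bold run
        and len(text) < 60                      # arbitrary length cutoff
    )

soup = BeautifulSoup(
    "<div><strong>Return Policy</strong></div><div>You have...</div>",
    "html.parser",
)
for div in soup.find_all("div"):
    print(div.get_text(strip=True), "-> heading?", looks_like_heading(div))
```

It works until it doesn’t: a bolded warning, a short promotional blurb, or a styled label will all pass the same test.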
Tables: The Ultimate Grounding Mechanism
We previously discussed tables in the context of training, but for Retrieval, they are even more critical.
A table is a deterministically grounded data structure. It is one of the few places in a document where every value is explicitly bound to a row and a column header, leaving almost no room for ambiguity.
- Reference: Web Content Accessibility Guidelines (WCAG) on Tables
- Reference: MDN Documentation on the Table Element
When an agent retrieves a semantic table, it doesn’t just get text; it gets coordinates.
Cell(Row=3, Col=2) is “Price: $50”.
This allows for Symbolic Reasoning on top of Vector Search. The agent can execute “SQL-like” queries on your content. “SELECT Price FROM Product WHERE Name = ‘Widget’”.
If you use <div>s to fake a table, you force the model to rely on spatial inference (token distance), which is notoriously unreliable for precise data extraction. You are essentially asking the model to “eyeball it.”
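As a sketch of what that looks like in practice, the following Python extracts a standard <table> (a header row of <th> cells is assumed, and the example data is made up) into keyed records and answers the lookup symbolically rather than by similarity.

```python
# Sketch: turn a semantic <table> into keyed records, then answer a
# "SELECT Price WHERE Name = 'Widget'"-style lookup. Example data is made up.
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$50</td></tr>
  <tr><td>Gadget</td><td>$75</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
headers = [th.get_text(strip=True) for th in soup.find_all("th")]
rows = [
    dict(zip(headers, (td.get_text(strip=True) for td in tr.find_all("td"))))
    for tr in soup.find_all("tr")
    if tr.find("td")  # skip the header row
]

# Symbolic lookup on top of the extracted structure, no embeddings involved.
price = next(row["Price"] for row in rows if row["Name"] == "Widget")
print(price)  # $50
```

No token-distance guessing is involved: the answer is read off the structure.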
Citation & The id Attribute
One of the biggest demands for Agentic AI is Verifiability. Users want to know where the answer came from. They want a citation.
Semantic HTML provides the perfect mechanism for this: Fragment Identifiers.
If you structure your content with semantic IDs, you allow agents to deep-link directly to the proof.
```html
<section id="refund-policy">
  <h2>Refund Policy</h2>
  <p>...</p>
</section>
```
The agent can now cite: https://example.com/page#refund-policy.
If your page is a soup of dynamic, auto-generated IDs (like the React-style <div id="root-3490f9">), the agent cannot reliably cite a stable section. The citation breaks on the next rebuild. Stable, semantic IDs are the API endpoints of your content.
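A retrieval pipeline can exploit this directly. The sketch below, assuming BeautifulSoup and a placeholder page URL, collects every <section> that carries a stable id and builds a citation link for it.

```python
# Sketch: build deep-link citations from stable section ids.
# The URL and output format are illustrative, not a real agent API.
from bs4 import BeautifulSoup

PAGE_URL = "https://example.com/page"  # placeholder source URL

html = """
<section id="refund-policy">
  <h2>Refund Policy</h2>
  <p>You have 30 days...</p>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
citations = {
    sec.find(["h2", "h3"]).get_text(strip=True): f"{PAGE_URL}#{sec['id']}"
    for sec in soup.find_all("section", id=True)
    if sec.find(["h2", "h3"])
}
print(citations)  # {'Refund Policy': 'https://example.com/page#refund-policy'}
```

The key property is stability: the fragment only makes a useful citation if it survives the next deploy.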
Grounding: Looking at the Research
Research from Microsoft and others on Retrieval Augmented Generation suggests that “document structure awareness” significantly boosts performance.
When models are fed structured inputs (like JSON or Semantic HTML), their “Hallucination Rate” drops. Why? Because the model doesn’t have to spend its “cognitive budget” (attention and compute) figuring out the layout. It can focus entirely on the content.
It’s the difference between reading a clean PDF and reading a crumpled napkin.
OpenClaw: The Agentic Crawler
Our own internal crawler, OpenClaw, operates on these principles. It is designed to be “DOM-Aware.” Unlike Googlebot, which renders the page to see what it looks like, OpenClaw parses the page to see what it means.
- It looks for <main> to find the primary content.
- It looks for <nav> to learn the site topology.
- It looks for <time> to ground facts in history.
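In spirit, that parsing step looks something like the sketch below (an illustration of the idea, not OpenClaw’s actual source).

```python
# Sketch: pull meaning straight from the semantic landmarks
# instead of rendering the page to see what it looks like.
from bs4 import BeautifulSoup

def parse_landmarks(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    main = soup.find("main")
    nav = soup.find("nav")
    times = soup.find_all("time")
    return {
        "primary_content": main.get_text(" ", strip=True) if main else None,
        "site_topology": [a.get("href") for a in nav.find_all("a")] if nav else [],
        "dates": [t.get("datetime") or t.get_text(strip=True) for t in times],
    }
```

A page without those landmarks returns an empty skeleton, which is exactly how an agent experiences it.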
If you block OpenClaw (or similar agentic crawlers like GPTBot) from seeing this structure—by serving them a JavaScript shell that requires client-side hydration—you are effectively invisible to the agentic web. You are a blank page.
The Cost of “Hydration”
This brings us to a critical point: Server-Side Rendering (SSR) vs. Client-Side Rendering (CSR).
Agents are expensive to run. They pay per token. They pay per millisecond of compute. They do not want to run a headless browser just to read your text.
If your Semantic HTML is only visible after JavaScript executes (Hydration), you are forcing the agent to spend 100x more resources to read your page. In an economic model of search, agents will prioritize “cheap” (semantic, server-rendered) sources over “expensive” (JS-heavy) sources.
Key Definition: Hydration is the process of using client-side JavaScript to add application state and interactivity to server-rendered HTML. For agents, it is often unnecessary overhead.
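A quick way to see your site the way a non-rendering agent does is to fetch the raw HTML without executing any JavaScript and check whether the landmarks carry content. A minimal sketch, assuming the requests library and a placeholder URL:

```python
# Sketch: what a non-rendering agent sees. If <main> is empty before
# JavaScript runs, the content only exists after hydration.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/page", timeout=10)  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

main = soup.find("main")
text = main.get_text(strip=True) if main else ""
if len(text) < 200:  # arbitrary threshold for "effectively empty"
    print("Little or no server-rendered content: agents see a blank page.")
else:
    print(f"{len(text)} characters of content visible without JavaScript.")
```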
Conclusion: Build an API, Not a Brochure
The website of 2030 is not a brochure. It is an API. It is a database that agents query to get answers for their users.
Semantic HTML is the schema of that database. It is the contract you sign with the AI. You promise: “This header accurately describes the text below it.” “This table accurately represents the data.” “This date is the real publication date.”
If you break that contract—if you serve Div Soup—the agents will stop visiting. They will trust your competitors who speak their language.
Further Reading: