In the early days of the web, we were told to use Semantic HTML for accessibility. We were told it allowed screen readers to navigate our content, providing a better experience for the visually impaired. We were told it might help SEO, though Google’s engineers were always famously coy about whether an <article> tag carried significantly more weight than a well-placed <div>.
In 2025, that game has changed entirely. We are no longer just optimizing for screen readers or the ten blue links on a search results page. We are optimizing for the training sets of Large Language Models (LLMs).
When an LLM like GPT-5, Claude 3.5 Opus, or the latest open-source Llama derivative consumes the web, it doesn’t just read the visible text. It ingests the code. And in the eyes of a model, Semantic HTML is not just “nice to have”—it is the difference between high-fidelity knowledge and noisy, hallucination-prone garbage. It is the fuel that powers accurate retrieval, and the map that guides the model through the chaotic landscape of the internet.
The ‘Div Soup’ Problem in Vector Space
Imagine you are trying to learn a complex subject from a textbook, but all the formatting has been stripped away. No chapter titles, no bold text to highlight key terms, no paragraph breaks, no distinct sections, just a continuous, unbroken stream of characters. This is exactly what we feed LLMs when we serve them "Div Soup": nested layers of generic <div> and <span> tags with no semantic meaning attached.
LLMs rely heavily on Attention Mechanisms. In the transformer architecture, “attention” is the mathematical process that determines how strongly one token relates to another. Structure is a primary, if often overlooked, signal for attention.
| HTML Structure | LLM Interpretation | Attention Weight Estimate | Reliability Score |
|---|---|---|---|
| <div><b>Topic</b></div> | Text with bold styling. Ambiguous importance. Could be a header, a button, or an ad. | Low | 2/10 |
| <h1>Topic</h1> | Document Root / Primary Entity. High probability of being the main subject. | Critical | 10/10 |
| <div class="nav">Link</div> | Generic container. Could be content, could be noise. Relies on class name heuristics. | Low/Noise | 3/10 |
| <nav><a>Link</a></nav> | Navigation/boilerplate. Explicitly separated from the "Knowledge Graph" of the page. | Excluded/Low | 0/10 (Correctly Ignored) |
| <tr><td>Data</td></tr> | Tabular relationship. Row implies a hard connection between cells in the same row. | High (Relational) | 9/10 |
| <aside>Content</aside> | Tangential information. Related but not core to the main thesis. | Moderate (Contextual) | 5/10 |
When an ingestion engine sees <main>, it immediately knows: "Everything inside here is the signal." When it sees <footer> or <nav>, it correctly identifies: "This is the noise." By using semantic tags, you are essentially pre-labeling your data for the model, significantly reducing the compute required to understand it and increasing the likelihood that your content is weighted correctly in the vector space.
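As a minimal sketch (the page content and URLs are placeholders), here is what that pre-labeling looks like in practice:

```html
<body>
  <header>
    <nav><a href="/docs">Docs</a></nav>         <!-- boilerplate: navigation chrome -->
  </header>
  <main>
    <article>
      <h1>Semantic HTML for Ingestion</h1>      <!-- primary entity of the page -->
      <p>The actual knowledge lives here.</p>   <!-- the signal -->
    </article>
    <aside>Related posts</aside>                <!-- tangential, not the thesis -->
  </main>
  <footer>© 2025 Example Co.</footer>           <!-- boilerplate: chrome -->
</body>
```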
Boilerplate Detection and the “Chrome” of the Web
One of the most persistent challenges in creating massive training datasets (like Common Crawl, C4, or RefinedWeb) is Boilerplate Removal. We have written extensively about Boilerplate Detection Algorithms, but the short version is this: algorithms need to separate the “Chrome” (menus, sidebars, footers, copyright notices) from the “Content” (the actual article, product description, or forum discussion).
In a visually rendered page, this distinction is obvious to the human eye. We intuitively ignore the sidebar. In code, however, this is statistically difficult.
- The Bad Approach: Relying on DOM depth or text-to-tag ratios. This is computationally expensive and prone to error. A long navigation menu can look like a list of important topics.
- The Agentic Approach: Using explicit semantic boundaries (<article>, <aside>, <header>).
When you use <aside> for your sidebar, you are giving the crawler a “Skip Link.” You are saying, “This content is tangentially related, but do not hallucinate that it is part of the main thesis.” This distinction is vital for training accuracy. If your “Related Posts” or “Most Popular” widget is in the same semantic container as your article, an LLM might accidentally conflate the entities. This leads to training errors where the model believes Article A discusses Topic B simply because they appeared in the same <div> block without a semantic separator.
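For illustration (the titles and URLs here are invented), keeping the widget in its own <aside> draws that boundary explicitly:

```html
<article>
  <h1>How Attention Mechanisms Weigh Structure</h1>
  <p>The main thesis of the article lives here.</p>
</article>

<!-- Outside the <article>, so a crawler should not attribute these topics to it -->
<aside aria-label="Related posts">
  <ul>
    <li><a href="/posts/css-grid-retrospective">A CSS Grid Retrospective</a></li>
    <li><a href="/posts/seo-myths">Ten SEO Myths</a></li>
  </ul>
</aside>
```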
The Nightmare of Flattened Tables
Perhaps the single greatest tragedy in LLM training data is the widespread destruction of tables.
In modern responsive design, it became trendy in the 2010s to replace <table> elements with CSS Grid or Flexbox layouts using <div>s to achieve specific visual effects on mobile devices. While this looks fine on a mobile phone, it destroys the underlying logic of the data.
A <table> is a knowledge graph in miniature. It defines explicit, 2-dimensional relationships between headers and cells.
- Row 1, Column A maps to Row 1, Column B.
- Column Header A defines the ontology of all cells in that column.
When you replace this with <div>s, you force the LLM to statistically guess the relationships based on visual proximity (if it even renders the CSS, which most training crawlers do not) or DOM proximity. This is often where "hallucinations" in data extraction come from. The model sees a list of numbers and a list of names but loses the rigid coordinate system that binds them together. Use <table>, <thead>, <tbody>, <th>, and <td>. Your data fidelity depends on it.
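A minimal sketch of the same two-column dataset both ways (product names and prices are made up):

```html
<!-- Flattened: the model must guess which price belongs to which product -->
<div class="row"><div>Widget A</div><div>19</div></div>
<div class="row"><div>Widget B</div><div>24</div></div>

<!-- Semantic: the headers define the ontology of every cell beneath them -->
<table>
  <thead>
    <tr><th scope="col">Product</th><th scope="col">Price (USD)</th></tr>
  </thead>
  <tbody>
    <tr><td>Widget A</td><td>19</td></tr>
    <tr><td>Widget B</td><td>24</td></tr>
  </tbody>
</table>
```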
As noted in recent research on Table Structure and LLMs, models struggle significantly with “serialized” or flattened tables. Explicit <table> markup is the most efficient way to transfer structured data into a model’s weights without relying on complex OCR or screenshot-to-text pipelines.
Token Efficiency as a Ranking Factor
There is also a purely economic argument for Semantic HTML in training: Token Efficiency.
LLM training is bounded by compute and context window size. Every token costs money to process. Semantic tags are “dense” tokens—they carry a massive amount of meta-information in a few characters.
Consider a list of items:
The Div Way:
```html
<div class="list-item">Item 1</div>
<div class="separator"></div>
<div class="list-item">Item 2</div>
```
The Semantic Way:
```html
<ul>
  <li>Item 1</li>
  <li>Item 2</li>
</ul>
```
The semantic version is not only shorter (fewer tokens) but also unambiguous. The <ul> tag implies an unordered-set relationship. The model knows that the order of items does not matter, which affects how it encodes the information. An <ol> (Ordered List) would imply hierarchy or sequence. These subtle cues are "free" logic gates for the neural network, allowing it to compress the information more effectively during training.
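A short illustration (the items are arbitrary) of how the two list elements encode different logic:

```html
<!-- Unordered: membership matters, sequence does not -->
<ul>
  <li>HTML</li>
  <li>CSS</li>
  <li>JavaScript</li>
</ul>

<!-- Ordered: the sequence is part of the meaning -->
<ol>
  <li>Parse the markup</li>
  <li>Strip the boilerplate</li>
  <li>Tokenize the content</li>
</ol>
```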
Optimizing for the C4 Dataset and Beyond
The Colossal Clean Crawled Corpus (C4) is one of the foundational datasets for modern LLMs (used to train T5 and others). Its cleaning pipeline specifically looks for “quality” signals to filter out low-value content. While the exact heuristics evolve, proper document structure is a strong proxy for quality.
If we look at how Google's BERT and subsequent models process inputs, they rely heavily on "Sentence Embeddings." Semantic HTML helps define the boundaries of these sentences and paragraphs. A <p> tag is a hard delimiter. A <br> tag is a soft delimiter. A <div> is ambiguous.
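A quick sketch of the difference between those delimiters:

```html
<p>This is one complete, cleanly bounded thought.</p>            <!-- hard delimiter -->
<p>Line one of an address<br>Line two of the same address</p>    <!-- soft break inside one unit -->
<div>A paragraph? A label? A layout cell? Ambiguous.</div>       <!-- no stated boundary semantics -->
```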
By adhering to strict Semantic HTML, you are optimizing your content for Ingestion Compatibility. You are ensuring that when the “Vacuum Cleaners” of the AI world (like Common Crawl, or specific agentic scrapers) come by, your content is sucked up, parsed, and weighted with the highest possible fidelity.
The Hierarchy of Heading Tags: A Table of Contents for the Mind
Heading tags (<h1> through <h6>) are the skeleton of your content. In the era of skim-reading humans, they were useful visual anchors. In the era of deep-reading AIs, they are the Table of Contents for the Mind.
When an LLM processes a long context window—say, 100,000 tokens—it searches for structure to organize that information. Headings act as "Embedding Anchors." They allow the model to segment the text into coherent chunks.
- <h1>: The global context. The "Subject" of the entire vector.
- <h2>: The major branches of the topic.
- <h3>: The specific leaves or details.
If you skip heading levels (e.g., jumping from <h1> to <h4> because you liked the font size), you are breaking the logical tree. You are telling the model, “This information is deeply nested,” when it is actually a top-level concept. This confuses the model’s understanding of the relationship between concepts. It is akin to putting a “Chapter 1” heading inside a footnote.
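A quick sketch (topic names are placeholders) of an intact tree versus a broken one:

```html
<!-- Broken tree: "Pricing" reads as a deeply nested detail of nothing in particular -->
<h1>Product Guide</h1>
<h4>Pricing</h4>

<!-- Intact tree: "Pricing" is a major branch, "Enterprise Tier" a leaf beneath it -->
<h1>Product Guide</h1>
<h2>Pricing</h2>
<h3>Enterprise Tier</h3>
```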
Semantic Tags: A Dictionary for Agents
Let’s look at some specific tags and what they whisper to the AI:
- <figure> and <figcaption>: This is the gold standard for image recognition. It explicitly binds an image to its description. "This text explains this image." Without it, the model has to guess if the text below the image is a caption or just the next paragraph.
- <details> and <summary>: These tags are fascinating for "Progressive Disclosure." They tell the model, "Here is a high-level summary, and here is the deep-dive detail." This structure is incredibly useful for training models to summarize content themselves.
- <address>: This isn't just for physical addresses. It defines contact information for the author or owner of a document. It helps specific "Entity Extraction" tasks identify the Who behind the What.
- <time>: Dates are notoriously difficult for LLMs to parse because of different formats (DD/MM/YYYY vs MM/DD/YYYY). The <time datetime="2025-11-15"> attribute provides a standardized, machine-readable timestamp that grounds the content in temporal reality.
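Putting a few of these together in one small sketch (the author, date, and image are placeholders):

```html
<figure>
  <img src="attention-weights.png" alt="Heatmap of attention weights across layers">
  <figcaption>Figure 1: attention weights by layer. This text explains this image.</figcaption>
</figure>

<details>
  <summary>Why structure changes attention (high-level summary)</summary>
  <p>The deep-dive detail lives here, explicitly subordinate to the summary above.</p>
</details>

<address>Written by <a href="mailto:editor@example.com">The Editor</a></address>

<p>Published <time datetime="2025-11-15">November 15, 2025</time>.</p>
```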
Conclusion: Structure is the New SEO
We used to say “Content is King.” In the Agentic Era, Structure is King. Content without structure is just noise in the vector space. It is sand in the gears of the incredible machines we are building.
By returning to the fundamentals of Semantic HTML, we are not just building better websites for 2025; we are building the training manuals for the superintelligences of 2030. We are ensuring that our knowledge is preserved, understood, and effectively utilized by the synthetic minds that will augment our own.
So, the next time you reach for a <div>, pause. Ask yourself: “Is there a word for this?” If it’s an article, use <article>. If it’s a list, use <ul>. If it’s a quote, use <blockquote>. Speak the language of the machine, and the machine will speak for you.