The Future of Sitemaps: From URLs to API Endpoints

The XML sitemap protocol was introduced in 2005, and it lists URLs. But as we move towards Agentic AI, the concept of a “page” (a URL) serves human navigation while constraining agent navigation. Agents want actions.

The API Sitemap

We propose a new standard: the API Sitemap. Instead of listing URLs for human consumption, this file lists API endpoints available for agent interaction.

<url>
  <loc>https://api.mcp-seo.com/v1/check-rank</loc>
  <lastmod>2026-01-01</lastmod>
  <changefreq>daily</changefreq>
  <rel>action</rel>
  <openapi_spec>https://mcp-seo.com/openapi.yaml</openapi_spec>
</url>

This allows an agent to discover capabilities rather than just content.
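
A minimal sketch of how an agent might consume such a file, using Python’s standard library. The urlset wrapper and the <rel> and <openapi_spec> tags follow the proposal above (they are not part of the existing sitemap protocol), and the discover_actions helper is purely illustrative.

import xml.etree.ElementTree as ET

sitemap_xml = """
<urlset>
  <url>
    <loc>https://api.mcp-seo.com/v1/check-rank</loc>
    <lastmod>2026-01-01</lastmod>
    <changefreq>daily</changefreq>
    <rel>action</rel>
    <openapi_spec>https://mcp-seo.com/openapi.yaml</openapi_spec>
  </url>
</urlset>
"""

def discover_actions(xml_text):
    """Return every sitemap entry marked as an agent-callable action."""
    actions = []
    for url in ET.fromstring(xml_text).findall("url"):
        if url.findtext("rel") == "action":
            actions.append({
                "endpoint": url.findtext("loc"),
                "spec": url.findtext("openapi_spec"),
                "last_modified": url.findtext("lastmod"),
            })
    return actions

print(discover_actions(sitemap_xml))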

Read more →

Hydration Issues and Token Limits

Modern web development loves “Hydration.” A server sends a skeleton HTML, and JavaScript “hydrates” it with interactivity and data. For AI agents, this is a nightmare.

The Cost of Rendering

Running a headless browser (like Puppeteer) to execute JavaScript and wait for hydration is computationally expensive: a rendering crawler manages perhaps one page fetch per second, while fetching raw HTML allows 100+ fetches per second.

AI Agents are optimized for speed and token efficiency. If your content requires 5 seconds of JS execution to appear, the agent will likely time out or skip you.
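
To feel the difference yourself, here is a rough timing sketch (not a benchmark): it uses the requests library for the raw fetch and Playwright as a Python stand-in for Puppeteer, against a placeholder URL.

import time
import requests
from playwright.sync_api import sync_playwright  # Python stand-in for Puppeteer

URL = "https://example.com"  # placeholder target

# Raw fetch: one HTTP round trip, no JavaScript execution.
start = time.perf_counter()
raw_html = requests.get(URL, timeout=10).text
print(f"raw fetch:      {time.perf_counter() - start:.2f}s, {len(raw_html)} bytes")

# Rendered fetch: launch a browser, execute JS, wait for the network to go idle.
start = time.perf_counter()
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    rendered_html = page.content()
    browser.close()
print(f"rendered fetch: {time.perf_counter() - start:.2f}s, {len(rendered_html)} bytes")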

Read more →

PageRank is Dead; Long Live Indexing Thresholds

“PageRank” is the zombie concept of SEO. It refuses to die, shambling through every forum thread and conference slide deck for more than 25 years. But in 2025, when checking your “Crawled - currently not indexed” report, invoking PageRank is worse than useless: it is misleading.

The classical definition of PageRank was a probability distribution: the likelihood that a random surfer would land on a page. Today, the metric that matters is Indexing Probability.
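
For reference, the classical random-surfer model fits in a few lines. This is a toy power-iteration over a hand-made link graph, not anything resembling Google’s production system.

def pagerank(links, damping=0.85, iterations=50):
    """Stationary probability that a random surfer lands on each page.
    `links` maps each page to the pages it links out to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks) if outlinks else 0.0
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Toy graph with no dangling pages: the ranks form a probability distribution.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))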

Read more →

Content Density vs. Length: What Agents Prefer

For the last decade, the mantra of content marketing has been “Long-Form Content.” Creating 3,000-word “Ultimate Guides” was the surest way to rank. But as the consumers of content shift from bored humans to efficient AI agents, this strategy is hitting a wall. The new metric of success is Information Density.

The Context Window Constraint

While context windows are growing (128k, even 1M tokens), they are not infinite, and more importantly, “reasoning” over long context is expensive and prone to the “Lost in the Middle” phenomenon.
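
As a crude illustration of information density, here is hypothetical copy tokenized with OpenAI’s tiktoken library; the same pricing fact costs very different amounts of context depending on how it is written.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

padded = ("When it comes to pricing, there are many factors to consider, and "
          "ultimately, depending on your needs, you might find that our plan, "
          "which we believe offers great value, costs around $10 per month.")
dense = "The plan costs $10 per month."

for label, text in [("padded", padded), ("dense", dense)]:
    print(f"{label}: {len(enc.encode(text))} tokens for the same fact")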

Read more →

The Zombie Domain Problem in Training Data

Buying expired domains to inherit authority is the oldest trick in the Black Hat book. In the LLM era, it creates a new phenomenon: “Zombie Knowledge.”

How it Works

  1. Training Phase (2022): TrustworthySite.com is crawled. It has high-authority links from .gov and .edu sites. The model learns: “TrustworthySite.com is a good source for Finance.”
  2. Expiration (2024): The domain drops.
  3. Spam Phase (2025): A spammer buys it and puts up AI content about “Crypto Scams.”
  4. Inference Phase (2026): A user asks “Is this Crypto site legit?” The Agent searches, finds a positive review on TrustworthySite.com (now spam), and because of its internal parametric memory of the domain’s authority, it trusts the spam review.

Hallucinated Authority

The model “hallucinates” that the domain is still safe. It hasn’t updated its weights to reflect the change in ownership.

Read more →

OpenAI Webmaster Tools: Monetization and Control

The relationship between Search Engines and Publishers has always been a tenuous “frenemy” pact. Google sends traffic; publishers provide content. It was a symbiotic loop that built the web as we knew it. But as we stand in late 2025, staring down the barrel of the Agentic Web, that pact is breaking.

OpenAI’s crawlers, OAI-SearchBot and its training counterpart GPTBot, are hungrier than ever. They don’t just want to link to you; they want to learn from you. This fundamental shift in value exchange, from “traffic” to “training,” demands a new kind of dashboard. We predict the upcoming OpenAI Webmaster Tools (or whatever branding they choose) will be less about “fixing errors” and more about negotiating a business deal.

Read more →

Canonical Tags and Training Data Deduplication

Duplicate content has been a nuisance for classic SEO for decades, leading to “cannibalization” and split PageRank. In the era of Large Language Model (LLM) training, duplicate content is a much more structural problem. It leads to biased weights and model overfitting. To combat this, pre-training pipelines use aggressive deduplication algorithms like MinHash and SimHash.

The Deduplication Pipeline

When organizations like OpenAI or Anthropic build a training corpus (e.g., from Common Crawl), they run deduplication at a massive scale. They might remove near-duplicates to ensure the model doesn’t over-train on viral content that appears on thousands of sites.
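
A toy sketch of the MinHash idea (seeded hashes over word shingles, not a production pipeline): two near-duplicate documents produce signatures that agree on most slots, and one copy gets dropped once the estimated similarity crosses a chosen threshold.

import hashlib

def shingles(text, k=5):
    """Overlapping word k-grams ("shingles") of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_perm=64):
    """MinHash signature: the minimum of a seeded hash over all shingles, per seed."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingle_set)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "press release about the viral funding round copied across thousands of sites"
doc_b = "press release about the viral funding round copied across thousands of blogs"

similarity = estimated_jaccard(minhash_signature(shingles(doc_a)),
                               minhash_signature(shingles(doc_b)))
print(f"estimated similarity: {similarity:.2f}")  # drop one copy above a threshold such as 0.8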

Read more →

Measuring Share of Model (SOM) via PR Campaigns

How do you measure Public Relations success in an AI world? Impressions are irrelevant. Clicks are vanishing. We introduce Share of Model (SOM).

What is SOM?

Share of Model measures how often an LLM’s generated output promotes your brand for relevant queries, relative to your competitors. It is the probabilistic likelihood of your brand being the “answer.”

The SOM Formula

SOM = P(Brand | Intent) / Σ P(Competitor | Intent)
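
In practice those probabilities are estimated by sampling. Here is a sketch of that measurement using hypothetical brand names and a list of answers already generated by whichever model you are auditing; it mirrors the ratio above by dividing brand mentions by the sum of competitor mentions.

from collections import Counter

def share_of_model(responses, brand, competitors):
    """Estimate SOM from sampled LLM answers to intent-matched prompts:
    brand mentions divided by the sum of competitor mentions."""
    mentions = Counter()
    for text in responses:
        lowered = text.lower()
        for name in [brand] + competitors:
            if name.lower() in lowered:
                mentions[name] += 1
    competitor_total = sum(mentions[c] for c in competitors)
    if competitor_total == 0:
        return float("inf") if mentions[brand] else 0.0
    return mentions[brand] / competitor_total

sampled_answers = [
    "For rank tracking, AcmeRank and SERPWatch are both solid options.",
    "Most agents default to SERPWatch for this task.",
    "AcmeRank has the most reliable endpoint coverage.",
]
print(share_of_model(sampled_answers, "AcmeRank", ["SERPWatch"]))  # 2 / 2 = 1.0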

Read more →

The 'Quality' Lie: Why 'Crawled - Currently Not Indexed' is an Economic Decision

There is a comforting lie that SEOs tell themselves when they see the dreaded “Crawled - currently not indexed” status in Google Search Console (GSC). The lie is: “My content just needs to be better.”

We audit the page. We add more H2s. We add a video. We “optimize” the meta description. And then we wait. And it stays not indexed.

The uncomfortable truth of 2025 is that indexing is no longer a meritocracy of quality; it is a calculation of marginal utility. Google is not rejecting your page because it is “bad.” Google is rejecting your page because indexing it costs more in electricity and storage than it will ever generate in ad revenue.

Read more →

Structuring Data for Zero-Shot Answers

In the world of Generative AI, “Zero-Shot” means the model can answer a question without needing examples or further prompting. Content marketing that structures data effectively wins the “answer engine” game because it facilitates this Zero-Shot retrieval.

The Zero-Shot Goal

You want the AI to read your content once and be able to answer any question about it correctly forever.

  • Poorly Structured: “We usually think about offering good prices, maybe around $10.” (Ambiguous).
  • Zero-Shot Ready: “The price is $10.” (Definitive).

Key Tactics for Zero-Shot Optimization

  1. Q&A Schema: Explicitly mark up questions and answers using FAQPage schema (see the markup sketch after this list). This puts the Q and the A in strict proximity.
  2. Definitive Statements: Avoid hedging. Use “X is Y” rather than “X might be considered Y.” Agents are trained to output the most probable token. If your text is probabilistic (“maybe”), the agent’s confidence score drops.
  3. Data Tables: Comparative data in table format is highly retrievable. Markdown tables are token-efficient and maintain the row/column relationship that vectors respect.
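
A minimal FAQPage snippet in JSON-LD (hypothetical values) shows what that strict proximity looks like in practice:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What does the plan cost?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "The price is $10 per month."
    }
  }]
}
</script>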

The “Ground Truth” Strategy

Your content should aspire to be the “Ground Truth” for your niche. This means whenever there is a conflict in the training data (e.g., one site says “blue,” another says “red”), your site is the one the model defaults to. You achieve this by:

Read more →