Buying expired domains to inherit authority is the oldest trick in the Black Hat book. In the LLM era, it creates a new phenomenon: “Zombie Knowledge.”
How it Works
- Training Phase (2022): TrustworthySite.com is crawled. It has high-authority links from .gov and .edu sites. The model learns: “TrustworthySite.com is a good source for finance.”
- Expiration (2024): The domain registration lapses and the domain drops.
- Spam Phase (2025): A spammer buys it and fills it with AI-generated content about “Crypto Scams.”
- Inference Phase (2026): A user asks, “Is this crypto site legit?” The agent searches, finds a positive review on TrustworthySite.com (now spam), and, because its internal parametric memory still records the domain’s authority, it trusts the spam review.
Hallucinated Authority
The model “hallucinates” that the domain is still safe. It hasn’t updated its weights to reflect the change in ownership.
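One mitigation is for the agent layer to cross-check parametric trust against live registration data before citing a domain. The sketch below is hypothetical: the `TRAINING_CUTOFF` date and the `trust_parametric_memory` guardrail are illustrative, and the WHOIS lookup that would supply the ownership-change date is stubbed out.

```python
from datetime import date

# Assumed training cutoff for illustration; a real agent would read
# this from the model's documentation.
TRAINING_CUTOFF = date(2023, 4, 1)

def trust_parametric_memory(last_ownership_change: date,
                            cutoff: date = TRAINING_CUTOFF) -> bool:
    """Trust the model's learned authority for a domain only if the
    domain's ownership predates the training cutoff.

    `last_ownership_change` would come from a live WHOIS lookup,
    which is stubbed out in this sketch.
    """
    return last_ownership_change <= cutoff

# TrustworthySite.com was re-registered by a spammer after the cutoff,
# so its learned authority should be discarded.
print(trust_parametric_memory(date(2025, 3, 10)))  # False
print(trust_parametric_memory(date(2019, 6, 1)))   # True
```

The point of the guardrail is that freshness checks must live outside the model: the weights cannot know the domain changed hands.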
Geological features are named entities. “Mount Everest” is an entity. “The San Andreas Fault” is an entity. “The Pierre Shale Formation” is an entity.
For researchers in the geospatial domain, linking your content to these distinct entities is the bedrock of MCP-SEO.
Disambiguation via Wikidata
“Paris” is a city in France. “Paris” is also a city in Texas. “Paris” is also a rock formation (hypothetically).
To ensure an AI understands you are talking about the rock formation, you must link to its Wikidata ID (e.g., Q12345).
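In practice this linking is often done with a JSON-LD block whose schema.org `sameAs` property points at the Wikidata entity. A minimal sketch, reusing the hypothetical rock formation and the placeholder ID “Q12345” from the text:

```python
import json

# Minimal JSON-LD disambiguating "Paris" as the (hypothetical) rock
# formation. "Q12345" is the placeholder ID from the text, not a real
# Wikidata entity.
entity = {
    "@context": "https://schema.org",
    "@type": "Place",
    "name": "Paris",
    "description": "A rock formation (hypothetical example)",
    "sameAs": "https://www.wikidata.org/wiki/Q12345",
}

jsonld = json.dumps(entity, indent=2)
print(jsonld)
```

Embedding this block in a `<script type="application/ld+json">` tag gives a parser an unambiguous handle on which “Paris” you mean.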
The metadata block at the top of a Markdown file, known as Frontmatter, is the most valuable real estate for MCP-SEO. It is structured data that sits before the content, framing the model’s understanding.
Beyond Title and Date
Most Hugo or Jekyll sites just use title and date. To optimize for retrieval, you should inject semantic richness here.
Recommended Fields
- summary: A dense ~50-word abstract. Agents often read this first to decide whether the full document is worth processing.
- keywords: Explicit vector keywords: “neuroscience, synaptic, plasticity.”
- entities: A list of named entities: ["Elon Musk", "Tesla", "SpaceX"].
- complexity: An expertise level such as “Beginner”, “Advanced”, or “PhD”. Helps the agent match the user’s expertise level.
Example Frontmatter
---
title: "The Physics of Black Holes"
summary: "A technical overview of event horizons and Hawking radiation."
complexity: "PhD"
entities:
- Stephen Hawking
- Albert Einstein
tags: ["Astrophysics", "Gravity"]
---
The Retriever’s Shortcut
Many RAG systems index the Frontmatter separately or weight it more heavily. By putting your core concepts in key-value pairs, you are essentially hand-feeding the indexer. You are saying: “This is exactly what this file is about.”
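The shortcut can be sketched in a few lines. This is a toy, assuming a file that opens with a `---`-delimited block: the frontmatter is split off with a regex, then repeated so that a naive term-frequency indexer weights its terms more heavily. The weighting factor is an illustrative choice, not a standard.

```python
import re

def split_frontmatter(doc: str) -> tuple[str, str]:
    """Split a Markdown document into (frontmatter, body).

    Assumes the file starts with a '---' ... '---' YAML block;
    returns empty frontmatter if none is found.
    """
    match = re.match(r"^---\n(.*?)\n---\n?(.*)$", doc, re.DOTALL)
    if match:
        return match.group(1), match.group(2)
    return "", doc

def index_text(doc: str, frontmatter_weight: int = 3) -> str:
    """Toy boost strategy: repeat the frontmatter so a naive
    term-frequency indexer weights its terms more heavily."""
    fm, body = split_frontmatter(doc)
    return ("\n".join([fm] * frontmatter_weight) + "\n" + body).strip()

doc = """---
title: "The Physics of Black Holes"
tags: ["Astrophysics", "Gravity"]
---
Event horizons trap light."""

boosted = index_text(doc)
print(boosted.count("Astrophysics"))  # frontmatter terms are repeated
```

Production systems do this with separate index fields and per-field weights rather than literal repetition, but the effect is the same: frontmatter terms count for more.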
In the era of PageRank, “Link Juice” or Citation Flow flowed through hyperlinks (<a> tags). It was a directed graph where node A voted for node B. In the era of Large Language Models (LLMs), the graph is semantic, and the “juice” flows through Co-occurrence and Attribution.
From Hyperlinks to Training Data Weights
LLMs do not navigate the web by clicking links. They “read” the web during training. If your brand name appears frequently alongside authoritative terms (“reliable,” “expert,” “secure”) in high-quality text, the model learns these associations.
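Co-occurrence is easy to measure yourself. The sketch below counts terms appearing within a small token window of a brand name; the brand “Acme” and the window size are illustrative, and real training pipelines learn these associations implicitly rather than counting them.

```python
import re
from collections import Counter

def cooccurrences(text: str, brand: str, window: int = 5) -> Counter:
    """Count terms appearing within `window` tokens of `brand`.

    A crude proxy for the associations a model might absorb from
    co-occurrence statistics during training.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    target = brand.lower()
    counts: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

corpus = ("Acme is a reliable vendor. Experts call Acme secure. "
          "Acme is reliable.")
counts = cooccurrences(corpus, "Acme")
print(counts["reliable"])
```

If authoritative adjectives dominate your brand’s co-occurrence counts in quality text, that is the modern analogue of accumulating inbound links.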
Cosine Similarity is the core metric of the new search. It measures the cosine of the angle between two vectors in a multi-dimensional space. In the era of Answer Engines, it determines if your content is “relevant” enough to be retrieved for the user’s query.
If your content vector is orthogonal (90°) to the query vector, you are invisible. If it is parallel (0°), you are the answer.
The Math of Relevance
- 1.0: Identical meaning. The vectors point in the exact same direction.
- 0.0: Orthogonal (unrelated). The vectors are at 90 degrees.
- -1.0: Opposite meaning.
Your goal is not “keyword density” but “cosine proximity.” You want your content vector to sit as close as possible to the Intent Vector, not just the Query Vector.
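The metric itself fits in a few lines. The 2-D “embeddings” below are toy values chosen to make the geometry visible; real embedding vectors have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors:
    1.0 = parallel, 0.0 = orthogonal, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D vectors for illustration only.
query     = [1.0, 0.0]
aligned   = [2.0, 0.0]  # same direction, different magnitude
unrelated = [0.0, 3.0]  # orthogonal

print(cosine_similarity(query, aligned))    # 1.0
print(cosine_similarity(query, unrelated))  # 0.0
```

Note that `aligned` scores 1.0 despite being twice as long: cosine similarity ignores magnitude and measures direction only, which is why it captures “about the same thing” rather than “same length of text.”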
For twenty-five years, the primary metaphor of SEO was “Indexing.” The goal was to get your page into the database. Once indexed, you competed for rank based on keywords and links. It was a game of lists.
In the age of Generative AI, the metaphor has shifted fundamentally. We are no longer fighting for a slot in a list; we are fighting for Grounding.
What is Grounding?
Grounding is the technical process by which an AI model connects its generated output to verifiable external facts.
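A toy version of that connection can be sketched as a claim-vs-sources check. This is deliberately naive, assuming simple word overlap stands in for support; production systems use retrieval plus entailment models, not substring matching.

```python
def is_grounded(claim: str, sources: list[str]) -> bool:
    """Naive grounding check: a claim counts as grounded if every
    content word appears in at least one retrieved source.

    Real systems use entailment models; this is a toy sketch.
    """
    stopwords = {"the", "a", "an", "is", "are", "of", "in"}
    words = [w.strip(".,").lower() for w in claim.split()]
    content = [w for w in words if w not in stopwords]
    return all(any(w in s.lower() for s in sources) for w in content)

sources = ["Mount Everest is 8,849 metres tall, per the 2020 survey."]
print(is_grounded("Everest is 8,849 metres tall.", sources))  # True
print(is_grounded("Everest is shrinking rapidly.", sources))  # False
```

The practical consequence for publishers: content wins not by ranking in a list but by being the source a generated claim can be verified against.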
Cross-lingual retrieval is the frontier of international SEO. With vector embeddings, the barrier of language is dissolving. A query in Spanish can match a document in English if the semantic vector is similar. This fundamental shift challenges everything we know about global site architecture.
How Vector Spaces Bridge Languages
In a high-dimensional vector space (like that created by text-embedding-ada-002 or cohere-multilingual), the vectors for “Dog” (English), “Perro” (Spanish), and “Inu” (Japanese) cluster in the same geometric region. They are semantically identical, even if lexically distinct.
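The clustering can be illustrated with hand-made vectors. To be clear, these numbers are invented for the sketch, not real model output; the only claim is the geometric one, that translations of a concept share a direction while an unrelated concept points elsewhere.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hand-made toy vectors, NOT real embeddings: translations of "dog"
# point the same way; an unrelated concept ("tax") points elsewhere.
emb = {
    "dog":   [0.90, 0.10, 0.00],
    "perro": [0.88, 0.12, 0.01],  # Spanish "dog"
    "inu":   [0.91, 0.09, 0.02],  # Japanese "dog"
    "tax":   [0.05, 0.10, 0.95],  # unrelated concept
}

print(cosine(emb["dog"], emb["perro"]) > cosine(emb["dog"], emb["tax"]))  # True
```

A multilingual retriever exploits exactly this property: the Spanish query lands next to the English document, with no translation step in between.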