If you mention “PageRank” to a modern SEO, you are likely to get an eye-roll. It is the zombie concept of our industry—a term from 1998 that refuses to die, despite Google hiding the Toolbar score in 2013 and constantly telling us that “links are just one of many signals.”

But here is the irony: Just as traditional search engines are moving away from raw link counting, the Agentic Web is embracing it with renewed vigor.

In the era of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), PageRank has found a new, more powerful life. It is no longer just about ranking a document on a results page; it is about determining the weight of that document in the training corpus of a model. It determines whether a URL is crawled by CommonCrawl, whether it makes it into the C4 dataset, and ultimately, whether the model “knows” what you wrote when a user asks a question.

In this deep dive, we will explore the mathematical origins of PageRank, its evolution into sophisticated variants like TrustRank, and its critical, often invisible role in the training and grounding of modern AI systems.

1. The Origins: The 1998 Paper and the Random Surfer

To understand where we are going, we must understand where we started. In 1998, Larry Page and Sergey Brin published “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. At the time, search engines like AltaVista and Excite ranked pages primarily based on keyword density (how many times “cars” appeared on the page). This was easily spammed.

Page and Brin had a different idea. They viewed the web as a graph.

The Intuition

They modeled the behavior of a “Random Surfer”—a hypothetical user who starts on a random page and clicks on links at random.

  • If a page has many links pointing to it, the Random Surfer is more likely to end up there.
  • If a page is linked to by an important page (where the surfer spends a lot of time), that link is worth more.
  • At any moment, the surfer might get bored and “teleport” to a completely random page (this is the damping factor, usually set to 0.85).

The Math

The classic PageRank formula for a page $A$ is:

$$PR(A) = (1 - d) + d \left( \frac{PR(T1)}{C(T1)} + \dots + \frac{PR(Tn)}{C(Tn)} \right)$$

Where:

  • $PR(A)$ is the PageRank of page A.
  • $d$ is the damping factor (usually 0.85).
  • $T1 \dots Tn$ are the pages that link to page A.
  • $C(T1)$ is the number of outbound links on page T1.

This formula is recursive. To know the PageRank of A, you need the PageRank of T1. To know T1, you need its inbound links. This is solved using an iterative process (power iteration) until the scores converge.

Example Calculation: Imagine a mini-web with 4 pages: A, B, C, and D.

  • A links to B.
  • B links to C.
  • C links to A, B, and D.
  • D links to A.
  • $d = 0.5$ (for simplicity).

Iteration 1 (Initialize all to 1):

  • $PR(A) = 1$
  • $PR(B) = 1$
  • $PR(C) = 1$
  • $PR(D) = 1$

Iteration 2:

  • $PR(A) = 0.5 + 0.5 * (PR(C) / 3 + PR(D) / 1) = 0.5 + 0.5 * (0.33 + 1) \approx 1.17$
  • $PR(B) = 0.5 + 0.5 * (PR(A) / 1 + PR(C) / 3) = 0.5 + 0.5 * (1 + 0.33) \approx 1.17$
  • $PR(C) = 0.5 + 0.5 * (PR(B) / 1) = 0.5 + 0.5 * 1 = 1.0$
  • $PR(D) = 0.5 + 0.5 * (PR(C) / 3) = 0.5 + 0.5 * 0.33 \approx 0.67$

Iteration 3:

  • $PR(A) = 0.5 + 0.5 * (PR(C) / 3 + PR(D) / 1) = 0.5 + 0.5 * (0.33 + 0.67) = 1.0$
  • $PR(B) = 0.5 + 0.5 * (PR(A) / 1 + PR(C) / 3) = 0.5 + 0.5 * (1.17 + 0.33) = 1.25$
  • $PR(C) = 0.5 + 0.5 * (PR(B) / 1) = 0.5 + 0.5 * 1.17 \approx 1.09$
  • $PR(D) = 0.5 + 0.5 * (PR(C) / 3) = 0.5 + 0.5 * 0.33 \approx 0.67$

Final Output (Converged):

  • $PR(A) \approx 1.02$
  • $PR(B) \approx 1.20$
  • $PR(C) \approx 1.10$
  • $PR(D) \approx 0.68$

```mermaid
graph LR
    A[A: 1.02] --> B[B: 1.20]
    B --> C[C: 1.10]
    C --> A
    C --> B
    C --> D[D: 0.68]
    D --> A
```
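The iterations above can be reproduced in a few lines. A minimal sketch of power iteration over the same 4-page mini-web, using d = 0.5 as in the example:

```python
# Power iteration for the classic (unnormalized) formula used above:
# PR(p) = (1 - d) + d * sum(PR(q) / C(q)) over pages q linking to p.

links = {  # page -> pages it links to (the 4-page mini-web)
    "A": ["B"],
    "B": ["C"],
    "C": ["A", "B", "D"],
    "D": ["A"],
}
d = 0.5  # damping factor (0.85 in the original paper; 0.5 as in the example)

pr = {page: 1.0 for page in links}  # Iteration 1: initialize all scores to 1
for _ in range(50):                 # iterate until the scores converge
    pr = {
        page: (1 - d) + d * sum(
            pr[q] / len(links[q]) for q in links if page in links[q]
        )
        for page in links
    }

print({page: round(score, 2) for page, score in pr.items()})
# -> {'A': 1.02, 'B': 1.2, 'C': 1.1, 'D': 0.68}
```

Fifty iterations are far more than needed here; with d = 0.5, the error roughly halves each round.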

In a real network (millions of nodes), high-quality nodes accumulate score, while “link farms” (isolated clusters) eventually starve due to the damping factor.

2. Evolution: From Global Authority to Topical Trust

The original PageRank was “global.” It assumed that a link from The New York Times was valuable regardless of the topic. If The NYT linked to a generic casino site, that casino site would rank for “poker.” This was the era of “Google Bombing.”

To fix this, search engines evolved the algorithm.

Topic-Sensitive PageRank

Proposed by Haveliwala in 2002, this variation modifies the “teleportation” probability. Instead of jumping to any random page, the surfer is more likely to jump to a page within a specific topic (e.g., Sports, Health, Tech). This creates a “bias” vector. A link from a tech blog is now worth more to another tech site than it is to a cooking site.
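The biased teleport vector can be sketched with the normalized (probabilistic) form of PageRank; the pages, links, and the "tech" topic set below are invented for illustration:

```python
# Topic-sensitive PageRank sketch (normalized form): the teleport jump lands
# only on pages in the topic set instead of uniformly at random.
# The pages, links, and the "tech" topic set are invented.

links = {
    "tech_blog": ["tech_news"],
    "tech_news": ["tech_blog", "cooking"],
    "cooking":   ["tech_blog"],
}
topic_set = {"tech_blog", "tech_news"}
d = 0.85  # damping factor

# Biased teleport vector v: all jump probability is split across topic pages.
v = {p: (1 / len(topic_set) if p in topic_set else 0.0) for p in links}

pr = {p: 1 / len(links) for p in links}  # start from the uniform distribution
for _ in range(100):
    pr = {
        p: (1 - d) * v[p] + d * sum(
            pr[q] / len(links[q]) for q in links if p in links[q]
        )
        for p in links
    }
# "cooking" can still earn rank through links, but never through teleportation,
# so its score sits below both tech pages.
```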

TrustRank and BadRank

Combating spam required a notion of “Trust.”

  • Seed Set: Humans manually review a set of trusted sites (universities, government sites, major news).
  • Trust Propagation: Trust flows out from these seeds. Good sites link to good sites.
  • BadRank: Conversely, spam sites link to spam sites. If you are close to a known spam node in the graph, your score is penalized.

This evolution is critical because modern LLM training pipelines use a version of TrustRank to filter their datasets.
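Trust propagation reuses the same power-iteration machinery, with the teleport vector concentrated on the seed set. A sketch, with hypothetical site names and seed set:

```python
# TrustRank sketch: trust restarts only at a hand-reviewed seed set and
# attenuates as it flows along out-links. Site names and values are made up.

links = {
    "university.edu": ["good_blog"],
    "good_blog":      ["niche_site"],
    "niche_site":     [],
    "spam_farm":      ["spam_farm2"],
    "spam_farm2":     ["spam_farm"],
}
seeds = {"university.edu"}
d = 0.85

v = {p: (1 / len(seeds) if p in seeds else 0.0) for p in links}  # seed vector

trust = dict(v)
for _ in range(100):
    trust = {
        p: (1 - d) * v[p] + d * sum(
            trust[q] / len(links[q]) for q in links if p in links[q]
        )
        for p in links
    }
# The spam cluster ends with zero trust: no path from any seed reaches it.
```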

3. The Crawl: Deciding What the Models See

Now, let’s fast forward to 2025. When OpenAI (GPT-5), Anthropic (Claude 4), or Google (Gemini 2.0) train their models, they do not just “download the internet.” The internet is mostly garbage—spam, duplicates, low-quality auto-generated text. Training on garbage creates a hallucinating model.

They need a filter. That filter is links.

Determining Crawl Priority

The web is infinite. A crawler like CommonCrawl (the backbone of most open datasets) cannot crawl everything. It uses a specific scheduling algorithm to decide what to fetch next. That algorithm heavily weights URL Authority based on inbound links.

  • Discovery Speed: A page linked from the homepage of Wired.com will be discovered and crawled within minutes. A page on an obscure subdomain with no inbound links might sit for months before a crawler finds it.
  • Refresh Rate: High-PageRank pages are re-crawled more frequently. If you update your content, but you have low authority, the model’s training data might remain stale for years.

Research Note: A study on the CommonCrawl methodology shows that their breadth-first search strategy inherently biases the dataset towards highly connected nodes (the “rich get richer” phenomenon).
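The scheduling idea reduces to a priority queue over the frontier. A toy sketch, with invented URLs and inbound-link counts (real schedulers combine many more signals):

```python
# Toy authority-weighted crawl frontier: URLs with more known inbound links
# are fetched first. URLs and counts are invented for illustration.
import heapq

inbound_links = {
    "https://wired.com/new-article": 5200,
    "https://deep.example.net/orphan-page": 0,
    "https://popular-blog.com/post": 140,
}

# heapq is a min-heap, so negate the authority score to pop the highest first.
frontier = [(-count, url) for url, count in inbound_links.items()]
heapq.heapify(frontier)

crawl_order = [heapq.heappop(frontier)[1] for _ in range(len(frontier))]
# The well-linked page is crawled first; the orphan page waits at the back.
```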

4. Dataset Filtering: PageRank as Gatekeeper

Once the data is downloaded, it goes through a massive filtering pipeline (like the C4 dataset pipeline used for Google’s T5). This is where PageRank becomes a “Gatekeeper.”

Harmonic Centrality and “Quality” Signals

Researchers utilize graph metrics like Harmonic Centrality (closely related to PageRank) to estimate the “quality” of a domain.

  • The C4 Dataset: In the creation of the C4 dataset (Colossal Clean Crawled Corpus), filters were applied to remove “lorem ipsum” and bad words, but more importantly, documents were effectively scored based on their source’s reputation.
  • Wikipedia Citations: Links from Wikipedia are often used as a “Gold Standard” proxy. If a URL is cited in Wikipedia, it is almost guaranteed to be included in the training set with a high weight.

Implication: If your site has zero inbound links from “trusted seeds” (Wikipedia, .edu, major news), there is a high statistical probability that your content will be discarded during the tokenization and filtering phase. It simply won’t be seen as “high quality” enough to justify the GPU cost of training on it.

5. Retrieval: PageRank in the RAG Pipeline

This is the most direct impact on the user experience in 2025. When a user asks an Agent (like ChatGPT or Perplexity) a question, the system often performs a RAG (Retrieval-Augmented Generation) step.

  1. User asks: “What are the effects of PageRank on LLMs?”
  2. Agent searches its index.
  3. Agent retrieves the “top k” chunks of text.
  4. Agent synthesizes an answer.

How does it choose the “top k” chunks? It uses Vector Similarity (how closely the text matches the query). But it also uses Document Importance.

If two chunks have equal semantic similarity, the system needs a tie-breaker. That tie-breaker is almost always a derivative of PageRank or Domain Authority.
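That tie-breaking step can be sketched as a weighted re-ranker; the chunks, scores, and the 0.8/0.2 weights below are all hypothetical:

```python
# Hypothetical re-ranking step: blend vector similarity with a PageRank-style
# authority score. Chunks, scores, and the 0.8/0.2 weights are invented.

chunks = [
    {"source": "high_authority_site", "similarity": 0.91, "authority": 0.88},
    {"source": "low_authority_site",  "similarity": 0.91, "authority": 0.12},
    {"source": "off_topic_site",      "similarity": 0.20, "authority": 0.95},
]

def rerank(chunks, k=2, w_sim=0.8, w_auth=0.2):
    """Order chunks by a weighted blend; authority breaks similarity ties."""
    return sorted(
        chunks,
        key=lambda c: w_sim * c["similarity"] + w_auth * c["authority"],
        reverse=True,
    )[:k]

top = rerank(chunks)
# The first two chunks have identical similarity (0.91); the high-authority
# one ranks first.
```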

The “Citation Bias”

A study on Retrieval-Augmented Generation behavior suggests that models are often fine-tuned to prefer “authoritative” looking contexts. When an LLM cites a source, it is effectively “transferring authority” from that source to its answer.

  • High PageRank Source: The model treats the chunk as fact.
  • Low PageRank Source: The model might treat the chunk as opinion or noise, or ignore it entirely.

SEO Implication: You can have the most perfect, optimized content in the world (perfect “Vector Similarity”), but if your Domain Authority (PageRank) is too low, you will be cut from the context window before the LLM ever reads you. You are “below the fold” of the prompt.

6. SEO Implications: Optimizing for the Agentic Graph

So, what does this mean for an SEO strategy in 2026?

1. Votes for Inclusion, Not Ranking

Stop thinking about links as “votes for ranking.” Think of them as “votes for inclusion.”

  • Goal: Get links from “Seed Sites” (Wikipedia, Government, trusted Industry Hubs).
  • Why: These links ensure your content survives the “Quality Filter” of the training pipeline.
  • Tactic: Digital PR is no longer optional; it is the only way to get into the “Trusted Set.”

2. The Danger of the “Unlinked”

In the past, you could rank a page with zero links if the content was strong enough (via “long-tail” keyword targeting). In the Agentic/LLM world, an unlinked page is effectively invisible. It might never be crawled (due to crawl budget prioritization), and if it is, it will be discarded as “low-quality noise” during dataset cleaning.

Action Item: Eliminate “Orphan Pages” from your site structure. Internal linking is your first line of defense for distributing PageRank to your deep content.

3. Circular Reference Attacks

We are already seeing “Adversarial SEO” where bad actors create circular citation networks to fool RAG systems.

  • Site A cites Site B.
  • Site B cites Site C.
  • Site C cites Site A.
  • All three sites claim “Product X is the best.”
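A sketch of how a graph-based detector might flag such a ring: any site inside a closed citation loop can “reach itself” by following citations. The site names are invented and mirror the three-site scheme above:

```python
# Sketch of ring detection: a site inside a closed citation loop can "reach
# itself" by following citations. Site names are invented and mirror the
# A -> B -> C -> A scheme.

citations = {
    "site_a": ["site_b"],
    "site_b": ["site_c"],
    "site_c": ["site_a"],
    "honest_review": ["site_a"],  # cites into the ring, never cited back
}

def reaches(graph, start, target, seen=None):
    """True if following citation links from start eventually hits target."""
    seen = set() if seen is None else seen
    for nxt in graph.get(start, []):
        if nxt == target:
            return True
        if nxt not in seen:
            seen.add(nxt)
            if reaches(graph, nxt, target, seen):
                return True
    return False

ring_members = [site for site in citations if reaches(citations, site, site)]
# Only the three mutually citing sites can reach themselves.
```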

While simple algorithms might be fooled, modern Graph Neural Networks (GNNs) used by Google and OpenAI are getting better at detecting these “sybil attacks.”

Defense:

  1. Avoid Link Schemes: Do not participate in “Private Blog Networks” (PBNs) or link exchanges. The penalty is not just “lower rankings”; it is “total exclusion from the model’s worldview.”
  2. Disavow Toxic Links: Just as with Google, you must monitor who links to you. If a known “poison node” links to you, disavow it via Google Search Console’s Disavow Links tool (robots.txt cannot do this; it only controls crawling of your own pages).
  3. Diversify Citations: Don’t just get links from one type of site. A natural link profile includes forums, news, blogs, and academic citations. An unnatural profile (only high-DA guest posts) looks statistically improbable to a Graph Neural Network.

7. The Future of PageRank: Agentic Networks

As we look toward 2030, the concept of a “link” is evolving. In the traditional web, a link was a hypertext reference (<a href="https://example.com/page">). In the Agentic Web, a link is an API call, a function invocation, or a data citation.

The “ActionRank”

We hypothesize that future iterations of PageRank will move beyond static document linking to Dynamic Action Linking.

  • Traditional Web: Page A links to Page B.
  • Agentic Web: Agent A calls Tool B.

If a specific MCP (Model Context Protocol) server or API is frequently called by high-reputation agents to solve complex tasks, its “ActionRank” will increase. This will determine which tools are presented to the model in the system prompt. Just as you optimize for backlinks today, you will optimize for Agent Invocations tomorrow.

The Rise of “Negative PageRank” for AI Slop

While TrustRank focused on identifying spam (casinos, pills), future algorithms will focus on identifying slop (low-value AI-generated content). Search engines and model trainers are already developing “entropy-based” classifiers.

  • High Entropy: Human writing, unpredictable, novel ideas.
  • Low Entropy: AI writing, predictable patterns, “delve,” “tapestry.”

If your site is identified as a “Slop Farm,” your PageRank will be discounted, regardless of how many inbound links you have. The effective “d” (damping factor) applied to AI-generated content will be set much lower than for human-verified content.

Conclusion: Value Flow in the Neural Network

PageRank was created to bring order to the chaos of the 1990s web. It used the collective intelligence of human linking behavior to determine value.

Today, as we transition to a web dominated by AI agents, that signal is more precious than ever. In a sea of AI-generated content (which is effectively infinite), Human Curated Links are the only scarcity. They are the “Gold Standard” of truth.

For the SEO, the lesson is clear: The math hasn’t changed, but the stakes have. You are no longer fighting for a position on a list. You are fighting for a synapse in the digital brain.


References & Further Reading: