We have spent the last decade complaining about “Crawled - currently not indexed.” We treat it as a failure state. We treat it as a bug.
But in the Agentic Web of 2025, “Indexation” is not the goal. “Retrieval” is the goal.
And paradoxically, to maximize Retrieval, you often need to minimize Indexation.
The Information Density Argument
LLMs (Large Language Models) and Search Agents operate on Information Density. They want the highest signal-to-noise ratio possible.
When you allow Google to index 10,000 low-value pages—tag archives, filtered category pages, thin “Location” pages, paginated comments—you are diluting your domain’s density. You are effectively serving the agent a watered-down soup instead of a concentrated broth.
“Crawled - Not Indexed” is actually Google doing you a favor. It is identifying the noise and filtering it out for you.
The problem is that you let Googlebot crawl that noise in the first place.
Stop Feeding the Beast Junk Food
Every time Googlebot crawls a low-value URL on your site, you are burning “Crawl Budget.” But more importantly, you are burning Crawl Attention.
If Googlebot spends 80% of its resources on your site parsing parameterized URLs (?sort=price&color=blue), it has only 20% left for your core content. This is why your profound article on “The Future of AI” isn’t getting indexed immediately: the bot is busy choking on your faceted navigation.
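If faceted navigation is the culprit, the fix usually starts in robots.txt. Here is a minimal sketch, assuming your facet parameters are called sort and color and your internal search lives under /search/ (substitute your own parameter names and paths):

```
# robots.txt — keep crawlers out of faceted / parameterized URL space
# (the parameter names below are placeholders; swap in your own)
User-agent: *
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?color=
Disallow: /*&color=
Disallow: /search/
```

Note the division of labor: Disallow saves crawl attention by keeping Googlebot away from those URL patterns entirely, while noindex (covered below) is what keeps a page that remains crawlable out of the index.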
The Strategic “Noindex”
Instead of fighting to get “Crawled - Not Indexed” pages into the index, you should be fighting to get them out of the crawl queue entirely.
Your goal should be a 100% Indexation Rate for a very small set of high-value pages.
- If you have 5,000 pages and 500 are indexed, you have a 10% efficiency rating.
- If you prune/noindex 4,000 of those pages, you have 1,000 pages and 500 indexed. Now you have a 50% efficiency rating.
- The algorithm notices this efficiency. A site where most crawlable URLs actually end up indexed gets crawled more often and has new content indexed faster (the sketch after this list walks through the math).
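To make those numbers concrete, here is a throwaway Python sketch of the indexation-efficiency ratio, using the hypothetical page counts from the list above:

```python
def indexation_efficiency(indexed_pages: int, crawlable_pages: int) -> float:
    """Share of crawlable URLs that Google actually keeps in its index."""
    return indexed_pages / crawlable_pages

# Before pruning: 5,000 crawlable URLs, 500 indexed.
print(f"Before: {indexation_efficiency(500, 5_000):.0%}")   # Before: 10%

# After noindexing/pruning 4,000 low-value URLs: 1,000 remain, 500 indexed.
print(f"After:  {indexation_efficiency(500, 1_000):.0%}")   # After:  50%
```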
Indexing is a Liability
Every indexed page is a liability.
- Maintenance Liability: You have to update it.
- Cannibalization Liability: It might compete with your money pages.
- Quality Liability: If it’s low quality, it drags down the domain-wide quality score.
In 2025, we should treat “Indexing” as a privilege we grant only to our best content. For everything else—login pages, thank you pages, internal search results, tag clouds—we should be aggressively using noindex or robots.txt disallow rules.
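For those page types, the mechanics are a one-line change per template. A minimal sketch of the meta robots tag the article refers to (the follow value is a judgment call: it keeps the links on the page crawlable even though the page itself stays out of the index):

```html
<!-- In the <head> of thank-you pages, internal search results, tag archives, etc. -->
<meta name="robots" content="noindex, follow">
```

For non-HTML files or site-wide rules, the same directive can be delivered as an X-Robots-Tag: noindex HTTP response header instead.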
The “Agent-First” Site Architecture
An Agent-First architecture is inverted.
- Default State: noindex.
- Exception: High-Value Content (Articles, Core Products, About Us).
By adopting this stance, you stop worrying about “Crawled - Not Indexed.” You start defining exactly what should be indexed. You take control of the graph.
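One way to wire up that inverted default is an allowlist at the response layer. A sketch as a Flask after_request hook (Flask and the route prefixes are assumptions here; the same pattern works in any framework or at the CDN): every response carries X-Robots-Tag: noindex unless its path belongs to an explicitly indexable section.

```python
from flask import Flask, request

app = Flask(__name__)

# Only these sections are granted indexation; everything else is noindexed by default.
INDEXABLE_PREFIXES = ("/articles/", "/products/", "/about")

@app.after_request
def agent_first_robots(response):
    path = request.path
    if not any(path.startswith(prefix) for prefix in INDEXABLE_PREFIXES):
        # Default state: keep the page out of the index, but let link equity flow.
        response.headers["X-Robots-Tag"] = "noindex, follow"
    return response
```

Because the allowlist is the only place where indexation is granted, a new template or URL pattern cannot silently expand your indexable surface.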
Why Agents Prefer “Small and Dense”
When an agent like a custom GPT or a Search Generative Experience (SGE) bot processes your site, it has a limited context window. It cannot “read” your entire 10,000-page site in one pass.
If your sitemap is bloated with low-value URLs, the agent might fill its context window with junk before it finds the gold.
By aggressively pruning, you ensure that any random sample of your site yields high-value tokens. You are optimizing for Token Quality, not Page Count.
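A quick way to measure how much of your sitemap is junk is to score each URL against the low-value patterns described above. A rough sketch (the regex patterns and the local sitemap.xml path are assumptions; tune them to your own URL scheme):

```python
import re
import xml.etree.ElementTree as ET

# Patterns that usually mark low-value URLs: facet parameters, tag archives, pagination.
LOW_VALUE = re.compile(r"(\?|&)(sort|color|page)=|/tag/|/page/\d+", re.IGNORECASE)

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
tree = ET.parse("sitemap.xml")  # assumed local copy of your sitemap
urls = [loc.text.strip() for loc in tree.findall(".//sm:loc", NS)]

junk = [u for u in urls if LOW_VALUE.search(u)]
print(f"{len(junk)} of {len(urls)} sitemap URLs match low-value patterns")
for u in junk[:10]:
    print("  prune or noindex:", u)
```

Anything the script flags is a candidate for noindex, a robots.txt disallow, or removal from the sitemap altogether.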
Conclusion: Embrace the Purge
The next time you look at your “Crawled - Currently Not Indexed” report, do not ask: “How do I get these indexed?”
Ask: “Why did I let Google find these in the first place?”
Then go to your robots.txt and block them, or go to your meta robots tags and set them to noindex. Use one mechanism per URL: a page blocked in robots.txt can never have its noindex tag read, so disallow the URLs Google should not fetch at all, and noindex the ones it may fetch but must never index.
Shrink your site to grow your influence.