While robots.txt tells a crawler where it can go, llms.txt tells an agent what it should know. It is the first step in “Prompt Engineering via Protocol.” By hosting this file, you are essentially pre-prompting every AI agent that visits your site before it even ingests your content.
This standard is rapidly gaining traction among developers who want to control how their documentation and content are consumed by coding assistants and research bots.
The ethical debate around AI training data is fierce. “They stole our content!” is the cry of publishers. “It was fair use!” is the retort of AI labs. CATS (Content Authorization & Transparency Standard) is the technical solution to this legal standoff.
Implementing CATS is not just about blocking bots; it is about establishing a contract.
The CATS Workflow
- Discovery: The agent checks /.well-known/cats.json or cats.txt at the root.
- Negotiation: The agent parses your policy.
  - “Can I index this?” -> Yes.
  - “Can I train on this?” -> No.
  - “Can I display a snippet?” -> Yes, max 200 chars.
  - “Do I need to pay?” -> Check the pricing object.
- Compliance: The agent (if ethical) respects these boundaries.
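The negotiation step above can be sketched in a few lines of Python. Note that CATS is a proposed standard: the JSON field names below (`permissions`, `index`, `train`, `snippet`, `max_chars`, `pricing`) are illustrative assumptions, not a published schema.

```python
import json

# Hypothetical CATS policy — field names are illustrative, not a ratified schema.
SAMPLE_POLICY = json.loads("""
{
  "version": "1.0",
  "permissions": {
    "index": true,
    "train": false,
    "snippet": {"allowed": true, "max_chars": 200}
  },
  "pricing": {"train": {"model": "per-request", "currency": "USD"}}
}
""")

def may_index(policy):
    """Negotiation: may the agent add this content to a search index?"""
    return bool(policy.get("permissions", {}).get("index", False))

def may_train(policy):
    """Negotiation: may the agent use this content as training data?"""
    return bool(policy.get("permissions", {}).get("train", False))

def snippet_limit(policy):
    """Return the maximum snippet length in characters, or 0 if snippets are disallowed."""
    snippet = policy.get("permissions", {}).get("snippet", {})
    return snippet.get("max_chars", 0) if snippet.get("allowed") else 0

print(may_index(SAMPLE_POLICY))     # index: yes
print(may_train(SAMPLE_POLICY))     # train: no
print(snippet_limit(SAMPLE_POLICY)) # snippets: up to 200 chars
```

In a real agent, the compliance step would gate the crawler’s behavior on these answers rather than merely printing them.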
Signaling “Cooperative Node” Status
Search engines of the future will constitute a “Web of Trust.” Sites that implement CATS are signaling that they are “Cooperative Nodes”: they are providing clear metadata about their rights.
A comprehensive guide to scraping without getting blocked. We cover User-Agent protocols, robots.txt parsing libraries, safe crawl rates, and the ethical controls that define a ‘Good Bot’ in the Agentic Era.
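The ‘Good Bot’ basics mentioned above are already available in the Python standard library. The sketch below parses a sample robots.txt offline with `urllib.robotparser` and checks both fetch permission and crawl rate; the bot name and URLs are placeholders (in practice you would point `set_url()` at the live robots.txt and call `read()`).

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rfp = RobotFileParser()
rfp.parse(ROBOTS_TXT.splitlines())

# A Good Bot identifies itself and asks before fetching.
print(rfp.can_fetch("MyResearchBot", "https://example.com/articles/1"))  # allowed
print(rfp.can_fetch("MyResearchBot", "https://example.com/private/x"))   # disallowed
print(rfp.crawl_delay("MyResearchBot"))  # seconds to wait between requests
```

A polite crawler sleeps for the `crawl_delay` value between requests and skips any URL for which `can_fetch` returns `False`.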
For nearly three decades, the robots.txt file has served as the internet’s “Keep Out” sign. It is a binary, blunt instrument: Allow or Disallow. Crawlers either respect it or they don’t. However, as we enter the age of the Agentic Web, this binary distinction is no longer sufficient. We need a protocol that can express nuance, permissions, licenses, and economic terms. We need CATS (Content Authorization & Transparency Standard), often implemented as cats.txt or authorized_agents.json.
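To make the contrast with robots.txt concrete, here is what a cats.txt might look like. This is purely illustrative: CATS has no single ratified syntax, so every directive name below is a hypothetical example of the kind of nuance the standard aims to express.

```
# cats.txt — illustrative only; directive names are hypothetical
User-agent: *
Index: allow
Train: deny
Snippet: allow; max-chars=200
License: https://example.com/content-license
Pricing: /.well-known/cats.json#pricing
```

Where robots.txt can only say Allow or Disallow, a file like this can distinguish indexing from training, cap snippet length, and point to license and pricing terms.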
The /llms.txt standard is rapidly emerging as the robots.txt for the Generative AI era. While robots.txt was designed for search spiders (crawling links), llms.txt is designed for reasoning engines (ingesting knowledge). They serve different masters and require different strategies.
The Difference in Intent
- Robots.txt: “Don’t overload my server.” / “Don’t confirm this duplicate URL.” (Infrastructure Focus)
- Llms.txt: “Here is the most important information.” / “Here is how to cite me.” / “Ignore the footer.” (Information Focus)
Content of the File
A robust llms.txt shouldn’t just be a list of Allow/Disallow rules. It should be a map of your Core Knowledge.
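A minimal sketch of such a knowledge map, following the proposed llms.txt format (a Markdown file with an H1 title, a blockquote summary, and H2 sections of annotated links); the project name and URLs are placeholders:

```markdown
# Example Project

> Concise hosted docs for the Example Project API and SDKs.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): How to install and make a first request
- [API Reference](https://example.com/docs/api.md): Endpoints, parameters, and authentication

## Optional

- [Changelog](https://example.com/changelog.md): Release history
```

The per-link annotations matter: they tell a reasoning engine what each document is for, so it can ingest the Quickstart for a “how do I start” question instead of crawling the whole site.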