While robots.txt tells a crawler where it can go, llms.txt tells an agent what it should know. It is the first step in “Prompt Engineering via Protocol.” By hosting this file, you are essentially pre-prompting every AI agent that visits your site before it even ingests your content.
This standard is rapidly gaining traction among developers who want to control how their documentation and content are consumed by coding assistants and research bots.
The ethical debate around AI training data is fierce. “They stole our content!” is the cry of publishers. “It was fair use!” is the retort of AI labs. CATS (Content Authorization & Transparency Standard) is the technical solution to this legal standoff.
Implementing CATS is not just about blocking bots; it is about establishing a contract.
The CATS Workflow
- Discovery: The agent checks /.well-known/cats.json or cats.txt at the root.
- Negotiation: The agent parses your policy.
  - “Can I index this?” -> Yes.
  - “Can I train on this?” -> No.
  - “Can I display a snippet?” -> Yes, max 200 chars.
  - “Do I need to pay?” -> Check the pricing object.
- Compliance: The agent (if ethical) respects these boundaries.
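The negotiation step above can be sketched in a few lines of Python. Note that CATS is a proposed standard: the JSON field names below (`permissions`, `index`, `train`, `snippet`, `max_chars`, `pricing`) are illustrative assumptions, not a published schema.

```python
import json

# Hypothetical CATS policy — field names are illustrative, not a ratified schema.
SAMPLE_POLICY = json.loads("""
{
  "version": "1.0",
  "permissions": {
    "index": true,
    "train": false,
    "snippet": {"allowed": true, "max_chars": 200}
  },
  "pricing": {"train": {"model": "per-request", "currency": "USD"}}
}
""")

def may_index(policy):
    """Negotiation: may the agent add this content to a search index?"""
    return bool(policy.get("permissions", {}).get("index", False))

def may_train(policy):
    """Negotiation: may the agent use this content as training data?"""
    return bool(policy.get("permissions", {}).get("train", False))

def snippet_limit(policy):
    """Return the maximum snippet length in characters, or 0 if snippets are disallowed."""
    snippet = policy.get("permissions", {}).get("snippet", {})
    return snippet.get("max_chars", 0) if snippet.get("allowed") else 0

print(may_index(SAMPLE_POLICY))     # index: yes
print(may_train(SAMPLE_POLICY))     # train: no
print(snippet_limit(SAMPLE_POLICY)) # snippets: up to 200 chars
```

In a real agent, the compliance step would gate the crawler’s behavior on these answers rather than merely printing them.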
Signaling “Cooperative Node” Status
Search engines of the future will constitute a “Web of Trust.” Sites that implement CATS are signaling that they are “Cooperative Nodes”: they are providing clear metadata about their rights.
A comprehensive guide to scraping without getting blocked. We cover User-Agent protocols, robots.txt parsing libraries, safe crawl rates, and the ethical controls that define a ‘Good Bot’ in the Agentic Era.
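The ‘Good Bot’ basics mentioned above are already available in the Python standard library. The sketch below parses a sample robots.txt offline with `urllib.robotparser` and checks both fetch permission and crawl rate; the bot name and URLs are placeholders (in practice you would point `set_url()` at the live robots.txt and call `read()`).

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rfp = RobotFileParser()
rfp.parse(ROBOTS_TXT.splitlines())

# A Good Bot identifies itself and asks before fetching.
print(rfp.can_fetch("MyResearchBot", "https://example.com/articles/1"))  # allowed
print(rfp.can_fetch("MyResearchBot", "https://example.com/private/x"))   # disallowed
print(rfp.crawl_delay("MyResearchBot"))  # seconds to wait between requests
```

A polite crawler sleeps for the `crawl_delay` value between requests and skips any URL for which `can_fetch` returns `False`.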
For nearly three decades, the robots.txt file has served as the internet’s “Keep Out” sign. It is a binary, blunt instrument: Allow or Disallow. Crawlers either respect it or they don’t. However, as we enter the age of the Agentic Web, this binary distinction is no longer sufficient. We need a protocol that can express nuance, permissions, licenses, and economic terms. We need CATS (Content Authorization & Transparency Standard), often implemented as cats.txt or authorized_agents.json.
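To make the contrast with robots.txt concrete, here is what a cats.txt might look like. This is purely illustrative: CATS has no single ratified syntax, so every directive name below is a hypothetical example of the kind of nuance the standard aims to express.

```
# cats.txt — illustrative only; directive names are hypothetical
User-agent: *
Index: allow
Train: deny
Snippet: allow; max-chars=200
License: https://example.com/content-license
Pricing: /.well-known/cats.json#pricing
```

Where robots.txt can only say Allow or Disallow, a file like this can distinguish indexing from training, cap snippet length, and point to license and pricing terms.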
The /llms.txt standard is rapidly emerging as the robots.txt for the Generative AI era. While robots.txt was designed for search spiders (crawling links), llms.txt is designed for reasoning engines (ingesting knowledge). They serve different masters and require different strategies.
The Difference in Intent
- Robots.txt: “Don’t overload my server.” / “Don’t confirm this duplicate URL.” (Infrastructure Focus)
- Llms.txt: “Here is the most important information.” / “Here is how to cite me.” / “Ignore the footer.” (Information Focus)
Content of the File
A robust llms.txt shouldn’t just be a list of Allow/Disallow rules. It should be a map of your Core Knowledge.
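A minimal sketch of such a knowledge map, following the proposed llms.txt format (a Markdown file with an H1 title, a blockquote summary, and H2 sections of annotated links); the project name and URLs are placeholders:

```markdown
# Example Project

> Concise hosted docs for the Example Project API and SDKs.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): How to install and make a first request
- [API Reference](https://example.com/docs/api.md): Endpoints, parameters, and authentication

## Optional

- [Changelog](https://example.com/changelog.md): Release history
```

The per-link annotations matter: they tell a reasoning engine what each document is for, so it can ingest the Quickstart for a “how do I start” question instead of crawling the whole site.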