In the high-stakes poker game of Modern SEO, llms.txt is the competitor’s accidental “tell.”
For two decades, we have scraped sitemaps to understand a competitor’s scale. We have scraped RSS feeds to understand their publishing velocity. But sitemaps are noisy—they contain every tag page, every archive, every piece of legacy drift. They tell you what exists, but they don’t tell you what matters.
The llms.txt file is different. It is a curated, high-stakes declaration of what a website owner believes is their most valuable information. By defining this file, they are explicitly telling OpenAI, Anthropic, and Google: “If you only read 50 pages on my site to answer a user’s question, read these.”
For an SEO, this is pure Competitive Intelligence gold.
If you want to know which products your competitor is pushing into the AI answering engines, or which support docs they believe are critical for reducing hallucinations, you don’t need a crawler. You just need to read their llms.txt.
This guide provides the technical protocols to turn this public declaration of strategy into a structured list of URLs for your competitive analysis.
The Anatomy of a Strategic Signal
Before we write code, we must understand the schema. The official llms.txt specification (proposed by Jeremy Howard of Answer.AI) encourages a Markdown format.
Links in Markdown come in a standard structure: [Link Text](URL).
However, llms.txt files often contain more than just links. They contain context, summaries, and vital sections like “Core Offerings” or “Pricing Methodologies”. A naive regex that simply grabs everything starting with http will over-capture: it sweeps in external references, vendor links, and example URLs along with the pages that actually matter.
We are interested specifically in the target URLs. These are the pages your competitor is betting their Agentic Visibility on.
A standard Markdown link:

```markdown
- [Enterprise Pricing Logic](https://mcp-seo.com/pricing/enterprise)
```

A raw URL (less common, but valid):

```markdown
<https://mcp-seo.com/pricing/enterprise>
```
Our parsers need to be robust enough to handle the former, which is the standard for high-quality llms.txt files, while extracting the strategic intent.
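To see the difference in practice, here is a minimal sketch contrasting a naive “grab every http string” pass with a pattern that only keeps URLs inside Markdown link syntax (the exact pattern is unpacked in the next section). It assumes you have saved a local copy of the competitor’s file as llms.txt; the filename is illustrative.

```bash
# llms.txt: a local copy of the competitor's file (illustrative path).

# Naive pass: grabs every http(s) string, including external references,
# vendor links, and URLs mentioned in prose or example snippets.
grep -oE 'https?://[^) ]+' llms.txt

# Targeted pass: only URLs sitting inside Markdown link syntax [text](url),
# i.e. the pages the site owner deliberately chose to surface.
grep -oP '\[.*?\]\(\K[^)]+' llms.txt
```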
The Linux Bash One-Liner
Sometimes you don’t need a heavy Python environment; you just need a quick audit of a competitor’s domain.
Here is the one-liner that effectively extracts the priority URLs from a remote llms.txt file. We will use https://mcp-seo.com/llms.txt as our target, but this works for any standard file.
curl -sL "https://mcp-seo.com/llms.txt" | grep -oP '\[.*?\]\(\K[^)]+' | sort -u > competitor_priorities.txt && echo "Extracted $(wc -l < competitor_priorities.txt) Priority URLs"
Breaking Down the Protocol
Let’s dissect this command pipe by pipe. We are chaining standard Unix utilities to perform a complex extraction task. This is the “Unix Philosophy” in action—small tools doing one thing well.
| Command | Arguments | Purpose |
|---|---|---|
| `curl` | `-sL` | `-s` (Silent): don’t show the progress bar or error messages. `-L` (Follow Redirects): crucial; if llms.txt redirects, we follow. |
| `grep` | `-oP 'pattern'` | `-o` (Only Matching): print only the matched parts. `-P` (Perl Regex): activates Perl-compatible regular expressions for precise extraction. |
| `sort` | `-u` | (Unique): sorts and deduplicates. If they link to their “Pricing” page 5 times, it’s a huge signal, but we only need to scrape it once. |
| `>` | `file.txt` | Redirect output: saves the intelligence into a file. |
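The same pipeline scales to a batch audit. A minimal sketch, assuming a hypothetical competitors.txt with one domain per line (filenames are illustrative):

```bash
# competitors.txt is a hypothetical input file, one domain per line, e.g.:
#   mcp-seo.com
#   example-rival.com
while read -r domain; do
  curl -sL "https://${domain}/llms.txt" \
    | grep -oP '\[.*?\]\(\K[^)]+' \
    | sort -u > "${domain}_priorities.txt"
  echo "${domain}: $(wc -l < "${domain}_priorities.txt") priority URLs"
done < competitors.txt
```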
The Regex Explained
The heart of this operation is the Perl-compatible regex: `\[.*?\]\(\K[^)]+`.

- `\[.*?\]`: Matches the link-text part of the Markdown, e.g. `[My Link]`. The `?` makes the match non-greedy.
- `\(`: Matches the opening parenthesis of the URL part.
- `\K`: A Perl special sequence that resets the starting point of the reported match. Everything matched before the `\K` is discarded from the output; this is how we avoid printing the link text and the opening parenthesis. It is a look-behind assertion on steroids.
- `[^)]+`: Matches one or more characters that are not a closing parenthesis, effectively capturing the URL up to the closing `)`.
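A quick sanity check: pipe a sample line through the same pattern and confirm that only the URL portion survives the `\K` reset.

```bash
# Only the URL should be printed; the link text and parentheses are discarded.
echo '- [Enterprise Pricing Logic](https://mcp-seo.com/pricing/enterprise)' \
  | grep -oP '\[.*?\]\(\K[^)]+'
# Expected output:
# https://mcp-seo.com/pricing/enterprise
```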
The Python Colab Script
For a more scalable workflow, or for running this across a list of top 10 competitors, a Google Colab notebook is the ideal delivery mechanism.
We will use the Google Colab Forms feature to create a UI where you can simply paste the URL of the competitor’s llms.txt file. The script will then fetch it, parse it, and trigger a download of their priority list.
You can paste the following code directly into a Google Colab cell.
```python
# @title Website LLMS.TXT URL Scraper
# @markdown Enter the URL of the site's llms.txt file to reveal their AI content strategy.
import requests
import re
from google.colab import files
from urllib.parse import urljoin, urlparse

# @markdown ### Configuration
target_url = "https://mcp-seo.com/llms.txt"  # @param {type:"string"}
output_filename = "urls.txt"  # @param {type:"string"}


def fetch_and_extract(url):
    """
    Fetches an llms.txt file and extracts URLs explicitly formatted
    in Markdown link syntax [text](url).
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; Intell-a-Bot/1.0)'
    }
    try:
        response = requests.get(url, allow_redirects=True, headers=headers, timeout=10)
        response.raise_for_status()
        content = response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return []

    # Regex to find standard markdown links: [text](URL).
    # We ignore images (![alt](url)) by ensuring the character before [ is not !
    # This is a robust pattern for standard markdown links.
    markdown_link_pattern = re.compile(r'(?<!\!)\[.*?\]\((.*?)\)')
    urls = markdown_link_pattern.findall(content)

    # Clean and resolve URLs
    cleaned_urls = []
    base_url_parsed = urlparse(url)
    base_url_for_join = f"{base_url_parsed.scheme}://{base_url_parsed.netloc}{base_url_parsed.path}"
    for u in urls:
        u = u.strip()
        if u.startswith('http'):
            cleaned_urls.append(u)
        elif u.startswith('/') or (not urlparse(u).scheme and not urlparse(u).netloc):
            # Resolve a relative or root-relative URL against the base URL
            resolved_url = urljoin(base_url_for_join, u)
            cleaned_urls.append(resolved_url)

    # Deduplicate while preserving order
    return list(dict.fromkeys(cleaned_urls))


print(f"Fetching {target_url}...")
extracted_urls = fetch_and_extract(target_url)

if extracted_urls:
    print(f"Found {len(extracted_urls)} unique strategic URLs.")
    with open(output_filename, 'w') as f:
        for url in extracted_urls:
            f.write(f"{url}\n")
    print(f"Saved to {output_filename}. Triggering download...")
    files.download(output_filename)
else:
    print("No URLs found or unable to fetch the file.")
```
Code Walkthrough
- Forms Integration: The `# @title` and `# @param` comments are interpreted by Google Colab to render a clean user interface.
- User-Agent: We define a custom User-Agent because many servers block requests from the default python-requests client. Identifying yourself as an intelligence bot (here, “Intell-a-Bot”) is honest, though in practice you might spoof a Chrome browser to avoid detection.
- Negative Lookbehind: The regex `(?<!\!)` ensures we do not match images (`![alt](url)`). We want their content strategy, not their logo assets.
- Deduplication: We use `list(dict.fromkeys(cleaned_urls))` instead of `set()` to remove duplicates because it preserves the order of the links. Order matters: the first 10 links in an llms.txt are likely the “Core Identity” links they want LLMs to ingest first.
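One caveat at the shell level: the `sort -u` in the earlier one-liner destroys that ordering signal. If you want order-preserving deduplication in Bash, analogous to `dict.fromkeys()` here, a small awk filter does the job. A sketch; the output filename is illustrative.

```bash
# Order-preserving deduplication: unlike `sort -u`, this keeps the first
# occurrence of each URL in its original position in the file.
curl -sL "https://mcp-seo.com/llms.txt" \
  | grep -oP '\[.*?\]\(\K[^)]+' \
  | awk '!seen[$0]++' > competitor_priorities_ordered.txt
```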
Strategic Comparison: The Triad of Discovery
When you are auditing a competitor, you have three primary sources of truth: The Sitemap, the RSS Feed, and the LLMS.TXT file.
Which one should you use? The answer depends entirely on what you want to know about their strategy.
| Feature | Sitemaps (XML) | RSS Feeds (XML) | LLMS.TXT (Markdown) |
|---|---|---|---|
| Intelligence Value | Site Structure | Content Velocity | AI Content Strategy |
| Audience | Search Bots (Googlebot) | News Readers / Aggregators | AI Agents / LLMs |
| Update Frequency | Slow (Daily/Weekly) | Real-time | Strategic (Pivot Points) |
| Content Scope | Comprehensive (All pages) | Recent (Last 10-50 posts) | Curated (Hero Content) |
| Signal-to-Noise | Low (Flooded with archives) | Medium (Flooded with news) | Very High (Only value) |
| SEO Insight | “How big is their site?” | “How often do they post?” | “What do they want AI to know?” |
The Verdict
If you simply want to know how many pages they have, download their Sitemap. It’s the phone book.
If you want to know if they are active, check their RSS Feed. It’s the daily news.
But if you want to know their Agentic SEO Strategy, scraping their LLMS.TXT is mandatory.
The llms.txt file is effectively a “Human-In-The-Loop” filter applied to the competitor’s own content. They have explicitly chosen these pages as the most representative and useful for an AI. If they are listing their “API Documentation” before their “Product Pages,” they are betting on developer mindshare. If they are listing “Case Studies” first, they are betting on enterprise trust.
Scraping this list allows you to reverse-engineer their objective function. You can then analyze these specific pages to see how they are optimizing for vectors, what schema they are using, and how they are structuring their grounding data.
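As a rough first pass on that analysis, you can flag which priority pages ship JSON-LD structured data. A sketch, assuming the competitor_priorities.txt produced by the earlier one-liner; this simple grep only detects embedded JSON-LD blocks, not every flavor of schema markup.

```bash
# Flag which priority pages embed JSON-LD structured data.
while read -r url; do
  if curl -sL "$url" | grep -qi 'application/ld+json'; then
    echo "SCHEMA   ${url}"
  else
    echo "NOSCHEMA ${url}"
  fi
done < competitor_priorities.txt
```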
For a deeper dive into Bash scripting, consult the GNU Bash Reference Manual. For Python specifics, the Requests documentation is essential reading.