In the early days of the web, scraping was a smash-and-grab operation. You wrote a script, you hit the server as hard as your bandwidth allowed, you grabbed the HTML, and you ran. If you got blocked, you rotated your IP and did it again.
But in the Agentic Web of 2025, that approach is not just rude; it is counter-productive.
As we move from a web of “Documents for Humans” to a web of “Data for Agents,” the relationship between the Host (the website) and the Visitor (the scraper) is shifting from adversarial to transactional. Websites want to be scraped by high-quality agents (like LLMs and Search/Answer engines), but they cannot afford to be DDoS’d by poorly written scripts.
If you want your agent to have long-term access to high-quality data, you must follow the Protocols of Etiquette. You must be a “Good Bot.”
This article outlines the technical controls, standards, and best practices for modern web scraping. By following these rules, you ensure that your focused crawler causes zero harm to the host and maintains a sustainable data pipeline.
1. The Identification Handshake: The User-Agent String
The first and most critical control is identification. When you knock on a door, you introduce yourself. When your script sends an HTTP request, it sends a User-Agent header.
Do not use the default python-requests/2.31.0 or Scrapy/2.11.0. These are the “hoodies” of the internet. They scream “I am a script, and I am putting in zero effort.” Most WAFs (Web Application Firewalls) like Cloudflare or AWS WAF will block these generic strings by default.
What Must Be in Your User-Agent?
A helpful User-Agent string contains four key components:
- Bot Name: A unique identifier for your bot.
- Version: The current version of your bot (helps hosts track changes).
- Platform/Architecture: (Optional but polite) e.g., +https://example.com/bot.
- Contact Information: This is mandatory. You must provide a way for the sysadmin to contact you if your bot goes rogue.
Bad User-Agent
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Why it’s bad: This is “spoofing.” You are lying. You are pretending to be a human on a Chrome browser. While this might bypass some basic filters, it destroys trust. If a sysadmin sees this hitting their server 10 times a second, they know it’s a lie, and they will ban the IP subnet.
Good User-Agent
User-Agent: MyResearchBot/1.0 (+https://example.com/research-bot; bot@example.com)
Why it’s good: It tells the truth. It gives a URL for more info. Most importantly, it gives an email address (bot@example.com).
If your script goes into an infinite loop and starts hammering a small e-commerce site, the sysadmin can email you: “Hey, your bot is killing our search page, please stop.” If you provide no contact info, their only recourse is to block you hard—often specifically blocking your IP range or ASN.
Mobile vs. Desktop vs. Script
Should you identify as a browser?
- If you are rendering JavaScript (Headless Chrome/Puppeteer): It is acceptable to include the standard browser components in your string, as long as you append your contact info.
- If you are just fetching HTML (Requests/BeautifulSoup): Do not pretend to be a browser. Be proud of your script nature.
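A minimal sketch of the honest handshake, using Python's standard library (the same header works identically with the requests library's Session). The bot name, info URL, and contact email are the hypothetical examples from above — swap in your own.

```python
from urllib.request import Request

# A hypothetical bot identity: replace the name, info URL, and email with yours.
USER_AGENT = "MyResearchBot/1.0 (+https://example.com/research-bot; bot@example.com)"

# Every request carries the identifying header -- no spoofing, no hoodie.
req = Request("https://example.com/page", headers={"User-Agent": USER_AGENT})
# html = urllib.request.urlopen(req).read()  # network call, commented out here
```

With requests, the equivalent is `session.headers.update({"User-Agent": USER_AGENT})` on a `requests.Session`, so every request in the session identifies you.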
2. Respecting the Traffic Laws: Robots.txt
The robots.txt file is the oldest standard on the web (created in 1994). It is a text file living at the root of a domain (e.g., https://example.com/robots.txt) that tells robots which parts of the site they cannot visit.
Ignoring robots.txt is the cardinal sin of scraping.
Parsing Robots.txt Properly
Do not try to parse robots.txt with a regex. It is more complex than it looks. It supports wildcards (*), ends-with matching ($), and specific user-agent directives.
Use a battle-tested library.
Table 1: Open Source Robots.txt Parsers
| Language | Library | Best For | Notes |
|---|---|---|---|
| Python | robotexclusionrulesparser | General Scraping | Very robust, handles non-standard directives well. Better than the built-in urllib.robotparser. |
| Python | protego | Scrapy | The default parser for the Scrapy framework. Extremely fast, written in pure Python. |
| Node.js | robots-parser | JS/TS Agents | Compliant with Google’s spec. API is simple: isAllowed(url, ua). |
| Go | temoto/robotstxt | High Performance | Used in high-throughput Go crawlers. |
How to Implement Strict Compliance
Your scraper loop should look like this:
```python
from robotexclusionrulesparser import RobotExclusionRulesParser

rp = RobotExclusionRulesParser()
rp.fetch("https://example.com/robots.txt")

target_url = "https://example.com/private-data"
user_agent = "MyResearchBot/1.0"

if rp.is_allowed(user_agent, target_url):
    scrape(target_url)  # your own fetch-and-parse routine
else:
    print(f"Skipping {target_url}: Blocked by robots.txt")
```
The “Crawl-Delay” Directive
Some robots.txt files include a Crawl-Delay: 10 directive. This asks you to wait 10 seconds between requests.
- Standard Compliance: Google ignores this. Bing respects it.
- Your Compliance: You must respect it. If a site asks for a delay, it usually means their server is underpowered. Ignoring it will cause a 503 error (Service Unavailable) or a ban.
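A sketch of honoring Crawl-Delay with Python's built-in urllib.robotparser (less robust than the libraries in Table 1, as noted, but its `crawl_delay()` accessor is convenient). The robots.txt body is a hypothetical example inlined here; in practice you would fetch it from the site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt body -- in practice, fetch it from the target domain.
robots_txt = """
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Honor the site's requested delay; fall back to the 1 RPS default if none is set.
delay = rp.crawl_delay("MyResearchBot/1.0")
pause = float(delay) if delay is not None else 1.0
```

Sleep for `pause` seconds between every pair of requests to that domain.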
3. Velocity Control: Knowing When to Slow Down
The speed at which you crawl is the difference between “Optimization” and “Denial of Service.”
Safe Default: 1 Request Per Second (1 RPS)
If you are scraping a new domain and don’t know its capacity, the safest default crawl rate is 1 request per second. This is slow enough that even a shared hosting plan on a $5 server can handle it without noticing.
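A minimal sketch of enforcing that default: a limiter that sleeps just enough to keep a minimum interval between requests. The class and its names are illustrative, not from any particular library.

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests (default: 1 request/second)."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last_request = 0.0

    def wait(self) -> None:
        # Sleep only for the remainder of the interval since the last request.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()

limiter = RateLimiter(min_interval=1.0)
# for url in urls:
#     limiter.wait()
#     fetch(url)
```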
Dynamic Throttling (The “Backoff” Strategy)
Network conditions change. A site that handles 10 RPS at 3 AM might crash under the same load during a Black Friday traffic spike. You must listen to the server’s signals.
Signals to Watch For:
- HTTP 429 (Too Many Requests): This is the server explicitly screaming “STOP!” When you see a 429, you must stop immediately, wait (sleep), and then retry with a slower rate.
- HTTP 503 (Service Unavailable): This often means the database is overloaded. Back off.
- Increasing Latency: If your first request took 200ms, and your 100th request took 2000ms, you are suffocating the server. Slow down.
Implementing Exponential Backoff
When you hit a 429 or 503, do not just retry immediately. Use Exponential Backoff.
- Attempt 1: Fail. Wait 2 seconds.
- Attempt 2: Fail. Wait 4 seconds.
- Attempt 3: Fail. Wait 8 seconds.
- Attempt 4: Give up.
This gives the server breathing room to recover.
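The schedule above can be sketched as a retry wrapper. This is illustrative code, not a library API: `fetch` is any callable returning an object with a `status_code` attribute (a requests response fits), and the small random jitter is an added nicety to avoid synchronized retries.

```python
import random
import time

def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0):
    """Retry on 429/503 with exponential backoff: wait 2s, 4s, 8s, then give up."""
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code not in (429, 503):
            return response
        if attempt == max_attempts:
            break  # still overloaded on the final attempt: give up
        # 2s, 4s, 8s, ... plus a little jitter so retries don't synchronize.
        delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, base_delay / 4)
        time.sleep(delay)
    return None  # caller should log this URL and move on
```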
4. HTTP Headers: More Than Just User-Agent
To be a “Good Bot,” you should provide full context in your HTTP headers. This helps the web server format the content correctly for you and trace your requests.
- Accept-Encoding: gzip, deflate, br: This tells the server “I can handle compressed data.” This is crucial. Compressed HTML is often 70-80% smaller than raw HTML. Using this saves the host bandwidth and speeds up your scrape.
- Accept-Language: en-US,en;q=0.9: Tells the server which language you prefer.
- Referer: If you found a URL on Page A that links to Page B, when you scrape Page B, set the Referer to Page A. This helps the webmaster understand the structure of your crawl in their analytics.
- From: (Legacy but polite) Put your email address here as well: From: bot@example.com.
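These headers can be bundled into one default set. A sketch, reusing the article's example values with a hypothetical bot identity; the helper name is illustrative.

```python
# A polite default header set; values are the article's examples, with a
# hypothetical bot identity -- substitute your own name, URL, and email.
POLITE_HEADERS = {
    "User-Agent": "MyResearchBot/1.0 (+https://example.com/research-bot; bot@example.com)",
    "Accept-Encoding": "gzip, deflate, br",  # let the server send compressed bodies
    "Accept-Language": "en-US,en;q=0.9",
    "From": "bot@example.com",               # legacy but polite contact header
}

def headers_for(referer=None):
    """Return the polite header set, adding a Referer when we followed a link."""
    headers = dict(POLITE_HEADERS)
    if referer:
        headers["Referer"] = referer
    return headers
```

Pass `headers_for("https://example.com/page-a")` when fetching a page you discovered on Page A, so the crawl path shows up in the host's analytics.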
5. Fixed IP Ranges and Transparency
For large-scale scraping operations (e.g., if you are building a vertical search engine or an AI training set), you cannot hide behind residential proxies. You must be transparent.
Why Use Fixed IPs?
Using a rotating residential proxy network (like Bright Data or Oxylabs) is great for sneaking past blocks, but it looks like a botnet attack to the sysadmin. They see thousands of random IPs hitting their login page.
If you scrape from a Fixed IP Range (e.g., a dedicated block of IPv4 addresses from AWS or a specific datacenter), the sysadmin sees all traffic coming from a known source.
Documenting Your IPs
If you are a serious agent, you should publish your IP ranges, just like Google (Googlebot.json) and OpenAI (gptbot-ranges) do. Create a page on your website (linked in your User-Agent) that lists your current IPs.
Example ips.json:
{
"creationTime": "2025-10-15T10:00:00",
"prefixes": [
{
"ipv4Prefix": "192.0.2.0/24",
"service": "MyResearchBot"
},
{
"ipv6Prefix": "2001:db8::/32",
"service": "MyResearchBot"
}
]
}
This allows savvy webmasters to whitelist you in their firewall, guaranteeing you access even when they block everyone else.
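A sketch of the verification step from the other side: how a webmaster (or you, sanity-checking your own publication) can test whether an address falls inside the published prefixes, using Python's standard ipaddress module. The document is the ips.json example above, inlined for illustration.

```python
import ipaddress
import json

# The example ips.json from above, inlined for illustration.
ips_json = """
{
  "creationTime": "2025-10-15T10:00:00",
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24", "service": "MyResearchBot"},
    {"ipv6Prefix": "2001:db8::/32", "service": "MyResearchBot"}
  ]
}
"""

def ip_in_published_ranges(ip, doc):
    """Check whether an address falls inside any published bot prefix."""
    addr = ipaddress.ip_address(ip)
    for entry in json.loads(doc)["prefixes"]:
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        # Mixed IPv4/IPv6 comparisons are simply False, so both kinds can coexist.
        if addr in ipaddress.ip_network(prefix):
            return True
    return False
```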
6. Resources and Documentation
To build a truly robust scraper, learn from the masters. The documentation for major crawlers serves as the bible for scraper etiquette.
- Google Search Central - Googlebot: The gold standard. Read how they maximize efficiency and handle errors.
- Python Requests Documentation: The standard library for HTTP in Python. Master usage of Sessions and Adapters.
- Scrapy Documentation: The most popular scraping framework. Their “AutoThrottle” middleware documentation is a masterclass in dynamic rate limiting.
- Mozilla Developer Network (MDN) - HTTP Headers: Understand exactly what Cache-Control and ETag do, so you don’t re-download unchanged content.
- Common Crawl FAQ: Learn how the internet’s largest open scraper manages politeness at scale.
Conclusion: The Long Game
In the Agentic Web, data is the new oil, but access is the new currency.
If you scrape aggressively, break robots.txt, and spoof your User-Agent, you might get the data today. But tomorrow, you will be in the firewall’s blocklist. You will be fighting a constant, expensive war against anti-bot measures (Cloudflare Turnstile, CAPTCHAs, Fingerprinting).
If you scrape politely, identify yourself, and respect the host’s resources, you become a partner. You build a sustainable pipeline. In a world where AI agents will be negotiating for real-time access to APIs and databases, your reputation as a “Good Bot” is your most valuable asset.
Be visible. Be polite. Be efficient.