Examples
A simple robots.txt file looks like this:
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
```
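Given the file above, you can check how the rules evaluate without any network traffic by feeding the lines straight into Python's parser (the `my-scraper` user-agent name here is purely illustrative):

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse in-memory content instead of fetching

# /admin/ is disallowed for every user-agent
print(rp.can_fetch("my-scraper", "https://example.com/admin/users"))  # False
# /public/ is explicitly allowed
print(rp.can_fetch("my-scraper", "https://example.com/public/page"))  # True
# the Crawl-delay directive is exposed per user-agent
print(rp.crawl_delay("my-scraper"))  # 5
```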
If you're building a scraper, the first check is often just fetching the file:
```shell
curl https://example.com/robots.txt
```
In Python, you can parse it before crawling:
```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/products/123"
user_agent = "my-scraper"

if rp.can_fetch(user_agent, url):
    print("Allowed to crawl")
else:
    print("Blocked by robots.txt")
```
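Missing files are worth understanding too: `urllib.robotparser` treats a 401/403 response as disallow-everything and other 4xx errors (including a 404 for a missing file) as allow-everything, and an empty robots.txt likewise permits all paths by convention. A quick sketch of the empty-file case, with the content fed in directly so no network is involved:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([])  # an empty robots.txt: no rules at all

# With no rules, every path is permitted
print(rp.can_fetch("my-scraper", "https://example.com/anything"))  # True
```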
Practical tips
- Treat robots.txt as an early check, not the whole compliance story: terms of service, rate limits, login walls, and country-specific legal issues still matter.
- Read the rules for your actual user-agent first, then fall back to the * wildcard group if there is no specific match.
- Don't confuse Disallow with access control: if a page is public, robots.txt does not secure it; it only states crawler rules.
- Expect messy real-world files: invalid syntax, conflicting rules, huge files, missing files, and directives some parsers ignore.
- In production, cache robots.txt and refresh it periodically instead of downloading it before every request.
- If a site allows crawling but your scraper hammers it, you're still doing it wrong: respect concurrency, backoff, and crawl pace.
- ScrapeRouter handles request routing and anti-bot work, but robots.txt policy is still your decision. The router can get the page; it does not decide what you should crawl.
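The caching advice above can be sketched as a small per-host cache with a time-to-live. Everything here is invented for illustration, not library API: the `RobotsCache` class, its `max_age` parameter, and the injectable `fetch` hook, which is only there so the sketch can run without touching the network.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Keep one parsed robots.txt per host, refreshed after max_age seconds."""

    def __init__(self, max_age=3600, fetch=None):
        self.max_age = max_age
        self.fetch = fetch or self._default_fetch
        self._cache = {}  # host -> (fetched_at, parser)

    @staticmethod
    def _default_fetch(robots_url):
        # real deployment path: download and parse the live file
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp

    def _parser_for(self, host):
        now = time.time()
        cached = self._cache.get(host)
        if cached is None or now - cached[0] > self.max_age:
            self._cache[host] = (now, self.fetch(f"https://{host}/robots.txt"))
        return self._cache[host][1]

    def can_fetch(self, user_agent, url):
        return self._parser_for(urlsplit(url).netloc).can_fetch(user_agent, url)


calls = []

def offline_fetch(robots_url):
    # stand-in for a real download, so the sketch runs without a network
    calls.append(robots_url)
    rp = RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: /admin/"])
    return rp

cache = RobotsCache(max_age=600, fetch=offline_fetch)
print(cache.can_fetch("my-scraper", "https://example.com/admin/x"))  # False
print(cache.can_fetch("my-scraper", "https://example.com/catalog"))  # True
print(len(calls))  # 1 -- the second lookup reused the cached parser
```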
A practical flow looks like this:
```python
from urllib.robotparser import RobotFileParser

robot_url = "https://example.com/robots.txt"
target_url = "https://example.com/catalog"
user_agent = "inventory-monitor-bot"

rp = RobotFileParser()
rp.set_url(robot_url)
rp.read()

if not rp.can_fetch(user_agent, target_url):
    raise RuntimeError("Do not crawl this path")
print("Safe to continue")
```
Use cases
- Search engine crawling: bots use robots.txt to avoid pages the site owner does not want crawled, like admin areas, internal search pages, or duplicate content.
- Commercial scraping: teams check robots.txt before collecting pricing, listings, or catalog data so they have a clear policy signal before they put a job into production.
- Internal crawler design: engineering teams use robots.txt parsing to keep broad crawlers from wandering into useless or sensitive sections and wasting bandwidth.
- Operational guardrails: before scaling a crawl, robots.txt gives you a cheap first-pass filter so you don't burn proxy budget and engineering time fetching pages you already know you should skip.
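That first-pass filter can be as simple as running a candidate URL list through the parser before the crawl starts. The rules, URLs, and user-agent name below are all illustrative:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /search",
    "Disallow: /admin/",
])

candidates = [
    "https://example.com/products/1",
    "https://example.com/search?q=shoes",
    "https://example.com/admin/login",
    "https://example.com/products/2",
]

# drop disallowed URLs before spending proxy budget on them
allowed = [u for u in candidates if rp.can_fetch("inventory-monitor-bot", u)]
print(allowed)  # only the two product pages survive
```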