Glossary

Robots.txt

Robots.txt is a text file on a website, usually served from /robots.txt, that tells crawlers which paths they may crawl and which they are asked to avoid. It is a crawler-facing policy file, not an enforcement mechanism: well-behaved bots read and honor it, while bad ones simply ignore it.

Examples

A simple robots.txt file looks like this:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml

If you're building a scraper, the first check is often just fetching the file:

curl https://example.com/robots.txt

In Python, you can parse it before crawling:

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt once, up front.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/products/123"
user_agent = "my-scraper"

# can_fetch() applies the rules for this user-agent, falling back to the * group.
if rp.can_fetch(user_agent, url):
    print("Allowed to crawl")
else:
    print("Blocked by robots.txt")
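The example robots.txt above also sets a Crawl-delay, and the same parser can read that too. A minimal sketch, feeding the rules in as inline lines via parse() so it runs without a network call; the user-agent string is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example file, supplied inline via parse()
# instead of being fetched from /robots.txt over the network.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 5
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# crawl_delay() returns the Crawl-delay value, or None if the file sets none.
print(rp.crawl_delay("my-scraper"))                               # 5
print(rp.can_fetch("my-scraper", "https://example.com/admin/x"))  # False
print(rp.can_fetch("my-scraper", "https://example.com/public/"))  # True
```

Note that Crawl-delay is a de facto convention rather than part of the original standard, so not every site sets it and not every crawler honors it.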

Practical tips

  • Treat robots.txt as an early check, not the whole compliance story: terms of service, rate limits, login walls, and country-specific legal issues still matter.
  • Read the rules for your actual user-agent first, then fall back to * if there is no specific match.
  • Don’t confuse Disallow with access control: if a page is public, robots.txt does not secure it; it only states crawler policy.
  • Expect messy real-world files: invalid syntax, conflicting rules, huge files, missing files, and directives some parsers ignore.
  • In production, cache robots.txt and refresh it periodically instead of downloading it before every request.
  • If a site allows crawling but your scraper hammers it, you're still doing it wrong: respect concurrency, backoff, and crawl pace.
  • ScrapeRouter handles request routing and anti-bot work, but robots.txt policy is still your decision. The router can get the page; it does not decide what you should crawl.
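The caching tip above can be sketched with a small wrapper. The class name, refresh interval, and URL here are illustrative assumptions, not part of any library:

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical wrapper: re-downloads robots.txt only when the cached copy
# is older than max_age seconds, instead of before every single request.
class CachedRobots:
    def __init__(self, robots_url, max_age=3600):
        self.robots_url = robots_url
        self.max_age = max_age
        self._rp = None
        self._fetched_at = 0.0

    def _refresh(self):
        rp = RobotFileParser()
        rp.set_url(self.robots_url)
        rp.read()  # one network fetch per max_age window
        self._rp = rp
        self._fetched_at = time.monotonic()

    def can_fetch(self, user_agent, url):
        # Refresh on first use, or when the cached copy has gone stale.
        if self._rp is None or time.monotonic() - self._fetched_at > self.max_age:
            self._refresh()
        return self._rp.can_fetch(user_agent, url)
```

In production you would also decide what a failed refresh means (keep the stale copy, assume allow-all, or pause the crawl); the sketch leaves that choice open.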

A practical flow looks like this:

from urllib.robotparser import RobotFileParser

robots_url = "https://example.com/robots.txt"
target_url = "https://example.com/catalog"
user_agent = "inventory-monitor-bot"

# Fetch and parse the policy file once before the crawl starts.
rp = RobotFileParser()
rp.set_url(robots_url)
rp.read()

# Fail loudly if the target path is disallowed for this user-agent.
if not rp.can_fetch(user_agent, target_url):
    raise PermissionError(f"robots.txt disallows {target_url} for {user_agent}")

print("Safe to continue")

Use cases

  • Search engine crawling: bots use robots.txt to avoid pages the site owner does not want crawled, like admin areas, internal search pages, or duplicate content.
  • Commercial scraping: teams check robots.txt before collecting pricing, listings, or catalog data so they have a clear policy signal before they put a job into production.
  • Internal crawler design: engineering teams use robots.txt parsing to keep broad crawlers from wandering into useless or sensitive sections and wasting bandwidth.
  • Operational guardrails: before scaling a crawl, robots.txt gives you a cheap first-pass filter so you don't burn proxy budget and engineering time fetching pages you already know you should skip.
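The guardrail idea can be sketched as a pre-flight filter over candidate URLs. The rules, bot name, and URLs below are made up for illustration, and parse() stands in for a real network fetch:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules, fed in directly instead of fetched from a site.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /internal-search/",
    "Disallow: /admin/",
])

candidates = [
    "https://example.com/products/1",
    "https://example.com/internal-search/?q=widgets",
    "https://example.com/admin/login",
]

# Drop disallowed URLs before spending proxy budget or crawl time on them.
crawlable = [u for u in candidates if rp.can_fetch("inventory-monitor-bot", u)]
print(crawlable)  # ['https://example.com/products/1']
```

Running this filter once over a seed list is cheap, and it shrinks the crawl frontier before any expensive routing or anti-bot work happens.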

Related terms

Crawler, User-Agent, Rate Limiting, Web Scraping, Sitemap, Crawl Budget, Proxy Rotation