Examples
A basic crawler starts with a few seed URLs, keeps a queue, and visits new links as it finds them.
```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

seed_urls = ["https://example.com"]
queue = deque(seed_urls)
seen = set(seed_urls)  # every URL ever enqueued, so we never visit twice

while queue:
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue  # skip pages that time out or refuse the connection
    soup = BeautifulSoup(html, "html.parser")
    print(f"crawled: {url}")
    for a in soup.select("a[href]"):
        # Resolve relative links against the page they appeared on
        next_url = urljoin(url, a["href"])
        # Stay on the seed domain and skip anything already queued
        if urlparse(next_url).netloc == "example.com" and next_url not in seen:
            seen.add(next_url)
            queue.append(next_url)
```
In production, this gets messy fast: duplicate URLs, infinite calendar pages, crawl traps, rate limits, blocked requests, and pages that only exist after JavaScript runs.
Practical tips
- Start with clear scope: one domain, specific path patterns, maximum depth, allowed and blocked URL rules.
- Treat crawling and scraping as different jobs: crawling finds pages, scraping extracts fields.
- Normalize URLs before enqueueing them: remove fragments, handle trailing slashes, and decide how to treat query params.
- Expect crawl traps: faceted navigation, search pages, session URLs, and endless date archives can blow up your queue.
- Respect site limits: throttle requests, retry carefully, and remember that a single-threaded local test behaves nothing like production traffic.
- Track freshness separately from discovery: some pages need recrawling often, others barely change.
- If pages need a browser to render links, your crawler needs browser support too or it will miss half the site.
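The scope and trap tips above can be combined into a single predicate the crawler runs before enqueueing a link. The host, path patterns, depth limit, and query-parameter threshold below are illustrative assumptions, not values from the example above:

```python
import re
from urllib.parse import parse_qsl, urlsplit

ALLOWED_HOST = "example.com"                           # assumption: one-domain crawl
ALLOWED_PATHS = re.compile(r"^/(products|docs)(/|$)")  # hypothetical path scope
MAX_DEPTH = 5
MAX_QUERY_PARAMS = 3  # faceted navigation usually exceeds this quickly

def in_scope(url: str, depth: int) -> bool:
    """Return True only for URLs that match the crawl's declared scope."""
    parts = urlsplit(url)
    if parts.netloc != ALLOWED_HOST or depth > MAX_DEPTH:
        return False
    if not ALLOWED_PATHS.match(parts.path):
        return False
    # Heavily parameterized URLs are a classic trap signature
    if len(parse_qsl(parts.query)) > MAX_QUERY_PARAMS:
        return False
    # Endless date archives: block calendar-style paths like /2021/05/
    if re.search(r"/\d{4}/\d{2}(/|$)", parts.path):
        return False
    return True
```

Keeping scope in one function makes the crawl's boundaries auditable: when the queue explodes, there is exactly one place to tighten.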
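URL normalization can be sketched with the standard library alone. The set of tracking parameters to strip is an assumption; each site needs its own list:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = frozenset({"utm_source", "utm_medium", "utm_campaign"})  # assumed list

def normalize(url: str) -> str:
    """Canonicalize a URL before enqueueing: lowercase scheme and host,
    drop the fragment, strip tracking params, sort what remains, and
    remove a trailing slash."""
    parts = urlsplit(url)
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))

print(normalize("https://Example.com/a/?utm_source=x&b=2#frag"))
# → https://example.com/a?b=2
```

Run every discovered link through this before the `seen` check, so `page?b=2&utm_source=x` and `page/?b=2` count as one URL instead of three.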
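Throttling and retrying can be folded into one fetch helper. This is a minimal sketch, assuming a `requests.Session` and a fixed base delay; the delay, retry count, and backoff factor are placeholder values, not recommendations:

```python
import time
from typing import Optional

import requests

def polite_get(session: requests.Session, url: str,
               delay: float = 1.0, retries: int = 3) -> Optional[requests.Response]:
    """Fetch with exponential backoff between attempts; return None if
    every attempt fails rather than crashing the crawl loop."""
    for attempt in range(retries):
        time.sleep(delay * (2 ** attempt))  # 1s, 2s, 4s before each try
        try:
            resp = session.get(url, timeout=10)
            if resp.status_code == 429:  # server explicitly says slow down
                continue
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            continue
    return None
```

A shared `Session` also reuses connections, which is both faster and gentler on the target server than opening a fresh connection per request.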
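Separating freshness from discovery can be as simple as a per-page-type recrawl interval. The page types and intervals here are hypothetical; the point is that discovery records a page once, while a scheduler like this decides when to fetch it again:

```python
import time
from typing import Optional

# Hypothetical recrawl intervals per page type, in seconds
RECRAWL_INTERVALS = {
    "listing": 60 * 60,            # listings churn hourly
    "detail": 7 * 24 * 60 * 60,    # detail pages barely change
}

def due_for_recrawl(last_crawled: float, page_type: str,
                    now: Optional[float] = None) -> bool:
    """True once enough time has passed since the last fetch of this page."""
    now = time.time() if now is None else now
    interval = RECRAWL_INTERVALS.get(page_type, 24 * 60 * 60)  # default: daily
    return now - last_crawled >= interval
```

In practice the `last_crawled` timestamps live in whatever store backs the `seen` set, so the same inventory drives both deduplication and scheduling.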
Use cases
- Building a list of product pages before scraping price, stock, and metadata.
- Monitoring a docs site for newly published or changed pages.
- Discovering all category, listing, and detail pages on a marketplace.
- Keeping a sitemap-like inventory of a site so downstream scraping jobs know what to fetch.
- Checking internal links, page coverage, or content drift across a large site.