Examples
A basic crawler starts with a few seed URLs, keeps a queue, and visits new links as it finds them.
```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

seed_urls = ["https://example.com"]
queue = deque(seed_urls)
seen = set(seed_urls)  # every URL ever enqueued, so we never visit twice

while queue:
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue  # skip pages that time out or refuse the connection
    soup = BeautifulSoup(html, "html.parser")
    print(f"crawled: {url}")
    for a in soup.select("a[href]"):
        # Resolve relative links against the page they appeared on
        next_url = urljoin(url, a["href"])
        # Stay on the seed domain and skip anything already queued
        if urlparse(next_url).netloc == "example.com" and next_url not in seen:
            seen.add(next_url)
            queue.append(next_url)
```
In production, this gets messy fast: duplicate URLs, infinite calendar pages, crawl traps, rate limits, blocked requests, and pages that only exist after JavaScript runs.
Practical tips
- Start with clear scope: one domain, specific path patterns, maximum depth, allowed and blocked URL rules.
- Treat crawling and scraping as different jobs: crawling finds pages, scraping extracts fields.
- Normalize URLs before enqueueing them: remove fragments, handle trailing slashes, and decide how to treat query params.
- Expect crawl traps: faceted navigation, search pages, session URLs, and endless date archives can blow up your queue.
- Respect site limits: throttle requests, retry carefully, and remember that a single-threaded local test behaves nothing like production traffic.
- Track freshness separately from discovery: some pages need recrawling often, others barely change.
- If pages need a browser to render links, your crawler needs browser support too or it will miss half the site.
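The scope and trap tips above can be combined into a single predicate the crawler runs before enqueueing a link. The host, path patterns, depth limit, and query-parameter threshold below are illustrative assumptions, not values from the example above:

```python
import re
from urllib.parse import parse_qsl, urlsplit

ALLOWED_HOST = "example.com"                           # assumption: one-domain crawl
ALLOWED_PATHS = re.compile(r"^/(products|docs)(/|$)")  # hypothetical path scope
MAX_DEPTH = 5
MAX_QUERY_PARAMS = 3  # faceted navigation usually exceeds this quickly

def in_scope(url: str, depth: int) -> bool:
    """Return True only for URLs that match the crawl's declared scope."""
    parts = urlsplit(url)
    if parts.netloc != ALLOWED_HOST or depth > MAX_DEPTH:
        return False
    if not ALLOWED_PATHS.match(parts.path):
        return False
    # Heavily parameterized URLs are a classic trap signature
    if len(parse_qsl(parts.query)) > MAX_QUERY_PARAMS:
        return False
    # Endless date archives: block calendar-style paths like /2021/05/
    if re.search(r"/\d{4}/\d{2}(/|$)", parts.path):
        return False
    return True
```

Keeping scope in one function makes the crawl's boundaries auditable: when the queue explodes, there is exactly one place to tighten.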
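URL normalization can be sketched with the standard library alone. The set of tracking parameters to strip is an assumption; each site needs its own list:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = frozenset({"utm_source", "utm_medium", "utm_campaign"})  # assumed list

def normalize(url: str) -> str:
    """Canonicalize a URL before enqueueing: lowercase scheme and host,
    drop the fragment, strip tracking params, sort what remains, and
    remove a trailing slash."""
    parts = urlsplit(url)
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))

print(normalize("https://Example.com/a/?utm_source=x&b=2#frag"))
# → https://example.com/a?b=2
```

Run every discovered link through this before the `seen` check, so `page?b=2&utm_source=x` and `page/?b=2` count as one URL instead of three.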
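Throttling and retrying can be folded into one fetch helper. This is a minimal sketch, assuming a `requests.Session` and a fixed base delay; the delay, retry count, and backoff factor are placeholder values, not recommendations:

```python
import time
from typing import Optional

import requests

def polite_get(session: requests.Session, url: str,
               delay: float = 1.0, retries: int = 3) -> Optional[requests.Response]:
    """Fetch with exponential backoff between attempts; return None if
    every attempt fails rather than crashing the crawl loop."""
    for attempt in range(retries):
        time.sleep(delay * (2 ** attempt))  # 1s, 2s, 4s before each try
        try:
            resp = session.get(url, timeout=10)
            if resp.status_code == 429:  # server explicitly says slow down
                continue
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            continue
    return None
```

A shared `Session` also reuses connections, which is both faster and gentler on the target server than opening a fresh connection per request.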
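Separating freshness from discovery can be as simple as a per-page-type recrawl interval. The page types and intervals here are hypothetical; the point is that discovery records a page once, while a scheduler like this decides when to fetch it again:

```python
import time
from typing import Optional

# Hypothetical recrawl intervals per page type, in seconds
RECRAWL_INTERVALS = {
    "listing": 60 * 60,            # listings churn hourly
    "detail": 7 * 24 * 60 * 60,    # detail pages barely change
}

def due_for_recrawl(last_crawled: float, page_type: str,
                    now: Optional[float] = None) -> bool:
    """True once enough time has passed since the last fetch of this page."""
    now = time.time() if now is None else now
    interval = RECRAWL_INTERVALS.get(page_type, 24 * 60 * 60)  # default: daily
    return now - last_crawled >= interval
```

In practice the `last_crawled` timestamps live in whatever store backs the `seen` set, so the same inventory drives both deduplication and scheduling.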
Use cases
- Building a list of product pages before scraping price, stock, and metadata.
- Monitoring a docs site for newly published or changed pages.
- Discovering all category, listing, and detail pages on a marketplace.
- Keeping a sitemap-like inventory of a site so downstream scraping jobs know what to fetch.
- Checking internal links, page coverage, or content drift across a large site.