Glossary

Concurrency

Concurrency is sending or processing multiple scraping tasks at the same time instead of waiting for each one to finish before starting the next. In scraping, that mostly means keeping many requests in flight at once so jobs finish faster, but pushing it too hard gets you rate-limited or blocked, or just trades your speed problem for a reliability problem.

Examples

A simple scraper does one request, waits, then does the next. That works for 20 pages. It falls apart fast when you need 20,000.

With concurrency, you keep multiple requests in flight at once:

import asyncio
import httpx

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

async def fetch(client, url):
    r = await client.get(url, timeout=30)
    return url, r.status_code

async def main():
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(fetch(client, url) for url in urls))
        print(results)

asyncio.run(main())

The point is not just speed. It's using waiting time better. Network requests spend most of their life blocked on I/O, so doing them one by one wastes time.
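You can see the effect without touching the network. The sketch below simulates I/O with asyncio.sleep (the URLs and the 0.05s delay are stand-ins, not real requests) and times the same ten "fetches" run one by one versus gathered concurrently:

```python
import asyncio
import time

async def fake_fetch(url):
    # Stand-in for a network call: blocked on I/O for 0.05s, no CPU work.
    await asyncio.sleep(0.05)
    return url

async def sequential(urls):
    # One at a time: total time is roughly the sum of all the waits.
    return [await fake_fetch(u) for u in urls]

async def concurrent(urls):
    # All in flight at once: total time is roughly one wait.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(10)]

start = time.perf_counter()
asyncio.run(sequential(urls))
seq = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(concurrent(urls))
conc = time.perf_counter() - start

print(f"sequential: {seq:.2f}s, concurrent: {conc:.2f}s")
```

The sequential run takes about ten times as long, because nine of its ten time slices are spent doing nothing but waiting.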

If you send requests through ScrapeRouter, concurrency still needs limits:

import asyncio
import httpx

api_key = "YOUR_API_KEY"
targets = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

semaphore = asyncio.Semaphore(5)

async def scrape(client, url):
    async with semaphore:
        r = await client.post(
            "https://www.scraperouter.com/api/v1/scrape/",
            headers={"Authorization": f"Api-Key {api_key}"},
            json={"url": url},
            timeout=60,
        )
        return r.json()

async def main():
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(scrape(client, url) for url in targets))
        print(results)

asyncio.run(main())

Practical tips

  • Start with a concurrency limit, not unlimited fan-out. Good first step: 5 to 20 in-flight requests, then measure.
  • Separate concurrency from throughput: concurrency is how many jobs run at once, throughput is how many finish per second.
  • More concurrency does not automatically mean better scraping: too much and you hit rate limits, timeouts, proxy exhaustion, browser queue buildup, and retry storms.
  • Use the right tool for the bottleneck: async for lots of network I/O, threads for blocking libraries, processes for CPU-heavy parsing.
  • Put backpressure in the system: queues, semaphores, worker pools, and request budgets stop a scraper from melting down when a target slows down.
  • Watch production metrics: success rate, p95 latency, retry rate, ban rate, proxy usage, and cost per successful page.
  • Tune by target. One site handles 20 concurrent requests fine, another starts blocking at 3.
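The backpressure point above can be sketched with an asyncio.Queue and a fixed pool of workers. The worker count, queue size, and the asyncio.sleep standing in for the real fetch are all illustrative choices, not required values:

```python
import asyncio

async def worker(queue, results):
    # Each worker pulls URLs off the shared queue. A fixed pool of
    # workers caps concurrency no matter how many URLs get enqueued.
    while True:
        url = await queue.get()
        try:
            await asyncio.sleep(0.01)  # stand-in for the actual fetch
            results.append((url, "ok"))
        finally:
            queue.task_done()

async def main(urls, num_workers=5):
    # A bounded queue gives you backpressure: the producer blocks on
    # put() when workers fall behind, instead of piling up work.
    queue = asyncio.Queue(maxsize=10)
    results = []
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(num_workers)]
    for url in urls:
        await queue.put(url)
    await queue.join()  # wait until every enqueued URL is processed
    for w in workers:
        w.cancel()
    return results

urls = [f"https://example.com/page/{i}" for i in range(25)]
results = asyncio.run(main(urls))
print(len(results))
```

Compared to a bare semaphore, this shape also gives you a natural place to hang retries, per-worker clients, and shutdown logic.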

A basic async limit with a semaphore:

import asyncio

semaphore = asyncio.Semaphore(10)

async def fetch(client, url):
    async with semaphore:
        return await client.get(url)

If your scraper gets slower as you increase concurrency, that is not weird. It often means the target is throttling you or your own infrastructure is overloaded.

Use cases

  • Crawling listing pages faster: fetch many category or pagination URLs at once instead of serially.
  • Enriching large datasets: look up thousands of product, company, or profile pages without turning the job into an overnight batch.
  • Running scheduled scrapes in a fixed window: concurrency helps you finish before data goes stale.
  • Browser-based scraping at scale: you still need concurrency control because headless browsers are expensive and easy to overload.
  • API + fallback setups: run multiple scrape tasks concurrently while a router layer handles provider selection, retries, and anti-bot differences underneath.

In practice, concurrency matters when you move from "this script works" to "this job has to finish every day without babysitting." That is where speed, ban rate, infra cost, and failure handling all start interacting.

Related terms

  • Rate Limiting
  • Async Scraping
  • Retries
  • Timeout
  • Proxy Rotation
  • Throughput
  • Headless Browser