Glossary

throttling

Throttling is deliberately limiting how fast your scraper sends requests so you do not overwhelm a site or trigger its defenses. In production, this means controlling request rate, concurrency, or both per domain, per IP, or per job, because "send everything as fast as possible" is how scrapers get blocked and budgets get wasted.

Examples

A basic throttle in Python can be as simple as spacing requests out so you do not hammer the same host:

import time
import requests

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    response = requests.get(url, timeout=30)
    print(url, response.status_code)
    time.sleep(2)  # at most 1 request every 2 seconds (plus request time)
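A fixed sleep adds delay on top of however long each response takes, so the effective rate drifts with server latency. If you want a steadier pace, enforce a minimum interval between request starts instead. A minimal sketch (the class name is illustrative):

```python
import time


class MinIntervalThrottle:
    """Enforce a minimum interval between successive request starts."""

    def __init__(self, interval: float):
        self.interval = interval
        self._last = float("-inf")  # first call never waits

    def wait(self) -> None:
        now = time.monotonic()
        remaining = self.interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `throttle.wait()` before each `requests.get(...)` keeps request starts at least `interval` seconds apart, regardless of how long each response takes.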

In async scrapers, throttling is often about limiting concurrency, not just adding sleep:

import asyncio
import aiohttp

semaphore = asyncio.Semaphore(3)

async def fetch(session, url):
    async with semaphore:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
            print(url, response.status)
            await asyncio.sleep(1)
            return await response.text()

async def main():
    urls = [
        "https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3",
        "https://example.com/page/4",
    ]
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, url) for url in urls))

asyncio.run(main())

If you are routing requests through ScrapeRouter, throttling still matters. A router can handle proxy and anti-bot complexity, but you still want sane request pacing when hitting the same target repeatedly:

curl "https://www.scraperouter.com/api/v1/scrape/?url=https://example.com/products" \
  -H "Authorization: Api-Key $api_key"

Practical tips

  • Throttle by domain, not globally: amazon.com and a tiny Shopify store should not get the same request budget.
  • Control both request rate and concurrency: a scraper doing 50 concurrent requests with a 1-second delay is still aggressive.
  • Back off when you see warning signs: 429 responses, sudden CAPTCHA pages, connection resets, slower response times.
  • Add jitter to delays so traffic does not look machine-perfect:

    import random
    import time

    time.sleep(random.uniform(1.5, 3.5))

  • Treat throttling as a cost control tool too: fewer wasted retries, fewer burned proxies, less time debugging avoidable blocks.
  • Do not hardcode one number and forget it: safe limits change by site, route, time of day, and whether you are logged in.
  • For serious jobs, keep per-target rules: requests per second, max concurrency, retry budget, cooldown after 429.
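Several of the tips above — per-domain limits, jitter, and a cooldown after 429 — fit naturally in one small helper. A minimal sketch under those assumptions (all names and defaults are illustrative):

```python
import random
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Per-domain pacing: jittered minimum interval plus a cooldown after 429s."""

    def __init__(self, min_interval=1.5, max_interval=3.5, cooldown=60.0):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.cooldown = cooldown
        # domain -> earliest monotonic time the next request may start
        self._next_allowed = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        now = time.monotonic()
        delay = self._next_allowed.get(domain, now) - now
        if delay > 0:
            time.sleep(delay)
        # schedule the next slot with jitter so traffic is not machine-perfect
        self._next_allowed[domain] = time.monotonic() + random.uniform(
            self.min_interval, self.max_interval
        )

    def report_429(self, url):
        """Push the domain's next slot far into the future after a rate-limit hit."""
        domain = urlparse(url).netloc
        self._next_allowed[domain] = time.monotonic() + self.cooldown
```

Call `throttle.wait(url)` before each request and `throttle.report_429(url)` whenever a response comes back with status 429; other domains keep their own budgets and are unaffected.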

Use cases

  • Large catalog scraping: keep throughput steady across thousands of pages without getting rate-limited halfway through the run.
  • Multi-tenant scraping systems: stop one noisy customer job from burning all available proxy capacity on a single domain.
  • Fragile targets: throttle aggressively on sites that start blocking after small traffic spikes.
  • Cost-sensitive pipelines: reduce retries, bans, and dead requests that chew through proxy spend.
  • Scheduled refresh jobs: spread requests over time so daily updates keep working instead of creating one big traffic burst that gets noticed.

Related terms

rate limiting, concurrency, retry backoff, proxy rotation, 429 Too Many Requests, anti-bot