Glossary

Exponential backoff

Exponential backoff is a retry strategy where you wait longer after each failed request, typically doubling the delay each time. It helps scrapers recover from temporary failures like rate limits, timeouts, and overloaded targets without hammering the site and making the problem worse.

Examples

A basic pattern is: fail, wait 1 second, fail again, wait 2 seconds, then 4, then 8. In scraping, this matters when a target is flaky or starts pushing back with 429s.

import time
import requests

url = "https://example.com"
delay = 1
max_retries = 5

for attempt in range(max_retries):
    try:
        resp = requests.get(url, timeout=20)
    except (requests.ConnectionError, requests.Timeout):
        pass  # network errors are worth retrying
    else:
        if resp.ok:
            print(resp.text)
            break
        if resp.status_code not in (429, 500, 502, 503, 504):
            resp.raise_for_status()  # permanent failure: stop retrying
    if attempt == max_retries - 1:
        raise RuntimeError("all retries exhausted")
    time.sleep(delay)
    delay *= 2  # double the wait before the next attempt

A slightly less naive version adds jitter so your workers do not all retry at the exact same moment.

import time
import random

base_delay = 1
for attempt in range(5):
    delay = base_delay * (2 ** attempt)      # 1, 2, 4, 8, 16 seconds
    jitter = random.uniform(0, 0.5 * delay)  # desynchronize parallel workers
    time.sleep(delay + jitter)
    # ... retry the request here ...
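Jitter and a delay cap can also be combined into one helper. The sketch below uses the "full jitter" approach, drawing the delay uniformly between zero and the capped exponential value; `base` and `cap` are assumed parameter names, not from the examples above.

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential delay capped at `cap` seconds, with full jitter."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Drawing from the whole range, rather than adding jitter on top of a fixed delay, spreads retries more evenly across parallel workers.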

Practical tips

  • Use exponential backoff for temporary failures: 429, 502, 503, 504, connection resets, timeouts.
  • Do not keep retrying permanent failures: 400, 401, 403 on a blocked session, bad URLs, broken request payloads.
  • Add jitter: without it, parallel workers tend to retry in sync, which is a good way to create your own traffic spike.
  • Cap the delay: example, 1s, 2s, 4s, 8s, 16s, then stop or hold at a max like 30s.
  • Set a retry budget: if a job is low value, do not spend 2 minutes retrying it just because the code can.
  • Respect target hints: if you get a Retry-After header, use it.
  • In production, log the retry reason and delay. If you are backing off constantly, the real fix is often concurrency control, better session handling, or proxy rotation, not more retries.
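The retry-budget idea can be sketched with a wall-clock deadline instead of a fixed attempt count; `within_budget` and `budget_seconds` are hypothetical names, not part of any library:

import time

def within_budget(start, budget_seconds=30.0):
    """True while total elapsed retry time is under the budget."""
    return (time.monotonic() - start) < budget_seconds

Check within_budget(start) before each sleep in the retry loop and give up once it returns False, even if attempts remain.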
A small helper that decides whether a status code is worth retrying:

def should_retry(status_code):
    return status_code in {429, 500, 502, 503, 504}

To see whether a target sends a Retry-After header, inspect the response headers directly:

# fetch headers only; look for a Retry-After line in the output
curl -I https://example.com
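Respecting Retry-After from Python takes a little parsing, because the header can be either a number of seconds or an HTTP date. A sketch (the helper name is an assumption):

import email.utils
import time

def retry_after_seconds(header_value, default=1.0):
    """Parse a Retry-After header value into a sleep duration in seconds."""
    if header_value is None:
        return default
    try:
        return max(0.0, float(header_value))  # delta-seconds form, e.g. "120"
    except ValueError:
        pass
    try:
        when = email.utils.parsedate_to_datetime(header_value)  # HTTP-date form
        return max(0.0, when.timestamp() - time.time())
    except (TypeError, ValueError):
        return default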

Use cases

  • A retailer starts returning 429 after you ramp from 2 workers to 20: backoff reduces pressure and gives requests a chance to start succeeding again.
  • A JavaScript-heavy site has intermittent upstream 503s during peak hours: retries with backoff recover a decent chunk of those failures without manual intervention.
  • A scraping job runs overnight across thousands of pages: backoff prevents a short-lived outage from turning into a wall of immediate failed requests.
  • When using ScrapeRouter or any router layer, backoff still matters at the job level: routing helps with provider selection and resilience, but your system still needs sane retry behavior so you do not waste credits, time, and threads on dumb retry loops.

Related terms

  • Rate limiting
  • Retry logic
  • HTTP 429
  • Timeout
  • Proxy rotation
  • Concurrency control
  • Circuit breaker