Glossary

WAF

A WAF, or Web Application Firewall, sits in front of a site and filters requests before they reach the application. In scraping, it is one of the main things blocking you in production: rate limits, CAPTCHAs, bot checks, fingerprinting, and silent 403s often come from the WAF, not the site itself.

Examples

A few common signs you're dealing with a WAF instead of a normal app response:

  • You get a 403 on perfectly valid requests
  • The same URL works in a browser but fails in your scraper
  • Response bodies contain CAPTCHA markup, challenge pages, or JavaScript checks
  • Success rate drops hard when concurrency goes up
  • Different IPs get different behavior for the same request

curl -I https://target-site.com/products

Summarized, the response headers look like this:

{
  "status": 403,
  "server": "cloudflare",
  "content_type": "text/html"
}

That does not always mean Cloudflare is the whole problem, but it tells you something is sitting in front of the origin and making blocking decisions.
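The signs listed above can be checked in code. The following is a minimal sketch, not a complete detector: the function name `waf_signals` and both marker lists are assumptions made for illustration, and real challenge pages differ by vendor and change over time.

```python
from typing import Mapping

# Illustrative marker lists (assumptions, not an exhaustive or vendor-accurate set).
WAF_SERVER_HINTS = ("cloudflare", "akamai", "datadome")
BODY_HINTS = ("captcha", "challenge", "enable javascript", "access denied")


def waf_signals(status: int, headers: Mapping[str, str], body: str) -> list[str]:
    """Return human-readable hints that a response came from a WAF, not the app."""
    signals = []

    # Header names are case-insensitive, so normalize before looking up "server".
    server = ""
    for name, value in headers.items():
        if name.lower() == "server":
            server = value.lower()

    if status in (403, 429):
        signals.append(f"status {status}")
    if any(hint in server for hint in WAF_SERVER_HINTS):
        signals.append(f"server header: {server}")

    lowered = body.lower()
    for marker in BODY_HINTS:
        if marker in lowered:
            signals.append(f"body mentions {marker!r}")

    return signals
```

An empty list does not prove there is no WAF; it only means none of these particular hints fired.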

With a scraping API, the point is not to "beat" a WAF once. The point is to keep requests working as defenses change.

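import os

import requests

# Read the key from the environment; the variable name here is an assumption.
api_key = os.environ["SCRAPEROUTER_API_KEY"]

url = "https://www.scraperouter.com/api/v1/scrape/"
headers = {
    "Authorization": f"Api-Key {api_key}",
    "Content-Type": "application/json",
}
payload = {
    "url": "https://target-site.com/products",
    "render": True,
}

# Set an explicit timeout so a slow request cannot hang the job indefinitely.
r = requests.post(url, headers=headers, json=payload, timeout=60)
print(r.status_code)
print(r.text[:500])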

Practical tips

  • Treat WAF blocking as a systems problem, not a header-tweaking problem: IP reputation, TLS fingerprint, browser behavior, cookie flow, request pacing, and session consistency all matter
  • Watch for soft blocks, not just hard failures: empty pages, fake 200s, login walls, poisoned HTML, and challenge pages returned with success status codes
  • Test with realistic traffic patterns: low-volume single requests often work; production concurrency is where things break
  • Keep browser and non-browser paths separate: some targets are fine with plain HTTP, others need full browser execution to get through anti-bot checks
  • Measure cost against engineering time: building your own WAF handling stack is possible, but it turns into ongoing maintenance fast
  • Do not assume one successful request means the problem is solved: what matters is stable success rate over thousands of requests and over time
  • Rotate carefully: random IP rotation without session logic often makes detection worse
  • If a target is heavily protected, using a router layer can save a lot of wasted effort because it can switch approach per site instead of forcing one method everywhere
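
Several of these tips (watching for soft blocks, keeping browser and non-browser paths separate, switching approach per site) can be combined in one small escalation loop. This is a sketch under stated assumptions: the `Fetcher` type, the marker list, and the `MIN_BODY_CHARS` threshold are all hypothetical, and the actual fetchers (a plain HTTP client, a browser-based client) are passed in by the caller.

```python
from typing import Callable, Optional

# A fetcher takes a URL and returns (status_code, body). The concrete fetchers
# are hypothetical stand-ins supplied by the caller.
Fetcher = Callable[[str], tuple[int, str]]

SOFT_BLOCK_MARKERS = ("captcha", "verify you are human", "access denied")
MIN_BODY_CHARS = 200  # assumed threshold for a suspiciously empty page


def is_blocked(status: int, body: str) -> bool:
    """Treat hard failures and soft blocks (including fake 200s) the same way."""
    if status in (403, 429):
        return True
    if len(body.strip()) < MIN_BODY_CHARS:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in SOFT_BLOCK_MARKERS)


def fetch_with_escalation(
    url: str, fetchers: list[tuple[str, Fetcher]]
) -> tuple[Optional[str], Optional[str]]:
    """Try fetchers in order, cheapest first; return (method_name, body) of the first usable response."""
    for name, fetch in fetchers:
        status, body = fetch(url)
        if not is_blocked(status, body):
            return name, body
    return None, None
```

Ordering the list from cheapest to most expensive keeps plain HTTP as the default and reserves browser execution for targets that actually need it.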

Use cases

  • Scraping ecommerce sites protected by Cloudflare, Akamai, DataDome, or AWS WAF
  • Running price monitoring jobs where requests need to keep working every day, not just in local tests
  • Collecting search result pages that trigger CAPTCHA or JavaScript challenges under load
  • Mixing lightweight HTTP fetches with browser-based requests depending on how aggressive the target's WAF is
  • Reducing time spent debugging blocks that are really infrastructure and fingerprinting issues, not parser bugs
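
For long-running jobs such as daily price monitoring, one way to notice degradation is to track success rate over a rolling window rather than per request. The class below is a minimal sketch; the name `SuccessRateMonitor`, the window size, and the alert threshold are assumptions chosen for illustration.

```python
from collections import deque


class SuccessRateMonitor:
    """Rolling success rate over the most recent `window` requests."""

    def __init__(self, window: int = 1000, alert_below: float = 0.9):
        # deque with maxlen drops the oldest result automatically.
        self.results = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, ok: bool) -> None:
        """Record one request outcome; count soft blocks as failures."""
        self.results.append(ok)

    @property
    def rate(self) -> float:
        # With no data yet, report 1.0 so an idle monitor does not alert.
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def degraded(self) -> bool:
        return self.rate < self.alert_below
```

Feeding `record()` the result of a soft-block check, not just the HTTP status, keeps fake 200s from inflating the rate.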

Related terms

  • CAPTCHA
  • Proxy Rotation
  • Browser Fingerprinting
  • Headless Browser
  • Rate Limiting
  • Anti-Bot