Examples
A few scraping realities that show up when a CDN is in front of a site:
- Static assets come from the CDN, not the app: images, JS, CSS, and sometimes even cached HTML are served from edge nodes
- Different behavior by region: the same URL can return different cache states, headers, or block pages depending on where your request exits
- Bot protection often lives there: Cloudflare is the obvious one, but even plain CDN setups can enforce rate limits before you ever touch the origin server
A quick way to see what is sitting in front of a site is a HEAD request:

curl -I https://example.com/
You might see headers like these:
server: cloudflare
cf-cache-status: HIT
via: 1.1 varnish
x-cache: Miss from cloudfront
age: 342
That tells you a lot already: the response may be cached, may be coming from an edge location, and may never have hit the origin for your request.
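The header logic above can be sketched as a small classifier. This is a hedged simplification: the header names are the real ones shown above, but the rules (treating any HIT or nonzero age as edge-served) are an illustration, not how every CDN behaves.

```python
# Sketch: guess where a response came from, based on common CDN cache
# headers. The interpretation is simplified; real CDN configs vary.
def cache_verdict(headers):
    h = {k.lower(): v for k, v in headers.items()}
    cf = h.get("cf-cache-status", "").upper()   # Cloudflare
    xc = h.get("x-cache", "").lower()           # CloudFront / proxy-style
    age = int(h.get("age", "0") or 0)
    if cf == "HIT" or "hit" in xc or age > 0:
        return "edge cache"          # likely served without touching the origin
    if cf in {"MISS", "EXPIRED"} or "miss" in xc:
        return "origin (via CDN)"    # passed through the edge to the app
    return "unknown"

print(cache_verdict({"cf-cache-status": "HIT", "age": "342"}))  # edge cache
print(cache_verdict({"x-cache": "Miss from cloudfront"}))       # origin (via CDN)
```

Feeding it the headers from the curl output above tells you immediately whether your request ever reached the application.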
If you're scraping rendered pages, a CDN can also sit in front of an app that still needs JavaScript execution. So you end up dealing with two separate problems:
- Delivery layer: cache rules, edge blocks, geo behavior
- Rendering layer: client-side JS, API calls, dynamic content
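A cheap way to tell the two layers apart is to look at what the delivery layer actually handed you. This is a rough heuristic of my own, not a standard check: if the body has script tags but almost no visible text, the problem is probably the rendering layer. The 30-word threshold is arbitrary.

```python
import re

# Hedged heuristic: an HTML "shell" with scripts and little visible text
# usually means content is rendered client-side (rendering-layer problem),
# not blocked or miscached at the edge (delivery-layer problem).
def looks_js_rendered(html):
    scripts = len(re.findall(r"<script\b", html, re.I))
    text = re.sub(r"<script\b.*?</script>", "", html, flags=re.I | re.S)
    text = re.sub(r"<[^>]+>", " ", text)       # strip remaining tags
    visible_words = len(text.split())
    return scripts >= 1 and visible_words < 30  # arbitrary illustrative cutoff

shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
full = "<html><body><p>" + ("real content " * 50) + "</p></body></html>"
print(looks_js_rendered(shell))  # True
print(looks_js_rendered(full))   # False
```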
Practical tips
- Check response headers first: look for server, cf-cache-status, x-cache, via, age, etag
- Treat CDN issues and site issues as different things: a cached 403, an origin 403, and a JS-rendered empty page are not the same failure mode
- Vary exit geography when results look inconsistent: some CDN edges are stricter than others, some caches are stale, some regions get challenged more often
- Don't assume HTML came from the origin: if the CDN is caching aggressively, you may be debugging the wrong layer
- Watch for rate limits at the edge: if requests fail before session setup, login flow, or page render, the CDN is probably where you're getting stopped
- Use a browser when the site needs it, not by default: a lot of people jump straight to headless browsers when the real problem is edge blocking or cache variance
- If you need stability in production, use a router layer that can swap proxies, browser settings, and anti-bot handling without rewriting your scraper every week
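One way to act on "don't assume HTML came from the origin" is to force past the cache and compare. A sketch under assumptions: the _cb parameter name is made up for illustration, and whether a given CDN honors no-cache request headers or ignores unknown query params depends entirely on its configuration.

```python
import time
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

# Sketch: build a cache-busted URL plus no-cache headers, so you can
# fetch a "fresh" copy and diff it against the possibly-cached one.
# "_cb" is a hypothetical throwaway parameter name.
def cache_busted(url):
    parts = urlparse(url)
    query = parse_qsl(parts.query)
    query.append(("_cb", str(int(time.time()))))
    return urlunparse(parts._replace(query=urlencode(query)))

NO_CACHE_HEADERS = {"Cache-Control": "no-cache", "Pragma": "no-cache"}

print(cache_busted("https://example.com/page?id=1"))
```

If the busted fetch differs from the plain one, you were debugging a stale edge copy, not the site.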
Quick header check:
import requests

r = requests.get("https://example.com/", timeout=30)
print(r.status_code)
for k, v in r.headers.items():
    if k.lower() in {"server", "cf-cache-status", "x-cache", "via", "age", "etag"}:
        print(f"{k}: {v}")
If you want ScrapeRouter to handle the ugly part for harder targets:
curl -X POST https://www.scraperouter.com/api/v1/scrape/ \
-H "Authorization: Api-Key $api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/"
}'
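The same call in Python, for scripts that already use requests. Only what the curl example shows is assumed here (endpoint, Api-Key authorization scheme, JSON body with a url field); nothing else about the API is.

```python
import os

# Python equivalent of the curl call above. Endpoint and header scheme
# are taken from the example; everything else is unassumed.
API_URL = "https://www.scraperouter.com/api/v1/scrape/"

def build_request(api_key, target_url):
    headers = {
        "Authorization": f"Api-Key {api_key}",
        "Content-Type": "application/json",
    }
    payload = {"url": target_url}
    return headers, payload

if __name__ == "__main__":
    import requests  # imported here so the helper stays dependency-free
    headers, payload = build_request(
        os.environ["SCRAPEROUTER_API_KEY"], "https://example.com/"
    )
    resp = requests.post(API_URL, headers=headers, json=payload, timeout=60)
    print(resp.status_code)
```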
Use cases
- Scraping ecommerce pages behind Cloudflare or CloudFront: product pages may be cached at the edge, but pricing, stock, or session-specific content may still come from the origin or client-side APIs
- Monitoring site changes: if a CDN caches HTML, your scraper can miss updates unless you understand cache behavior and validation headers
- Collecting content across regions: CDNs can localize content, enforce country-level controls, or route traffic differently by geography
- Debugging inconsistent blocks: one IP range gets challenged, another works fine, because the edge layer is making decisions before the application even sees the request
- Reducing scraper maintenance: once a target is behind multiple CDN and anti-bot layers, the work stops being "fetch a page" and turns into ongoing delivery-layer maintenance
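For the monitoring case, the validation headers mentioned above (ETag, Last-Modified) let you ask "has this changed?" instead of re-downloading and diffing. This is standard HTTP conditional-request behavior; the helper function name is just illustrative.

```python
# Sketch: build conditional-request headers from values saved off a
# previous response. A 304 reply means the stored copy is still current,
# so the cached HTML has not changed at whatever layer answered you.
def conditional_headers(last_etag=None, last_modified=None):
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

print(conditional_headers('W/"abc123"'))
# with requests:
#   r = requests.get(url, headers=conditional_headers(stored_etag), timeout=30)
#   if r.status_code == 304:
#       pass  # unchanged since last fetch
```

One caveat in the CDN context: a 304 from an edge node only tells you the edge's copy matched, which is exactly why understanding cache behavior matters for change monitoring.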