Glossary

CDN

A CDN, or content delivery network, is a distributed layer of servers that caches and serves website assets closer to the user. In scraping, CDNs matter because they change how content is delivered, cached, rate-limited, and blocked, especially when providers like Cloudflare or CloudFront sit in front of the origin.

Examples

A few scraping realities that show up when a CDN is in front of a site:

  • Static assets come from the CDN, not the app: images, JS, CSS, and sometimes even cached HTML are served from edge nodes
  • Different behavior by region: the same URL can return different cache states, headers, or block pages depending on where your request exits
  • Bot protection often lives there: Cloudflare is the obvious one, but even plain CDN setups can enforce rate limits before you ever touch the origin server

A quick way to see what sits in front of a site is a HEAD request:

curl -I https://example.com/

Depending on the provider, you might see headers like these (the list below mixes Cloudflare, Varnish, and CloudFront styles for illustration; a single response won't show all of them):

server: cloudflare
cf-cache-status: HIT
via: 1.1 varnish
x-cache: Miss from cloudfront
age: 342

That tells you a lot already: the response may be cached, may be coming from an edge location, and may never have hit the origin for your request.
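Reading those headers by hand gets old fast, so it helps to wrap the triage in a small function. This is a sketch: it covers the common cache headers (cf-cache-status for Cloudflare, x-cache for CloudFront and Varnish-style setups, plus age), and other CDNs use their own variants.

```python
def cache_status(headers):
    """Guess whether a response was served from a CDN edge cache.

    Checks cf-cache-status (Cloudflare), x-cache (CloudFront/Varnish),
    and the age header. Returns "hit", "miss", or "unknown".
    """
    h = {k.lower(): v for k, v in headers.items()}
    cf = h.get("cf-cache-status", "").upper()
    xc = h.get("x-cache", "").lower()
    if cf in {"HIT", "STALE", "REVALIDATED"} or "hit" in xc:
        return "hit"
    if cf in {"MISS", "BYPASS", "DYNAMIC"} or "miss" in xc:
        return "miss"
    # A nonzero age header also implies the object sat in a cache.
    age = h.get("age", "0")
    if age.isdigit() and int(age) > 0:
        return "hit"
    return "unknown"

# With the example headers shown above:
print(cache_status({"server": "cloudflare", "cf-cache-status": "HIT", "age": "342"}))  # hit
```

Feed it `r.headers` from any response and you get a quick answer to "did my request even reach the origin?" before you start debugging the application layer.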

If you're scraping rendered pages, a CDN can also sit in front of an app that still needs JavaScript execution, so you end up dealing with two separate problems:

  • Delivery layer: cache rules, edge blocks, geo behavior
  • Rendering layer: client-side JS, API calls, dynamic content
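One way to keep the two layers separate while debugging is a rough classifier over the response. The markers below are heuristics, not a real detector: the challenge-status check assumes a Cloudflare-style setup, and the "near-empty HTML" threshold is a guess.

```python
def triage(status, headers, body):
    """Rough guess at which layer a bad response came from.

    Returns "edge-block", "needs-rendering", or "looks-ok".
    The status codes and thresholds here are heuristics only.
    """
    h = {k.lower(): v.lower() for k, v in headers.items()}
    # Challenge/block status codes served straight from a CDN edge.
    if status in (403, 429, 503) and "cloudflare" in h.get("server", ""):
        return "edge-block"
    # A 200 with a near-empty document usually means the real content
    # is filled in by client-side JavaScript after load.
    if status == 200 and len(body.strip()) < 500 and "<script" in body.lower():
        return "needs-rendering"
    return "looks-ok"
```

The point isn't the exact rules; it's that "edge-block" and "needs-rendering" call for completely different fixes (proxies and headers for the first, a rendering step for the second).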

Practical tips

  • Check response headers first: look for server, cf-cache-status, x-cache, via, age, etag
  • Treat CDN issues and site issues as different things: a cached 403, origin 403, and JS-rendered empty page are not the same failure mode
  • Vary exit geography when results look inconsistent: some CDN edges are stricter than others, some caches are stale, some regions get challenged more often
  • Don't assume HTML came from the origin: if the CDN is caching aggressively, you may be debugging the wrong layer
  • Watch for rate limits at the edge: if requests fail before session setup, login flow, or page render, the CDN is probably where you're getting stopped
  • Use a browser when the site needs it, not by default: a lot of people jump straight to headless browsers when the real problem is edge blocking or cache variance
  • If you need stability in production, use a router layer that can swap proxies, browser settings, and anti-bot handling without rewriting your scraper every week
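The "vary exit geography" tip becomes concrete once you diff the delivery-layer headers per region. This sketch assumes you already fetched the same URL through exits in several regions and collected (status, headers) pairs; the fetching itself would go through whatever proxy setup you use.

```python
def region_diff(results):
    """Compare per-region responses to the same URL.

    results maps a region name to (status_code, headers). Returns only
    the fields that differ across regions, which is usually where the
    edge is treating your exits differently.
    """
    fields = ("server", "cf-cache-status", "x-cache", "age")
    rows = {}
    for region, (status, headers) in results.items():
        h = {k.lower(): v for k, v in headers.items()}
        rows[region] = {"status": status, **{f: h.get(f, "-") for f in fields}}
    diffs = {}
    for key in ("status",) + fields:
        values = {region: row[key] for region, row in rows.items()}
        if len(set(values.values())) > 1:
            diffs[key] = values
    return diffs
```

If the diff shows one region getting 403s while another gets cache hits, you're looking at an edge decision, not a site change.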

Quick header check:

import requests

# Fetch the page, then print only the delivery-layer headers worth reading first.
r = requests.get("https://example.com/", timeout=30)
print(r.status_code)
for k, v in r.headers.items():
    if k.lower() in {"server", "cf-cache-status", "x-cache", "via", "age", "etag"}:
        print(f"{k}: {v}")

If you want ScrapeRouter to handle the ugly part for harder targets:

curl -X POST https://www.scraperouter.com/api/v1/scrape/ \
  -H "Authorization: Api-Key $api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/"
  }'

Use cases

  • Scraping ecommerce pages behind Cloudflare or CloudFront: product pages may be cached at the edge, but pricing, stock, or session-specific content may still come from the origin or client-side APIs
  • Monitoring site changes: if a CDN caches HTML, your scraper can miss updates unless you understand cache behavior and validation headers
  • Collecting content across regions: CDNs can localize content, enforce country-level controls, or route traffic differently by geography
  • Debugging inconsistent blocks: one IP range gets challenged, another works fine, because the edge layer is making decisions before the application even sees the request
  • Reducing scraper maintenance: once a target is behind multiple CDN and anti-bot layers, the work stops being "fetch a page" and turns into ongoing delivery-layer maintenance
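For the change-monitoring case above, the standard move is a conditional re-fetch built from the validation headers the CDN returned last time. If-None-Match and If-Modified-Since are standard HTTP; a 304 Not Modified answer means the content hasn't changed, regardless of what the cache is doing.

```python
def conditional_headers(prev_headers):
    """Build request headers for a conditional re-fetch.

    Takes the response headers from the previous fetch and returns
    If-None-Match / If-Modified-Since request headers.
    """
    h = {k.lower(): v for k, v in prev_headers.items()}
    out = {}
    if "etag" in h:
        out["If-None-Match"] = h["etag"]
    if "last-modified" in h:
        out["If-Modified-Since"] = h["last-modified"]
    return out

# Usage sketch: pass these on the next poll, e.g.
# requests.get(url, headers=conditional_headers(prev.headers))
# and treat a 304 as "no change since last fetch".
```

This also cuts bandwidth on monitoring jobs, since unchanged pages come back as an empty 304 instead of a full body.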

Related terms

Proxy Rotation · Headless Browser · Rate Limiting · Cloudflare · Fingerprinting · JavaScript Rendering · HTTP Headers