Glossary

Blacklisting

Blacklisting is when a site marks your IP, session, account, or request pattern as untrusted and starts blocking, throttling, or challenging you. In scraping, it usually happens because your traffic looks automated, too aggressive, or just too repetitive over time.

Examples

A few common ways blacklisting shows up in production:

  • IP blacklist: one proxy works for a while, then every request from that IP starts returning 403 or CAPTCHA.
  • Session blacklist: the IP still works, but a specific cookie or session token gets challenged every time.
  • Fingerprint blacklist: requests fail because the browser or TLS fingerprint is now associated with automation.
A minimal reproduction of the pattern:

import requests

url = "https://example.com/products"
proxies = {
    "http": "http://user:pass@proxy-host:8000",
    "https": "http://user:pass@proxy-host:8000",
}

# Every request reuses the same proxy, headers, and timing.
for i in range(5):
    r = requests.get(url, proxies=proxies, timeout=30)
    print(i, r.status_code)

This kind of script often works at first, then falls off a cliff once the target starts associating repeated traffic with the same identity.
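One way to soften that is to stop reusing a single identity. The sketch below, assuming a small pool of proxy URLs (the hostnames are placeholders), rotates through the pool so consecutive requests do not share one proxy:

```python
import itertools

import requests

# Hypothetical proxy pool; replace with real endpoints.
PROXY_POOL = [
    "http://user:pass@proxy-a:8000",
    "http://user:pass@proxy-b:8000",
    "http://user:pass@proxy-c:8000",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def fetch(url):
    # Pick the next proxy in round-robin order for this request.
    proxy = next(proxy_cycle)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=30,
    )
```

Round-robin is the simplest policy; in practice you would also drop proxies that start failing instead of cycling back to them.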

A quick way to check whether a block is tied to the proxy identity is to compare the same request with and without it:

curl -I https://example.com/products
# HTTP/1.1 200 OK

curl -I --proxy http://user:pass@proxy-host:8000 https://example.com/products
# HTTP/1.1 403 Forbidden

That does not always mean the proxy pool is bad. A lot of the time the traffic pattern is bad, or the target has already seen too much from that identity.

Practical tips

  • Treat blacklisting as a traffic management problem, not just a proxy problem.
  • Rotate more than IPs: sessions, headers, browser fingerprints, and request timing.
  • Slow down obvious burst patterns: same path, same interval, same fingerprint is how you get noticed.
  • Watch for early signals: rising 403s, more redirects to challenge pages, sudden CAPTCHA spikes, longer response times.
  • Separate retries from recovery: retrying the same blocked identity usually just burns money faster.
  • Keep per-target rules: what works on a simple ecommerce site will not hold up on a protected SERP or ticketing site.
  • If you do not want to manage that yourself, this is the kind of thing a router layer should absorb: provider switching, session handling, fingerprint variation, and failover.
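Per-target rules can start as a small config table. The hosts and thresholds below are illustrative placeholders, not recommendations:

```python
# Illustrative per-target pacing rules; tune these per site.
TARGET_RULES = {
    "example-ecommerce.com": {"min_delay_s": 2, "max_concurrency": 4},
    "protected-serp.example": {"min_delay_s": 10, "max_concurrency": 1},
}

# Conservative fallback for targets you have no data on yet.
DEFAULT_RULE = {"min_delay_s": 5, "max_concurrency": 2}

def rule_for(host):
    return TARGET_RULES.get(host, DEFAULT_RULE)
```

Keeping the rules in data rather than code makes it easy to tighten a single target after it starts challenging you, without touching the crawler itself.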
A simple detection hook, assuming `response` is the most recent requests response:

blocked_statuses = {403, 429}

if response.status_code in blocked_statuses or "captcha" in response.text.lower():
    # Retire the identity instead of blindly retrying it.
    rotate_ip = True
    rotate_session = True
    backoff_seconds = 30
The difference between noisy and quieter traffic is often just pacing:

# Bad: hammering the same endpoint from one identity
for i in {1..50}; do
  curl -s "https://example.com/search?q=shoes" > /dev/null
done

# Better: spread requests out and vary identity upstream
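A minimal Python version of the "better" approach, with randomized delays and a fresh session every few requests. The delay range and batch size are placeholder values, not tuned recommendations:

```python
import random
import time

import requests

def next_delay(base=1.0, jitter=3.0):
    # Randomized wait so requests do not land at a fixed interval.
    return base + random.uniform(0, jitter)

def crawl(urls, batch_size=10):
    session = requests.Session()
    for i, url in enumerate(urls):
        if i and i % batch_size == 0:
            # Start a fresh session so cookies and connections rotate too.
            session.close()
            session = requests.Session()
        session.get(url, timeout=30)
        time.sleep(next_delay())
```

Jitter on its own will not defeat serious detection, but it removes the fixed-interval signature that makes simple loops easy to flag.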

Use cases

  • SERP scraping: search engines blacklist fast if requests are repetitive and come from low-trust IP ranges.
  • Ecommerce monitoring: product pages may work fine, then inventory or pricing endpoints start blocking after sustained polling.
  • Account-based automation: sometimes the account gets blacklisted before the IP does, especially on logged-in flows.
  • Large-scale crawling: blacklisting becomes a cost problem fast, because failed retries, burned proxies, and engineer time pile up.

In practice, blacklisting is one of the main reasons simple scraping setups look fine in a demo and then become unreliable in production.

Related terms

  • Proxy Rotation
  • Rate Limiting
  • CAPTCHA
  • Session Management
  • Fingerprinting
  • Residential Proxies
  • Retries