Examples
A few common ways blacklisting shows up in production:
- IP blacklist: one proxy works for a while, then every request from that IP starts coming back as a 403 or a CAPTCHA challenge.
- Session blacklist: the IP still works, but a specific cookie or session token gets challenged every time.
- Fingerprint blacklist: requests fail because the browser or TLS fingerprint is now associated with automation.
A typical first attempt routes every request through a single proxy identity:

```python
import requests

url = "https://example.com/products"
proxies = {
    "http": "http://user:pass@proxy-host:8000",
    "https": "http://user:pass@proxy-host:8000",
}

for i in range(5):
    r = requests.get(url, proxies=proxies, timeout=30)
    print(i, r.status_code)
```
This kind of script often works at first, then falls off a cliff once the target starts associating repeated traffic with the same identity.
A quick way to confirm that the proxy identity, not the target, is the problem:

```shell
curl -I https://example.com/products
# HTTP/1.1 200 OK

curl -I --proxy http://user:pass@proxy-host:8000 https://example.com/products
# HTTP/1.1 403 Forbidden
```
That does not always mean the proxy pool is bad. A lot of the time the traffic pattern is bad, or the target has already seen too much from that identity.
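One quick check is whether the block follows a single identity or the whole pool. A heuristic sketch of that triage, given HTTP statuses observed through each proxy (the proxy labels and thresholds here are illustrative):

```python
def classify_block(statuses):
    """Given a mapping of proxy label -> HTTP status code, guess whether
    a block is identity-specific or pool-wide.

    Heuristic: 403 and 429 are treated as block signals.
    """
    blocked = {p for p, s in statuses.items() if s in (403, 429)}
    if not blocked:
        return "no block observed"
    if len(blocked) == len(statuses):
        return "pool-wide: likely traffic pattern or target-side rule"
    return f"identity-specific: retire {sorted(blocked)}"

# Example: two of three proxies burned, one still clean.
print(classify_block({"proxy-a": 403, "proxy-b": 403, "proxy-c": 200}))
# → identity-specific: retire ['proxy-a', 'proxy-b']
```

If every proxy in the pool is blocked, rotating harder usually will not help; the pattern itself needs to change.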
Practical tips
- Treat blacklisting as a traffic management problem, not just a proxy problem.
- Rotate more than IPs: sessions, headers, browser fingerprints, and request timing.
- Slow down obvious burst patterns: same path, same interval, same fingerprint is how you get noticed.
- Watch for early signals: rising 403s, more redirects to challenge pages, sudden CAPTCHA spikes, longer response times.
- Separate retries from recovery: retrying the same blocked identity usually just burns money faster.
- Keep per-target rules: what works on a simple ecommerce site will not hold up on a protected SERP or ticketing site.
- If you do not want to manage that yourself, this is the kind of thing a router layer should absorb: provider switching, session handling, fingerprint variation, and failover.
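Putting a few of those tips together, a minimal rotation sketch — every name here is illustrative, not a real router API, and the proxy and user-agent pools are placeholders:

```python
import random
import uuid

# Illustrative identity pools; in practice these come from your
# proxy provider and fingerprint tooling.
PROXIES = [
    "http://user:pass@proxy-a:8000",
    "http://user:pass@proxy-b:8000",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def new_identity():
    """Rotate more than the IP: proxy, headers, and a fresh session id."""
    return {
        "proxy": random.choice(PROXIES),
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "session_id": uuid.uuid4().hex,
    }

def next_delay(base=2.0, jitter=1.0):
    """Break up fixed-interval bursts with randomized spacing."""
    return base + random.uniform(0, jitter)

identity = new_identity()
for attempt in range(3):
    # ...issue the request with `identity` here, then on a block signal:
    blocked = False  # placeholder for a real block check
    if blocked:
        identity = new_identity()  # retire, don't retry the burned identity
    # sleep(next_delay()) between requests in a real loop
```

The point of the sketch is the shape: identity is a bundle that gets retired as a unit, and timing is randomized rather than fixed.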
A minimal block check, kept separate from the retry logic:

```python
blocked_statuses = {403, 429}

def looks_blocked(response):
    """Hard status codes and CAPTCHA markers both count as block signals."""
    return (response.status_code in blocked_statuses
            or "captcha" in response.text.lower())

if looks_blocked(response):
    # retire the identity instead of blindly retrying it
    rotate_ip = True
    rotate_session = True
    backoff_seconds = 30
```
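A fixed 30-second backoff is a placeholder; the usual refinement is a capped exponential backoff with jitter, so repeated blocks push wait times up without stalling the whole pipeline on one hot endpoint. A sketch, not tied to any particular library:

```python
import random

def backoff_seconds(attempt, base=30, cap=600):
    """Exponential backoff with full jitter: up to 30s on the first
    blocked attempt, up to 60s on the next, doubling until `cap`."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (picking uniformly from 0 up to the ceiling) also stops a fleet of workers from retrying in lockstep after a shared block.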
```shell
# Bad: hammering the same endpoint from one identity
for i in {1..50}; do
  curl -s "https://example.com/search?q=shoes" > /dev/null
done

# Better: spread requests out and vary identity upstream
```
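The "better" pattern above can be sketched in Python: randomized spacing and a different upstream identity per request. The proxy list and delay bounds are illustrative:

```python
import random
import time

PROXIES = ["http://proxy-a:8000", "http://proxy-b:8000", "http://proxy-c:8000"]

def paced_requests(urls, min_gap=1.0, max_gap=4.0):
    """Yield (url, proxy, delay) triples: vary the upstream identity and
    avoid the fixed-interval rhythm that gets a scraper noticed."""
    for url in urls:
        yield url, random.choice(PROXIES), random.uniform(min_gap, max_gap)

for url, proxy, delay in paced_requests(["https://example.com/search?q=shoes"] * 3):
    # issue the request through `proxy` here, then sleep to spread traffic
    time.sleep(0)  # replace 0 with `delay` in real use
```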
Use cases
- SERP scraping: search engines blacklist fast if requests are repetitive and come from low-trust IP ranges.
- Ecommerce monitoring: product pages may work fine, then inventory or pricing endpoints start blocking after sustained polling.
- Account-based automation: sometimes the account gets blacklisted before the IP does, especially on logged-in flows.
- Large-scale crawling: blacklisting becomes a cost problem fast, because failed retries, burned proxies, and engineer time pile up.
In practice, blacklisting is one of the main reasons simple scraping setups look fine in a demo and then become unreliable in production.