Glossary

Greylisting IP

An IP is greylisted when a site does not fully ban it, but quietly degrades or limits it because it looks suspicious. In scraping, this usually shows up as intermittent 403s, slower responses, CAPTCHA pages, empty results, or requests that work in a browser but fail from your scraper.

Examples

A greylisted IP is annoying because nothing is fully broken. Some requests still work, which makes debugging slower than a clean ban.

  • Search pages load, detail pages fail: category URLs return 200, product pages start returning 403
  • Responses get slower over time: first 20 requests are fine, then latency jumps and timeouts start
  • Soft blocks: the server returns 200, but the page is a CAPTCHA or a stripped-down version with missing data
  • Geo or reputation downgrade: the same request works from one IP pool and quietly fails from another
A quick way to spot a soft block from the command line is to compare the status line with the actual body:

curl -I https://targetsite.com/products/123
# HTTP/2 200  (headers look healthy)

curl https://targetsite.com/products/123 | head
# but the body is a challenge page or an empty shell of HTML
From Python, logging status, body size, and latency across repeated requests makes the pattern visible:

import requests

url = "https://targetsite.com/search?q=laptop"
for i in range(5):
    r = requests.get(url, timeout=30)
    # status alone is not enough; body length and latency tell the real story
    print(i, r.status_code, len(r.text), r.elapsed.total_seconds())

If the status stays at 200 but body length drops hard, or latency keeps climbing, that is often greylisting rather than a normal site issue.
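That heuristic can be turned into a concrete check by comparing each response against the first healthy one. This is a sketch, not a tuned detector: the `shrink_ratio` and `slow_factor` thresholds are illustrative, and the function works on recorded `(status, body_len, latency)` samples so you can feed it whatever your scraper already logs.

```python
def degraded_samples(samples, shrink_ratio=0.5, slow_factor=3.0):
    """Flag samples that look degraded relative to the first 200 response.

    samples: list of (status_code, body_len, latency_seconds) tuples.
    Thresholds are illustrative starting points, not tuned values.
    """
    baseline = next(((l, t) for s, l, t in samples if s == 200), None)
    if baseline is None:
        # no healthy response at all: everything is suspect
        return [True] * len(samples)
    base_len, base_lat = baseline
    return [
        s != 200                         # hard failure
        or l < base_len * shrink_ratio   # body shrank hard
        or t > base_lat * slow_factor    # latency climbed
        for s, l, t in samples
    ]
```

A run of `[False, False, True, True, ...]` against a stable URL is a stronger greylisting signal than any single bad response.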

Practical tips

  • Watch for soft-failure signals: body length changes, challenge keywords, redirect loops, unusual latency, empty JSON payloads
  • Do not treat only 403 and 429 as blocking: greylisting often hides inside 200 responses
  • Rotate earlier, not later: once an IP starts looking degraded, it usually gets worse
  • Reduce request burstiness: randomize pacing, limit concurrency per domain, avoid hitting the same path pattern too hard
  • Vary the full request fingerprint: headers, TLS/client profile, cookies, session behavior; IP rotation alone is often not enough
  • Separate healthy and degraded sessions: if one session starts getting weird responses, quarantine it instead of reusing it
  • Measure by pool: compare success rate, latency, and challenge rate across datacenter, residential, and mobile IPs
  • If you use ScrapeRouter: this is the kind of thing a router layer should handle for you, because the real problem is not just getting an IP; it is knowing when an IP is technically alive but operationally bad
A minimal check that catches some of these soft-failure signals:

blocked_markers = ["captcha", "access denied", "verify you are human"]

def looks_greylisted(response):
    # Flag 200 responses that contain challenge markers, or any response
    # that is suspiciously slow (the 10 s cutoff is an arbitrary example).
    text = response.text.lower()
    return (
        response.status_code == 200 and any(m in text for m in blocked_markers)
    ) or response.elapsed.total_seconds() > 10
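The "quarantine degraded sessions" tip can be sketched as a small health tracker. This is illustrative, not any particular library's API: `SessionPool` and its strike threshold are made up for the example, and the session objects can be anything hashable (e.g. `requests.Session` instances).

```python
import random

class SessionPool:
    """Track per-session health and quarantine sessions that start
    returning suspicious responses (sketch; names are illustrative)."""

    def __init__(self, sessions, max_strikes=3):
        self.healthy = list(sessions)
        self.quarantined = []
        self.strikes = {s: 0 for s in sessions}
        self.max_strikes = max_strikes

    def pick(self):
        # Spread load across healthy sessions instead of hammering one.
        return random.choice(self.healthy) if self.healthy else None

    def report(self, session, suspicious):
        """Call after each request with the result of your own
        soft-block check; quarantine after repeated strikes."""
        if suspicious:
            self.strikes[session] += 1
            if self.strikes[session] >= self.max_strikes and session in self.healthy:
                self.healthy.remove(session)
                self.quarantined.append(session)
        else:
            self.strikes[session] = 0  # a clean response resets the counter
```

Quarantining early matters because a degraded session rarely recovers on its own; reusing it just burns requests and muddies your metrics.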

Use cases

  • High-volume product scraping: a retailer starts slowing one proxy subnet instead of banning it outright, so jobs drag for hours before failing
  • SERP collection: search result pages return partial or challenge-filled HTML from some IPs, while others still look normal
  • Account/session workflows: login works, but post-login pages get intermittently blocked because the IP has a poor reputation
  • API scraping behind web defenses: endpoints keep returning 200 with empty datasets for certain IPs, which looks like a parser bug until you compare across pools
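Comparing across pools, as in the last use case, only works if you record results in a comparable way. A rough sketch of that bookkeeping, assuming you already log one `(pool_name, status_code, challenged, latency_seconds)` tuple per request (the field names are illustrative):

```python
from collections import defaultdict
from statistics import mean

def summarize_by_pool(results):
    """Aggregate per-pool success rate, challenge rate, and mean latency
    so a quietly degraded pool stands out next to a healthy one.

    results: list of (pool_name, status_code, challenged, latency_seconds).
    """
    grouped = defaultdict(list)
    for pool, status, challenged, latency in results:
        grouped[pool].append((status, challenged, latency))
    summary = {}
    for pool, rows in grouped.items():
        # "success" means a clean 200: right status AND no challenge page
        ok = [s == 200 and not c for s, c, _ in rows]
        summary[pool] = {
            "success_rate": sum(ok) / len(rows),
            "challenge_rate": sum(c for _, c, _ in rows) / len(rows),
            "mean_latency": mean(l for _, _, l in rows),
        }
    return summary
```

A pool whose success rate is fine but whose challenge rate or latency is drifting upward is often the first visible sign of greylisting.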

Related terms

  • IP Ban
  • Proxy Rotation
  • Rate Limiting
  • CAPTCHA
  • Fingerprinting
  • Residential Proxy
  • Datacenter Proxy
  • Soft Block