Glossary

B

Backconnect Proxy

A backconnect proxy is a proxy endpoint that stays the same on your side while the provider rotates the exit IP behind it. In practice, it is a convenience layer for large proxy pools, usually used in scraping when you need rotation without constantly updating proxy lists yourself.

proxy rotation scraping networking infrastructure

Backoff

Backoff is the practice of waiting before retrying a request after a failure, block, or rate limit. In scraping, it helps you avoid hammering a site when it is already telling you to slow down, which improves stability and lowers the chance of getting banned.

retries rate-limits scraping reliability http
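
A minimal sketch of exponential backoff around a single GET, using the requests library; the URL and retry cap are placeholders:

```python
import time

import requests

def fetch_with_backoff(url, max_retries=5):
    """Retry a GET, waiting longer after each failure or 429."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code != 429:
                return resp
        except requests.RequestException:
            pass  # network error: fall through to the wait below
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```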

Bandwidth

Bandwidth is the amount of data your scraper sends and receives over the network. In scraping, it directly affects cost, speed, and how noisy your crawler looks to the target site, especially when you're pulling full pages, images, scripts, and retries you didn't actually need.

bandwidth performance cost network scraping optimization

Base64

Base64 is a way to encode binary data as plain text using a limited set of ASCII characters. You see it all over the web in things like image blobs, tokens, API payloads, and sometimes scraped responses where the useful data is wrapped in an extra decoding step.

encoding data api parsing scraping
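
A quick stdlib example; the input here is just the ASCII string "scraping":

```python
import base64

encoded = base64.b64encode(b"scraping").decode("ascii")
print(encoded)                    # c2NyYXBpbmc=
print(base64.b64decode(encoded))  # b'scraping'
```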

Bearer token

A bearer token is an access token sent in the Authorization header to prove the client is allowed to use an API. Whoever has the token can use it, so in practice it works like a password for API requests and needs to be handled with the same care.

auth api http security tokens scraping
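
A minimal sketch with requests; the endpoint and token are placeholders, and where the token comes from depends entirely on the API's login flow:

```python
import requests

token = "YOUR_TOKEN_HERE"  # placeholder: a real token comes from your auth flow
resp = requests.get(
    "https://api.example.com/v1/items",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
print(resp.status_code)
```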

Behavioral detection

Behavioral detection is a class of anti-bot checks that looks at how a visitor behaves, not just what IP, headers, or fingerprint they show up with. It flags automation when timing, scrolling, clicks, navigation flow, and page interaction patterns look too clean, too fast, or too mechanically consistent to be a real user.

antibot detection browser scraping automation

Blacklisting

Blacklisting is when a site marks your IP, session, account, or request pattern as untrusted and starts blocking, throttling, or challenging you. In scraping, it usually happens because your traffic looks automated, too aggressive, or just too repetitive over time.

blocking proxies anti-bot scraping sessions fingerprints

Bounded Randomness

Bounded randomness means adding variation to scraper behavior inside sensible limits instead of using fixed, perfectly repeatable timing. In practice, that usually means random delays, dwell times, and request spacing that stay within a defined range, so you look less bot-like without turning the job into chaos.

scraping timing antibot throttling requests automation
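
A small sketch of the idea; the delay band and URLs are made up:

```python
import random
import time

def human_ish_pause(low=1.5, high=4.0):
    """Sleep for a random duration inside a defined band, never a fixed value."""
    time.sleep(random.uniform(low, high))

for url in ["https://example.com/a", "https://example.com/b"]:
    print("fetching", url)  # stand-in for the actual request
    human_ish_pause()       # spacing varies per request but stays bounded
```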

Browser context

A browser context is an isolated browser session inside a single browser process. It gives you separate cookies, storage, and session state without paying the cost of launching a whole new browser every time, which matters a lot once you stop scraping one page at a time and start running real workloads.

browser playwright session rendering javascript automation scraping
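
A Playwright sketch showing two isolated contexts sharing one browser process; the URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    ctx_a = browser.new_context()  # separate cookies, storage, session state
    ctx_b = browser.new_context()
    page_a = ctx_a.new_page()
    page_b = ctx_b.new_page()
    page_a.goto("https://example.com")
    page_b.goto("https://example.com")  # sees none of ctx_a's state
    browser.close()
```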

Browser profile

A browser profile is the saved state a browser carries between sessions: cookies, local storage, cache, login state, preferences, and sometimes fingerprint-related settings. In scraping and automation, profiles matter when you need a session to keep behaving like the same user instead of starting from zero on every run.

browser automation session cookies state scraping

Burst limit

A burst limit is the maximum number of requests you can send in a short spike before rate limiting kicks in. It matters because many systems allow brief bursts above the steady request rate, but they still block you if that spike is too large or happens too often.

rate-limiting scraping throughput concurrency 429 throttling
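
Servers often implement this with a token bucket; a client-side sketch of the same idea, with made-up rate and capacity numbers:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate=5.0, capacity=20):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def acquire(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) / self.rate)  # wait out the burst
            return self.acquire()
        self.tokens -= 1

bucket = TokenBucket()  # spike up to 20 requests, then steady 5 per second
```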

C

Canonical URL

A canonical URL is the preferred URL for a page when the same content is available at multiple URLs. It tells search engines which version should be treated as the main one, so ranking signals do not get split across duplicates, parameterized URLs, or near-identical versions.

seo url html crawling indexing
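
Extracting it is usually a one-liner once you have the HTML; a sketch with requests and BeautifulSoup against a placeholder URL:

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/product?ref=promo", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

link = soup.find("link", rel="canonical")
if link:
    print(link["href"])  # the URL the site declares as the main version
```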

Canvas Fingerprinting

Canvas fingerprinting is a browser fingerprinting technique where a site uses the HTML5 canvas API to draw hidden text or images, then hashes the rendered result to help identify a browser. It matters in scraping because those rendering differences can be used as a tracking signal, especially when your browser setup is inconsistent, headless, or obviously automated.

fingerprinting browser anti-bot scraping detection headless

CAPTCHA

A CAPTCHA is a challenge a site shows to figure out whether the visitor is a human or an automated script. In scraping, it usually means the target thinks your traffic looks suspicious, often because of bad IPs, broken browser fingerprints, or bot-like request patterns.

anti-bot browser captcha detection production proxies scraping

CDATA

CDATA is an XML section that tells the parser to treat the contents as raw text instead of markup. It’s mainly there so characters like < and & can appear without being escaped, which comes up a lot in RSS feeds, XML APIs, and embedded HTML or JavaScript.

xml parsing rss feeds markup
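
Most XML parsers unwrap CDATA for you; a stdlib example with a small made-up RSS fragment:

```python
import xml.etree.ElementTree as ET

item = """<item>
  <title><![CDATA[Prices & discounts <b>update</b>]]></title>
</item>"""

# ElementTree treats the CDATA contents as plain character data.
print(ET.fromstring(item).find("title").text)
# Prices & discounts <b>update</b>
```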

CDN

A CDN, or content delivery network, is a distributed layer of servers that caches and serves website assets closer to the user. In scraping, CDNs matter because they change how content is delivered, cached, rate-limited, and blocked, especially when providers like Cloudflare or CloudFront sit in front of the origin.

cdn infrastructure delivery caching cloudflare cloudfront blocking scraping

CDP (Chrome DevTools Protocol)

CDP is the low-level protocol Chrome and other Chromium-based browsers expose for remote control, usually over a WebSocket connection. It lets you do the same kinds of things DevTools does: inspect pages, run JavaScript, intercept network traffic, read cookies, and capture screenshots. In scraping, people use it because it gives more direct browser control than higher-level automation libraries.

browser chrome cdp automation rendering scraping
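
A sketch using Playwright's raw CDP session to watch network responses; `Network.enable` and `Network.responseReceived` are real protocol methods, the URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    cdp = page.context.new_cdp_session(page)  # raw DevTools protocol channel
    cdp.send("Network.enable")
    cdp.on("Network.responseReceived",
           lambda e: print(e["response"]["status"], e["response"]["url"]))

    page.goto("https://example.com")
    browser.close()
```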

City-Level Routing

City-level routing means sending a scraping request through an IP located in a specific city instead of just picking a country or region. You use it when a site changes results, pricing, inventory, or anti-bot behavior based on the user’s apparent location, and country-level targeting is too blunt to be useful.

proxies routing geotargeting localization scraping

Concurrency

Concurrency is sending or processing multiple scraping tasks at the same time instead of waiting for each one to finish before starting the next. In scraping, that mostly means keeping many requests in flight at once so jobs finish faster, but pushing it too hard gets you rate-limited, blocked, or just creates a new reliability problem.

scraping performance async concurrency python scaling
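
A sketch with asyncio and httpx, using a semaphore so "many requests in flight" stays a number you chose; the URLs and the limit are placeholders:

```python
import asyncio

import httpx

async def fetch_all(urls, limit=10):
    sem = asyncio.Semaphore(limit)  # cap how many requests are in flight

    async def fetch(client, url):
        async with sem:
            resp = await client.get(url)
            return url, resp.status_code

    async with httpx.AsyncClient(timeout=10) as client:
        return await asyncio.gather(*(fetch(client, u) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(50)]
print(asyncio.run(fetch_all(urls)))
```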

Content-Type

Content-Type is an HTTP header that tells you what kind of data is in the request or response body, like HTML, JSON, XML, or an image. In scraping, it matters because the body might not be what you expected, and treating JSON like HTML or a PDF like text is how parsers break in production.

http headers content-type api scraping
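
A small sketch of branching on the header instead of assuming; the URL is a placeholder:

```python
import requests

resp = requests.get("https://example.com/data", timeout=10)
ctype = resp.headers.get("Content-Type", "")

# Branch on what the server actually sent, not on what you expected.
if "application/json" in ctype:
    payload = resp.json()
elif "text/html" in ctype:
    payload = resp.text     # hand this to an HTML parser
else:
    payload = resp.content  # raw bytes: PDF, image, or something unexpected
```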

Cookies

Cookies are small pieces of data a website stores and sends back on later requests to keep track of sessions, logins, preferences, and basic state. In scraping, they matter because a lot of sites stop working the moment you ignore them, reuse them badly, or lose them between requests.

cookies http sessions authentication web-scraping anti-bot
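
The usual fix in Python is a requests.Session, which stores cookies and replays them for you; the URLs are placeholders:

```python
import requests

session = requests.Session()  # keeps cookies between requests automatically

session.get("https://example.com/", timeout=10)  # response may set cookies
print(session.cookies.get_dict())

# Later requests send those cookies back, so the site sees one continuous visit.
session.get("https://example.com/account", timeout=10)
```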

CORS

CORS, short for Cross-Origin Resource Sharing, is a browser security mechanism that controls whether JavaScript running on one origin can make requests to another. It matters a lot if you're scraping from frontend code, but it does not apply the same way to server-side scrapers, which is why people often hit it in the browser and then overcomplicate the fix.

cors browser security frontend http scraping

Crawling

Crawling is the process of discovering pages by starting from one or more URLs, fetching them, extracting links, and following those links across a site or across the web. It is about finding what exists and what changed; scraping is the separate step where you extract the data you actually care about.

crawling discovery scraping spider web
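
A tiny breadth-first crawler showing the discover-fetch-follow loop; requests and BeautifulSoup are assumed, and the start URL is a placeholder:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=20):
    """Fetch pages, extract links, follow same-host ones until the cap."""
    seen, queue, fetched = {start_url}, deque([start_url]), 0
    host = urlparse(start_url).netloc
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        fetched += 1
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)      # discovery: remember it exists
                queue.append(link)  # and schedule it for fetching
    return seen

print(crawl("https://example.com"))
```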

CSS

CSS usually means Cascading Style Sheets, the language browsers use to control how HTML looks on the page. In scraping, though, people often mean CSS selectors: the pattern syntax used to find elements like buttons, links, product titles, or price blocks inside a document.

css selectors html dom parsing scraping
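
The selector sense, in a BeautifulSoup sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

html = """<div class="product">
  <h2 class="title">Mechanical keyboard</h2>
  <span class="price">$89.00</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("div.product h2.title").get_text())  # Mechanical keyboard
print(soup.select_one(".price").get_text())                # $89.00
```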

G

Geolocation

Geolocation is the location context a request appears to come from, usually at the country, region, or city level. In scraping, it matters because many sites change content, pricing, availability, or blocking behavior based on where they think the visitor is coming from.

geolocation proxy localization geo-blocking scraping

Geo-Targeting

Geo-targeting means sending requests from a specific country, region, or city so a website returns the version of the page that real users in that location would see. In scraping, this matters because prices, availability, search results, and even entire pages often change by geography, and if you ignore that, your data is wrong before you even start cleaning it.

geotargeting proxies localization scraping serp ecommerce

GraphQL

GraphQL is an API query language that lets a client ask for exactly the fields it wants instead of taking a fixed response shape from a REST endpoint. In scraping, it matters because many modern sites load data through GraphQL behind the frontend, which is often cleaner and more stable to work with than parsing constantly changing HTML.

graphql api json frontend scraping
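
The transport is almost always a POST with `query` and `variables` in a JSON body; the endpoint and schema below are made up for illustration:

```python
import requests

query = """
query Product($id: ID!) {
  product(id: $id) { name price inStock }
}
"""

resp = requests.post(
    "https://example.com/graphql",  # hypothetical endpoint
    json={"query": query, "variables": {"id": "123"}},
    timeout=10,
)
print(resp.json()["data"]["product"])
```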

Greylisting IP

Greylisting IP is when a site does not fully ban your IP, but quietly degrades or limits it because it looks suspicious. In scraping, this usually shows up as intermittent 403s, slower responses, CAPTCHA pages, empty results, or requests that work in a browser but fail from your scraper.

ip blocking proxy scraping anti-bot rate-limit
H

Headful Browser

A headful browser is a browser running with a visible UI, like a normal desktop Chrome or Firefox session. In scraping, people use it when sites behave differently in headless mode, when debugging is easier with a real window, or when they need browser behavior that looks more like an actual user session.

browser scraping automation headful rendering anti-bot

Headless Browser

A headless browser is a real browser running without a visible UI, usually controlled by code. In scraping, you use it when a site needs JavaScript execution, real rendering, or browser-like behavior that plain HTTP requests won’t handle reliably.

browser headless rendering javascript automation scraping

Honeypot

A honeypot is a trap used to catch bots or suspicious automation by exposing something a real user usually would not touch, like a hidden link, invisible form field, or fake endpoint. In scraping, hitting a honeypot is a fast way to get flagged because it tells the site you are parsing the page mechanically instead of behaving like a normal browser session.

scraping bot-detection anti-bot security crawler

HTTP

HTTP, short for Hypertext Transfer Protocol, is the basic request-response protocol browsers, APIs, and scrapers use to talk to web servers. In practice, it’s the layer where you send a request like GET or POST and get back a response with status codes, headers, and a body, which is why scraping usually starts here before the real mess begins.

http protocols web scraping networking

Hydration

Hydration is the step where client-side JavaScript takes server-rendered HTML and turns it into a live app by attaching state, event handlers, and component logic. For scraping, it matters because a lot of modern sites ship useful data in the page before hydration finishes, and that data is often easier to extract than waiting for the fully rendered UI.

javascript rendering frontend nextjs scraping
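
One common example: Next.js sites ship the pre-hydration state in a `__NEXT_DATA__` script tag. A sketch of pulling it out, assuming a Next.js target; the payload shape varies per site:

```python
import json

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=10).text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("script", id="__NEXT_DATA__")  # Next.js convention
if tag:
    state = json.loads(tag.string)
    print(state["props"]["pageProps"])  # exact shape differs per site
```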

J

JA3/JA4 fingerprint

JA3 and JA4 are TLS client fingerprints used to identify patterns in how a client starts HTTPS connections. In scraping, they matter because many bot defenses use them to spot traffic from default Python HTTP stacks, headless tooling, or other non-browser clients before your request even gets to the page.

tls fingerprinting bot-detection anti-bot scraping https networking
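
There is no stdlib fix for this; the usual workaround is a client that imitates a browser's TLS handshake, such as curl_cffi. A sketch, assuming the library is installed and that "chrome" is a valid impersonation target in your version:

```python
from curl_cffi import requests  # pip install curl_cffi

# Sends a Chrome-like TLS ClientHello, so the JA3/JA4 fingerprint matches a
# real browser instead of a default Python HTTP stack.
resp = requests.get("https://example.com", impersonate="chrome")
print(resp.status_code)
```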

Jitter

Jitter is a small random delay added to retries, request timing, or backoff so your traffic does not line up in neat bursts. In scraping, it matters because synchronized retries are a great way to hit the same rate limit again, especially when you run many workers, sessions, or accounts at once.

retries backoff rate-limits concurrency scraping
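
"Full jitter" (pick a random wait between zero and the current backoff ceiling) is a common variant; a minimal sketch:

```python
import random
import time

def backoff_with_jitter(attempt, base=1.0, cap=60.0):
    """Sleep a random amount up to the exponential ceiling for this attempt."""
    ceiling = min(cap, base * (2 ** attempt))
    time.sleep(random.uniform(0, ceiling))

# Two workers on attempt 3 now wait different amounts (anywhere in 0-8s)
# instead of both retrying at exactly the same moment.
backoff_with_jitter(attempt=3)
```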

JSON

JSON stands for JavaScript Object Notation. It’s a plain text format for structured data, built from key-value pairs and arrays, and it’s what most scraping APIs return because machines can work with it without the usual HTML cleanup mess.

json data-format api scraping parsing

JSON-LD

JSON-LD is structured data embedded in a page, usually inside a <script type="application/ld+json"> tag. For scraping, it matters because sites often put clean entity data there: product details, article metadata, breadcrumbs, ratings, offers, and other fields that are much easier to parse than the visible HTML.

json-ld structured-data schema html parsing scraping seo
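
A sketch that pulls every JSON-LD block out of a page; the HTML here is a made-up product snippet:

```python
import json

from bs4 import BeautifulSoup

html = """<script type="application/ld+json">
{"@type": "Product", "name": "Espresso machine",
 "offers": {"price": "249.00", "priceCurrency": "EUR"}}
</script>"""

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("script", type="application/ld+json"):
    data = json.loads(tag.string)
    print(data["name"], data["offers"]["price"])
```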

JSON Schema

JSON Schema is a way to define the shape of JSON data: which fields exist, what types they are, and what counts as valid. In scraping, it gives you a contract for the output so you get structured data you can actually rely on instead of vaguely shaped JSON that breaks downstream.

json schema validation structured-data extraction scraping
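
A sketch with the jsonschema library; the schema and record are made up:

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name", "price"],
}

scraped = {"name": "Espresso machine", "price": 249.0}
try:
    validate(instance=scraped, schema=schema)  # silent if the shape is right
except ValidationError as e:
    print("bad record:", e.message)            # catch shape drift early
```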

JWT

JWT stands for JSON Web Token: a compact token format used to send claims like user identity, expiration, and permissions between a client and a server. In scraping, you mostly run into JWTs when an API expects a Bearer token after login, especially on SPAs and mobile app backends.

auth jwt api headers login session security
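
JWTs are three Base64url parts joined by dots, so you can inspect the claims with the stdlib alone. This decodes without verifying the signature, which is fine for inspection and nothing else:

```python
import base64
import json

def jwt_claims(token):
    """Decode a JWT payload without verifying the signature (inspection only)."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped Base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

# A captured token from a login response would go here, e.g.:
# print(jwt_claims(token)["exp"])  # check when the session expires
```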

R

RDFa

RDFa is a way to embed structured data directly into HTML using attributes on existing elements. In scraping, it matters when a page exposes metadata like product details, authorship, or schema markup in the DOM instead of putting it in JSON-LD.

structured-data html metadata schema seo parsing

Residential Proxy

A residential proxy routes your requests through IP addresses assigned by consumer internet providers, so the traffic looks like it is coming from a normal home user instead of a data center. In scraping, people use them because they get blocked less often on sites that score IP reputation aggressively, but they cost more and add another thing that can fail.

proxy residential scraping networking anti-bot

REST

REST, short for Representational State Transfer, is a common way to design web APIs around resources, HTTP methods, and standard status codes. In practice, it means you make predictable requests like GET, POST, PUT, and DELETE to URLs and get structured responses back, usually JSON.

api rest http json web

Robots.txt

Robots.txt is a text file on a website, usually served from /robots.txt, that tells crawlers which paths they are allowed or asked not to crawl. It is a crawler-facing policy file, not an enforcement mechanism, so decent bots read it and bad ones ignore it.

robots crawling compliance scraping crawler web
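
Python ships a parser for it; a sketch against a placeholder site and user agent:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the file

print(rp.can_fetch("MyCrawler/1.0", "https://example.com/products/"))
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/admin/"))
```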

Rotating Proxies

Rotating proxies are proxy networks that change the IP address used for outgoing requests, either on every request or on a defined schedule. In scraping, they help reduce bans, rate limits, and captchas, but they do not magically fix bad request patterns, broken sessions, or sloppy scraper behavior.

proxies scraping networking anti-bot infrastructure
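
Providers usually rotate behind a single endpoint for you, but a DIY sketch of the same idea over a small pool looks like this; the proxy URLs are placeholders:

```python
import itertools

import requests

proxy_pool = itertools.cycle([
    "http://user:pass@proxy1.example.com:8000",  # hypothetical endpoints
    "http://user:pass@proxy2.example.com:8000",
])

for url in ["https://example.com/a", "https://example.com/b"]:
    proxy = next(proxy_pool)  # a different exit IP for each request
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
    print(url, resp.status_code)
```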

RSS

RSS is an XML-based feed format websites use to publish new content in a structured, machine-readable way. For scraping, it’s the easy path when a site gives you one: fewer moving parts, less breakage, and no need to render pages just to detect what changed.

rss feeds xml scraping discovery monitoring
S

Sandbox

A sandbox is a website or environment built for testing scraping code without the usual production mess. It gives you stable pages, predictable structure, and explicit permission to scrape, which makes it useful for learning, debugging selectors, and checking whether your tooling works before you point it at real sites.

sandbox testing debugging learning scraping

ScrapeRouter

ScrapeRouter is a web scraping API that routes each request through the scraper and proxy setup that fits the target, then returns one normalized response. It is built for production scraping, where the real problem is not fetching one page once, but keeping many targets working without filling your codebase with routing, rendering, retry, and provider-specific logic.

SDK

SDK stands for software development kit: a packaged set of code, helpers, and documentation that makes it easier to use an API or platform from your language of choice. In scraping, an SDK usually saves you from hand-rolling request signing, retries, headers, and response parsing every time.

sdk api developer-tools integration client

SERP

SERP stands for search engine results page: the page a search engine returns after someone searches for a query. In scraping, it usually means collecting structured data from Google, Bing, or other search result pages without manually parsing a mess of ads, maps, snippets, and ranking changes every week.

serp search seo scraping google ranking

Session

A session is the state a site keeps across multiple requests so it can treat them as coming from the same user flow. In scraping, that usually means cookies, auth state, cart state, CSRF tokens, or other bits that need to persist, otherwise things work for one request and then quietly break on the next.

session cookies auth state http scraping

Shadow DOM

Shadow DOM is a browser feature that lets a component keep its HTML and CSS inside an isolated subtree, so normal selectors often can’t see or reach it. For scraping, that usually means the element exists on the page but your parser or selector still comes back empty unless you explicitly traverse into the shadow root.

javascript browser dom rendering selectors frontend
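
Playwright's CSS selectors pierce open shadow roots by default (closed ones stay hidden), which is the usual practical workaround; a sketch against a hypothetical component page:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # hypothetical page using web components

    # This finds the element inside an open shadow root, where a static
    # HTML parser would come back empty.
    print(page.locator(".price").first.text_content())
    browser.close()
```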

Sitemap.xml

Sitemap.xml is an XML file that lists URLs a site wants crawlers to find, often with metadata like last modified date, update frequency, or priority. For scraping, it is one of the simplest ways to discover pages at scale without clicking through navigation, category trees, or endless pagination.

sitemap xml discovery crawling seo scraping
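
A stdlib sketch that lists URLs and last-modified dates; the sitemap URL is a placeholder, the namespace is the real one from the sitemap spec:

```python
import xml.etree.ElementTree as ET

import requests

xml = requests.get("https://example.com/sitemap.xml", timeout=10).content
root = ET.fromstring(xml)

# Sitemap elements live in this namespace; bare tag names won't match.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for url in root.findall("sm:url", ns):
    print(url.findtext("sm:loc", namespaces=ns),
          url.findtext("sm:lastmod", default="", namespaces=ns))
```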

SNI

SNI stands for Server Name Indication, a TLS extension that lets a client say which hostname it wants before the HTTPS connection is fully set up. In practice, that matters because many sites share the same IP, and if the SNI value is wrong, missing, or blocked, the request can fail before scraping even gets to HTTP.

tls https networking proxy transport
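
You can see SNI directly in the stdlib: `server_hostname` is the SNI value. A small sketch against example.com:

```python
import socket
import ssl

ctx = ssl.create_default_context()
with socket.create_connection(("example.com", 443)) as sock:
    # server_hostname is sent as SNI before the handshake completes, so the
    # server knows which certificate to present on a shared IP.
    with ctx.wrap_socket(sock, server_hostname="example.com") as tls:
        print(tls.version(), tls.getpeercert()["subject"])
```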

SOCKS5 proxy

A SOCKS5 proxy is a low-level proxy protocol that forwards network traffic without rewriting it the way HTTP proxies do. It works with more kinds of traffic, including HTTPS, WebSockets, and non-HTTP connections, which makes it useful in scraping setups where HTTP proxies start breaking in annoying ways.

proxy socks5 networking scraping infrastructure
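
With requests, it is just a URL scheme plus an extra dependency; the proxy endpoint is a placeholder:

```python
import requests  # needs the SOCKS extra: pip install "requests[socks]"

proxy = "socks5h://user:pass@proxy.example.com:1080"  # socks5h: DNS via proxy
proxies = {"http": proxy, "https": proxy}

resp = requests.get("https://example.com", proxies=proxies, timeout=15)
print(resp.status_code)
```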

SSR

SSR stands for server-side rendering: the server builds the HTML before sending it to the browser. For scraping, that matters because SSR pages often expose the data you need directly in the initial response, so you can skip the whole browser-rendering mess.

rendering javascript frontend scraping html web

W

WAF

A WAF, or Web Application Firewall, sits in front of a site and filters requests before they reach the application. In scraping, it is one of the main things blocking you in production: rate limits, CAPTCHAs, bot checks, fingerprinting, and silent 403s often come from the WAF, not the site itself.

waf anti-bot blocking scraping infrastructure security

WebDriver

WebDriver is the browser automation interface tools like Selenium use to control a real browser. In scraping, it matters when the page only renders data after JavaScript runs or when you need to click, scroll, type, or wait for elements like an actual user session.

webdriver selenium browser automation javascript rendering scraping
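
A minimal Selenium sketch, assuming Selenium 4+ and a local Chrome (recent versions fetch the driver binary for you); the URL is a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # speaks WebDriver to a real Chrome instance
try:
    driver.get("https://example.com")
    # The DOM here is the post-JavaScript version, not the raw HTML response.
    print(driver.find_element(By.CSS_SELECTOR, "h1").text)
finally:
    driver.quit()
```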

WebGL

WebGL is the browser API that lets websites render GPU-accelerated 2D and 3D graphics inside a page, usually through a canvas element, without plugins. For scraping, it matters less because of the graphics themselves and more because WebGL exposes hardware and rendering details that anti-bot systems use for fingerprinting.

webgl browser fingerprinting anti-bot scraping automation

WebRTC

WebRTC is a browser technology for real-time peer-to-peer communication, usually used for audio, video, and direct data transfer between clients. In scraping, it rarely matters as something you scrape; it matters because it can leak your real IP, bypass proxy assumptions, and make browser automation behave differently than a plain HTTP client.

webrtc browser networking fingerprinting proxy anti-bot

Whitelisting

Whitelisting means explicitly allowing a specific IP, API key, domain, or account to access something that would otherwise be blocked or rate-limited. In scraping, it usually comes up when a target, proxy provider, or internal system only accepts traffic from approved sources, which is fine until your IPs change and things quietly break.

security access ip proxies authentication infrastructure scraping