100 terms
An API, or application programming interface, is a defined way for one piece of software to talk to another. In practice, it usually means you send a request to a service and get structured data back, instead of poking at a website or system by hand.
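In practice that looks like one HTTP call and a structured response. A minimal Python sketch, with a made-up endpoint standing in for a real API:

```python
import requests

# Hypothetical endpoint for illustration; any JSON API works the same way.
response = requests.get(
    "https://api.example.com/v1/products",
    params={"page": 1},
    timeout=10,
)
response.raise_for_status()  # surface 4xx/5xx instead of parsing junk
data = response.json()       # structured data, no HTML parsing needed
```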
ASN stands for Autonomous System Number: an identifier for a network operated by an ISP, cloud provider, or large organization on the internet. In scraping, ASN matters because many sites score traffic by network identity, not just by IP, so requests from a known data center ASN often get challenged faster than traffic from residential or ISP ASNs.
A backconnect proxy is a proxy endpoint that stays the same on your side while the provider rotates the exit IP behind it. In practice, it is a convenience layer for large proxy pools, usually used in scraping when you need rotation without constantly updating proxy lists yourself.
Backoff is the practice of waiting before retrying a request after a failure, block, or rate limit. In scraping, it helps you avoid hammering a site when it is already telling you to slow down, which improves stability and lowers the chance of getting banned.
Bandwidth is the amount of data your scraper sends and receives over the network. In scraping, it directly affects cost, speed, and how noisy your crawler looks to the target site, especially when you're pulling full pages, images, scripts, and retries you didn't actually need.
Base64 is a way to encode binary data as plain text using a limited set of ASCII characters. You see it all over the web in things like image blobs, tokens, API payloads, and sometimes scraped responses where the useful data is wrapped in an extra decoding step.
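A quick round-trip with Python's standard library shows the idea:

```python
import base64

payload = b"raw bytes from a scraped response"
encoded = base64.b64encode(payload)  # ASCII-safe text, e.g. inside tokens or blobs
decoded = base64.b64decode(encoded)  # decodes back to the original bytes
assert decoded == payload
```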
A bearer token is an access token sent in the Authorization header to prove the client is allowed to use an API. Whoever has the token can use it, so in practice it works like a password for API requests and needs to be handled with the same care.
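Sending one is just a header. A minimal sketch, with a placeholder token and a hypothetical endpoint:

```python
import requests

TOKEN = "eyJhbGciOi..."  # placeholder; real tokens come from a login or OAuth flow
response = requests.get(
    "https://api.example.com/v1/me",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
```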
Behavioral detection is a class of anti-bot checks that looks at how a visitor behaves, not just what IP, headers, or fingerprint they show up with. It flags automation when timing, scrolling, clicks, navigation flow, and page interaction patterns look too clean, too fast, or too mechanically consistent to be a real user.
Blacklisting is when a site marks your IP, session, account, or request pattern as untrusted and starts blocking, throttling, or challenging you. In scraping, it usually happens because your traffic looks automated, too aggressive, or just too repetitive over time.
Bounded randomness means adding variation to scraper behavior inside sensible limits instead of using fixed, perfectly repeatable timing. In practice, that usually means random delays, dwell times, and request spacing that stay within a defined range, so you look less bot-like without turning the job into chaos.
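In code it can be as small as one line, with the bounds as your tuning knobs:

```python
import random
import time

# Pause somewhere between 2 and 5 seconds: varied, but inside sane limits.
time.sleep(random.uniform(2.0, 5.0))
```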
A browser context is an isolated browser session inside a single browser process. It gives you separate cookies, storage, and session state without paying the cost of launching a whole new browser every time, which matters a lot once you stop scraping one page at a time and start running real workloads.
A browser profile is the saved state a browser carries between sessions: cookies, local storage, cache, login state, preferences, and sometimes fingerprint-related settings. In scraping and automation, profiles matter when you need a session to keep behaving like the same user instead of starting from zero on every run.
A burst limit is the maximum number of requests you can send in a short spike before rate limiting kicks in. It matters because many systems allow brief bursts above the steady request rate, but they still block you if that spike is too large or happens too often.
A canonical URL is the preferred URL for a page when the same content is available at multiple URLs. It tells search engines which version should be treated as the main one, so ranking signals do not get split across duplicates, parameterized URLs, or near-identical versions.
Canvas fingerprinting is a browser fingerprinting technique where a site uses the HTML5 canvas API to draw hidden text or images, then hashes the rendered result to help identify a browser. It matters in scraping because those rendering differences can be used as a tracking signal, especially when your browser setup is inconsistent, headless, or obviously automated.
A CAPTCHA is a challenge a site shows to figure out whether the visitor is a human or an automated script. In scraping, it usually means the target thinks your traffic looks suspicious, often because of bad IPs, broken browser fingerprints, or bot-like request patterns.
CDATA is an XML section that tells the parser to treat the contents as raw text instead of markup. It’s mainly there so characters like < and & can appear without being escaped, which comes up a lot in RSS feeds, XML APIs, and embedded HTML or JavaScript.
A CDN, or content delivery network, is a distributed layer of servers that caches and serves website assets closer to the user. In scraping, CDNs matter because they change how content is delivered, cached, rate-limited, and blocked, especially when providers like Cloudflare or CloudFront sit in front of the origin.
CDP, short for Chrome DevTools Protocol, is the low-level protocol Chrome and other Chromium-based browsers expose for remote control, usually over a WebSocket connection. It lets you do the same kinds of things DevTools does: inspect pages, run JavaScript, intercept network traffic, read cookies, and capture screenshots. In scraping, people use it because it gives more direct browser control than higher-level automation libraries.
City-level routing means sending a scraping request through an IP located in a specific city instead of just picking a country or region. You use it when a site changes results, pricing, inventory, or anti-bot behavior based on the user’s apparent location, and country-level targeting is too blunt to be useful.
Concurrency is sending or processing multiple scraping tasks at the same time instead of waiting for each one to finish before starting the next. In scraping, that mostly means keeping many requests in flight at once so jobs finish faster, but pushing it too hard gets you rate-limited or blocked, or just creates a new reliability problem.
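A simple way to cap it is a worker pool. A sketch with Python's standard library and placeholder URLs:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

urls = [f"https://example.com/page/{i}" for i in range(1, 21)]  # placeholders

def fetch(url: str) -> int:
    return requests.get(url, timeout=10).status_code

# max_workers caps how many requests are in flight at once.
with ThreadPoolExecutor(max_workers=5) as pool:
    statuses = list(pool.map(fetch, urls))
```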
Content-Type is an HTTP header that tells you what kind of data is in the request or response body, like HTML, JSON, XML, or an image. In scraping, it matters because the body might not be what you expected, and treating JSON like HTML or a PDF like text is how parsers break in production.
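Checking the header before parsing is cheap insurance. A sketch with a placeholder URL:

```python
import requests

response = requests.get("https://example.com/data", timeout=10)
content_type = response.headers.get("Content-Type", "")

# Branch on what actually came back instead of assuming.
if "application/json" in content_type:
    payload = response.json()
elif "text/html" in content_type:
    html = response.text
else:
    raw = response.content  # bytes; could be a PDF, an image, anything
```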
Cookies are small pieces of data a website stores and sends back on later requests to keep track of sessions, logins, preferences, and basic state. In scraping, they matter because a lot of sites stop working the moment you ignore them, reuse them badly, or lose them between requests.
CORS, short for Cross-Origin Resource Sharing, is a browser security mechanism that controls whether JavaScript running on one origin can make requests to another. It matters a lot if you're scraping from frontend code, but it does not apply the same way to server-side scrapers, which is why people often hit it in the browser and then overcomplicate the fix.
Crawling is the process of discovering pages by starting from one or more URLs, fetching them, extracting links, and following those links across a site or across the web. It is about finding what exists and what changed; scraping is the separate step where you extract the data you actually care about.
CSS usually means Cascading Style Sheets, the language browsers use to control how HTML looks on the page. In scraping, though, people often mean CSS selectors: the pattern syntax used to find elements like buttons, links, product titles, or price blocks inside a document.
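For example, pulling a price out of markup with a CSS selector, assuming the beautifulsoup4 package is installed:

```python
from bs4 import BeautifulSoup

html = '<div class="product"><span class="price">$19.99</span></div>'
soup = BeautifulSoup(html, "html.parser")

# CSS selector: a .price span inside a .product div.
price = soup.select_one("div.product span.price")
print(price.text)  # $19.99
```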
A datacenter proxy is an IP address served from a cloud or hosting provider, not a real household or mobile network. They’re fast, cheap, and great for high-volume scraping on easier targets, but they also get flagged more often because sites know datacenter IP ranges and block them aggressively.
DNS, or the Domain Name System, translates domain names like example.com into IP addresses that machines can actually connect to. In scraping, it is one of those layers people forget about until requests start failing, resolving slowly, or hitting the wrong infrastructure after a target changes providers or protection.
Docker packages an application and its runtime into a container so it runs the same way on your laptop, CI, and production. For scraping, that matters because browsers, system libraries, fonts, and anti-bot workarounds tend to break in slightly different ways on every machine if you do not pin the environment.
DOM stands for Document Object Model. It’s the tree-like structure a browser builds from a page’s HTML, where elements, attributes, and text become nodes you can inspect, query, and manipulate with JavaScript or a scraper.
An ETag is an HTTP response header that identifies a specific version of a resource. Browsers, CDNs, and bots use it for conditional requests, so the server can return 304 Not Modified instead of sending the full response again when nothing changed.
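A conditional request looks like this, with a placeholder URL:

```python
import requests

url = "https://example.com/resource"
first = requests.get(url, timeout=10)
etag = first.headers.get("ETag")

if etag:
    # Ask the server for the body only if this version changed.
    second = requests.get(url, headers={"If-None-Match": etag}, timeout=10)
    if second.status_code == 304:
        print("Not modified; reuse the cached copy")
```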
ETL stands for extract, transform, load: pull data from a source, clean or reshape it, then write it somewhere useful like a database, warehouse, or queue. In scraping, this is usually the part after the request succeeds, where raw HTML or API responses get turned into structured data that downstream systems can actually use.
An event loop is the part of a runtime that keeps track of async work and decides what runs next without blocking the whole program. In scraping, it matters because network requests, browser automation, waits, retries, and timeouts all pile up fast, and if you misuse the loop you get slow crawlers, stuck tasks, or weird concurrency bugs.
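A toy example of the loop interleaving work, with asyncio.sleep standing in for real network I/O:

```python
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for an HTTP call or browser wait
    return url

async def main() -> None:
    urls = [f"https://example.com/{i}" for i in range(10)]
    # The event loop interleaves all ten tasks instead of running them serially.
    results = await asyncio.gather(*(fetch(u) for u in urls))
    print(len(results))

asyncio.run(main())
```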
Exponential backoff is a retry strategy where you wait longer after each failed request, typically doubling the delay each time. It helps scrapers recover from temporary failures like rate limits, timeouts, and overloaded targets without hammering the site and making the problem worse.
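A minimal retry loop with doubling delays might look like this; the retry statuses and limits are illustrative:

```python
import time
import requests

def get_with_backoff(url: str, retries: int = 5) -> requests.Response:
    delay = 1.0
    for attempt in range(retries):
        response = requests.get(url, timeout=10)
        if response.status_code not in (429, 503):
            return response
        time.sleep(delay)
        delay *= 2  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"still failing after {retries} attempts")
```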
Font fingerprinting is a browser identification technique that checks which fonts are available on a device and how text renders with them. Anti-bot systems use it as one signal in a larger fingerprint because the font set and rendering behavior often differ across operating systems, browsers, and automation setups.
A forward proxy is a server that sits between your app and the target website, sending requests on your behalf so the site sees the proxy IP instead of your origin IP. In scraping, this is the normal kind of proxy people mean: you route outbound traffic through it for IP rotation, geo-targeting, access control, or just to avoid getting blocked immediately.
Geolocation is the location context a request appears to come from, usually at the country, region, or city level. In scraping, it matters because many sites change content, pricing, availability, or blocking behavior based on where they think the visitor is coming from.
Geo-targeting means sending requests from a specific country, region, or city so a website returns the version of the page that real users in that location would see. In scraping, this matters because prices, availability, search results, and even entire pages often change by geography, and if you ignore that, your data is wrong before you even start cleaning it.
GraphQL is an API query language that lets a client ask for exactly the fields it wants instead of taking a fixed response shape from a REST endpoint. In scraping, it matters because many modern sites load data through GraphQL behind the frontend, which is often cleaner and more stable to work with than parsing constantly changing HTML.
Greylisting is when a site does not fully ban your IP, but quietly degrades or limits it because it looks suspicious. In scraping, this usually shows up as intermittent 403s, slower responses, CAPTCHA pages, empty results, or requests that work in a browser but fail from your scraper.
A headful browser is a browser running with a visible UI, like a normal desktop Chrome or Firefox session. In scraping, people use it when sites behave differently in headless mode, when debugging is easier with a real window, or when they need browser behavior that looks more like an actual user session.
A headless browser is a real browser running without a visible UI, usually controlled by code. In scraping, you use it when a site needs JavaScript execution, real rendering, or browser-like behavior that plain HTTP requests won’t handle reliably.
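A minimal sketch with Playwright, assuming it is installed along with a Chromium build:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # no visible window
    page = browser.new_page()
    page.goto("https://example.com")
    html = page.content()  # HTML after JavaScript has run
    browser.close()
```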
A honeypot is a trap used to catch bots or suspicious automation by exposing something a real user usually would not touch, like a hidden link, invisible form field, or fake endpoint. In scraping, hitting a honeypot is a fast way to get flagged because it tells the site you are parsing the page mechanically instead of behaving like a normal browser session.
HTTP, short for Hypertext Transfer Protocol, is the basic request-response protocol browsers, APIs, and scrapers use to talk to web servers. In practice, it’s the layer where you send a request like GET or POST and get back a response with status codes, headers, and a body, which is why scraping usually starts here before the real mess begins.
Hydration is the step where client-side JavaScript takes server-rendered HTML and turns it into a live app by attaching state, event handlers, and component logic. For scraping, it matters because a lot of modern sites ship useful data in the page before hydration finishes, and that data is often easier to extract than waiting for the fully rendered UI.
An IFrame is an HTML element that embeds one web page inside another. In scraping, this matters because the data you want often is not in the main page HTML at all, but loaded from the iframe's src as a separate document with its own requests, cookies, and sometimes its own anti-bot problems.
Infinite scroll is a page pattern where more content loads automatically as you scroll instead of exposing numbered pages or a visible Next button. For scrapers, that means the data is often fetched by JavaScript in batches, so grabbing the first HTML response is not enough.
IP reputation is the trust score websites implicitly assign to the IPs sending requests. In scraping, it decides whether your traffic gets clean responses, soft blocks, captchas, throttling, or silent junk even when your code is fine.
In scraping, ISP usually refers to an ISP proxy: an IP address announced by an internet service provider but hosted on fast server infrastructure. It sits in the middle between datacenter and residential proxies: cleaner reputation than datacenter IPs in many cases, and cheaper and more stable than true residential traffic.
JA3 and JA4 are TLS client fingerprints used to identify patterns in how a client starts HTTPS connections. In scraping, they matter because many bot defenses use them to spot traffic from default Python HTTP stacks, headless tooling, or other non-browser clients before your request even gets to the page.
Jitter is a small random delay added to retries, request timing, or backoff so your traffic does not line up in neat bursts. In scraping, it matters because synchronized retries are a great way to hit the same rate limit again, especially when you run many workers, sessions, or accounts at once.
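One common variant is full jitter: instead of sleeping the whole backoff delay, sleep a random slice of it. A sketch, with the request itself elided:

```python
import random
import time

delay = 1.0
for attempt in range(5):
    # ... make the request here and break on success ...
    # Full jitter: parallel workers retry at different moments, not in lockstep.
    time.sleep(random.uniform(0, delay))
    delay *= 2
```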
JSON stands for JavaScript Object Notation. It’s a plain text format for structured data, built from key-value pairs and arrays, and it’s what most scraping APIs return because machines can work with it without the usual HTML cleanup mess.
JSON-LD is structured data embedded in a page, usually inside a <script type="application/ld+json"> tag. For scraping, it matters because sites often put clean entity data there: product details, article metadata, breadcrumbs, ratings, offers, and other fields that are much easier to parse than the visible HTML.
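Extracting it is usually two steps: find the script tag, then parse its contents as JSON. A sketch assuming beautifulsoup4, with a tiny made-up product snippet:

```python
import json
from bs4 import BeautifulSoup

html = """
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"price": "19.99"}}
</script>
"""
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", type="application/ld+json")
data = json.loads(tag.string)
print(data["name"], data["offers"]["price"])  # Widget 19.99
```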
JSON Schema is a way to define the shape of JSON data: which fields exist, what types they are, and what counts as valid. In scraping, it gives you a contract for the output so you get structured data you can actually rely on instead of vaguely shaped JSON that breaks downstream.
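A small validation sketch using the jsonschema package, with a made-up product shape:

```python
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name", "price"],
}

try:
    validate(instance={"name": "Widget", "price": 19.99}, schema=schema)
except ValidationError as err:
    print("scraped record failed validation:", err.message)
```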
JWT stands for JSON Web Token: a compact token format used to send claims like user identity, expiration, and permissions between a client and a server. In scraping, you mostly run into JWTs when an API expects a Bearer token after login, especially on SPAs and mobile app backends.
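The payload is just base64url-encoded JSON, so you can inspect (not verify) it without any JWT library. The token below is a fabricated example:

```python
import base64
import json

token = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjMifQ.signature"  # made-up token
payload_segment = token.split(".")[1]
# JWT segments drop base64 padding; add it back before decoding.
padded = payload_segment + "=" * (-len(payload_segment) % 4)
claims = json.loads(base64.urlsafe_b64decode(padded))
print(claims)  # {'sub': '123'}
```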
OAuth 2.0 is an authorization framework that lets an app get limited access to a user’s account or data without handling the user’s password directly. In practice, it’s the thing behind "Sign in with Google" and a lot of API access flows, and it matters in scraping because authenticated sessions often depend on short-lived tokens, redirects, scopes, and refresh logic.
OCR, or optical character recognition, turns text inside images, screenshots, or PDFs into machine-readable text. In scraping, you use it when the data is visible on the page but not actually present in the HTML, which is common with scanned documents, captcha-like image text, and screenshot-based workflows.
OpenGraph is a set of HTML meta tags that tells platforms like Facebook, LinkedIn, Slack, and Discord how a page should look when someone shares its URL. It controls things like the title, description, image, and canonical URL used in link previews, which makes it a common target when scraping page metadata.
Proof of Work (PoW) is a system where a client has to spend some real compute effort before a request is accepted. On the web, it’s often used as bot friction: a browser or scraper must solve a small CPU or cryptographic challenge first, which is cheap for one human visit but expensive when you’re firing thousands of requests.
A proxy is an intermediary server that sends requests on your behalf, so the target site sees the proxy IP instead of yours. In scraping, proxies are mainly used to reduce IP-based blocking, spread traffic, and make requests appear from specific networks or countries, but proxies alone do not solve rendering, fingerprinting, or rate-limit problems.
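Routing requests through one is usually a config change, not a code change. A sketch with a placeholder proxy address:

```python
import requests

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",  # placeholder credentials
    "https": "http://user:pass@proxy.example.com:8080",
}
# The target sees the proxy's IP, not yours.
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
```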
RDFa is a way to embed structured data directly into HTML using attributes on existing elements. In scraping, it matters when a page exposes metadata like product details, authorship, or schema markup in the DOM instead of putting it in JSON-LD.
A residential proxy routes your requests through IP addresses assigned by consumer internet providers, so the traffic looks like it is coming from a normal home user instead of a data center. In scraping, people use them because they get blocked less often on sites that score IP reputation aggressively, but they cost more and add another thing that can fail.
REST, short for Representational State Transfer, is a common way to design web APIs around resources, HTTP methods, and standard status codes. In practice, it means you make predictable requests like GET, POST, PUT, and DELETE to URLs and get structured responses back, usually JSON.
Robots.txt is a text file on a website, usually served from /robots.txt, that tells crawlers which paths they are allowed or asked not to crawl. It is a crawler-facing policy file, not an enforcement mechanism, so decent bots read it and bad ones ignore it.
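Python ships a parser for it in the standard library:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetches and parses the file

# Check a path for a given user agent before crawling it.
print(parser.can_fetch("MyCrawler", "https://example.com/private/page"))
```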
Rotating proxies are proxy networks that change the IP address used for outgoing requests, either on every request or on a defined schedule. In scraping, they help reduce bans, rate limits, and captchas, but they do not magically fix bad request patterns, broken sessions, or sloppy scraper behavior.
RSS is an XML-based feed format websites use to publish new content in a structured, machine-readable way. For scraping, it’s the easy path when a site gives you one: fewer moving parts, less breakage, and no need to render pages just to detect what changed.
A sandbox is a website or environment built for testing scraping code without the usual production mess. It gives you stable pages, predictable structure, and explicit permission to scrape, which makes it useful for learning, debugging selectors, and checking whether your tooling works before you point it at real sites.
A scraping API is a web scraping service that routes each request through the scraper and proxy setup that fit the target, then returns one normalized response. It is built for production scraping, where the real problem is not fetching one page once, but keeping many targets working without filling your codebase with routing, rendering, retry, and provider-specific logic.
SDK stands for software development kit: a packaged set of code, helpers, and documentation that makes it easier to use an API or platform from your language of choice. In scraping, an SDK usually saves you from hand-rolling request signing, retries, headers, and response parsing every time.
SERP stands for search engine results page: the page a search engine returns after someone searches for a query. In scraping, it usually means collecting structured data from Google, Bing, or other search result pages without manually parsing a mess of ads, maps, snippets, and ranking changes every week.
A session is the state a site keeps across multiple requests so it can treat them as coming from the same user flow. In scraping, that usually means cookies, auth state, cart state, CSRF tokens, or other bits that need to persist, otherwise things work for one request and then quietly break on the next.
Shadow DOM is a browser feature that lets a component keep its HTML and CSS inside an isolated subtree, so normal selectors often can’t see or reach it. For scraping, that usually means the element exists on the page but your parser or selector still comes back empty unless you explicitly traverse into the shadow root.
Sitemap.xml is an XML file that lists URLs a site wants crawlers to find, often with metadata like last modified date, update frequency, or priority. For scraping, it is one of the simplest ways to discover pages at scale without clicking through navigation, category trees, or endless pagination.
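Pulling every URL out of one takes a few lines, since entries live in the standard sitemaps.org namespace:

```python
import requests
import xml.etree.ElementTree as ET

xml_bytes = requests.get("https://example.com/sitemap.xml", timeout=10).content  # placeholder URL
root = ET.fromstring(xml_bytes)

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
```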
SNI stands for Server Name Indication, a TLS extension that lets a client say which hostname it wants before the HTTPS connection is fully set up. In practice, that matters because many sites share the same IP, and if the SNI value is wrong, missing, or blocked, the request can fail before scraping even gets to HTTP.
A SOCKS5 proxy uses SOCKS5, a low-level proxy protocol that forwards network traffic without rewriting it the way HTTP proxies do. It works with more kinds of traffic, including HTTPS, WebSockets, and non-HTTP connections, which makes it useful in scraping setups where HTTP proxies start breaking in annoying ways.
SSR stands for server-side rendering: the server builds the HTML before sending it to the browser. For scraping, that matters because SSR pages often expose the data you need directly in the initial response, so you can skip the whole browser-rendering mess.
Tarpitting is a defensive trick where a server deliberately slows down or traps a client instead of blocking it cleanly. The point is to waste the scraper, spammer, or scanner’s time and resources, which matters because a slow failure can be more expensive than a fast one in production.
TCP, or Transmission Control Protocol, is the transport layer protocol that makes sure data arrives reliably and in order between a client and server. In scraping, it sits underneath HTTP and HTTPS, so when requests fail before you even get a response, the problem is often down at the TCP level, not in your parser or request code.
TLS, short for Transport Layer Security, is the protocol that secures HTTPS connections by encrypting traffic between a client and a server. In scraping, it matters for more than encryption: sites also look at how your client performs the TLS handshake, and that fingerprint can be enough to get you blocked even if your requests are otherwise correct.
A TLS handshake is the first part of an HTTPS connection where the client and server agree on how to talk securely and exchange the keys used for encryption. For scraping, it matters because sites can inspect handshake details before any HTTP request is processed, which means a bad or unusual client fingerprint can get you flagged early.
TTFB stands for Time to First Byte: the time between making a request and receiving the first byte of the response. It tells you how long the network, server, TLS handshake, redirects, and backend work took before anything actually started coming back, so when it is bad, everything after it starts late too.
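A rough way to see it from Python: with stream=True, requests returns once headers arrive, so elapsed approximates TTFB rather than full download time:

```python
import requests

response = requests.get("https://example.com", stream=True, timeout=10)
print(response.elapsed.total_seconds())  # roughly time until headers arrive
response.close()
```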
A virtual DOM is an in-memory representation of a page’s DOM that frameworks like React or Vue use to figure out what changed before updating the real browser DOM. It exists to make UI updates easier to manage, but for scraping the important part is simpler: sometimes the HTML you want only shows up after JavaScript builds it.
A VPN, or virtual private network, routes your traffic through another server and gives you a different outward-facing IP address. For scraping, that can help with basic location testing or low-volume requests, but it is not a real replacement for scraping proxies because you usually do not get reliable rotation, concurrency, or control.
A WAF, or Web Application Firewall, sits in front of a site and filters requests before they reach the application. In scraping, it is one of the main things blocking you in production: rate limits, CAPTCHAs, bot checks, fingerprinting, and silent 403s often come from the WAF, not the site itself.
WebDriver is the browser automation interface tools like Selenium use to control a real browser. In scraping, it matters when the page only renders data after JavaScript runs or when you need to click, scroll, type, or wait for elements like an actual user session.
WebGL is the browser API that lets websites render GPU-accelerated 2D and 3D graphics inside a page, usually through a canvas element, without plugins. For scraping, it matters less because of the graphics themselves and more because WebGL exposes hardware and rendering details that anti-bot systems use for fingerprinting.
WebRTC is a browser technology for real-time peer-to-peer communication, usually used for audio, video, and direct data transfer between clients. In scraping, it matters less because you need to scrape WebRTC itself and more because it can leak your real IP, bypass proxy assumptions, and make browser automation behave differently than a plain HTTP client.
Whitelisting means explicitly allowing a specific IP, API key, domain, or account to access something that would otherwise be blocked or rate-limited. In scraping, it usually comes up when a target, proxy provider, or internal system only accepts traffic from approved sources, which is fine until your IPs change and things quietly break.
XHR stands for XMLHttpRequest, the browser API used by JavaScript to make HTTP requests in the background without reloading the page. In scraping, people often say “watch the XHRs” because those background requests tend to return the actual structured data, which is a lot cleaner than fighting brittle HTML.
XPath is a query language for selecting nodes inside XML or HTML documents using path-like expressions. In scraping, it’s mainly used to target elements precisely without looping through the whole DOM by hand, though brittle XPath selectors can break fast when a site’s structure shifts.
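For example, the same kind of price lookup shown under the CSS entry, expressed as XPath with the lxml package:

```python
from lxml import html

doc = html.fromstring('<div class="product"><span class="price">$19.99</span></div>')
# Select the text of any span whose class attribute is "price".
prices = doc.xpath('//span[@class="price"]/text()')
print(prices)  # ['$19.99']
```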