A backconnect proxy is a proxy endpoint that sits in front of a larger IP pool and routes your requests through different exit IPs behind the same connection point. The point is not just rotation; it is that your application talks to one proxy address while the provider handles switching IPs underneath, which is useful when you need to spread requests and avoid getting blocked.
Blacklisting is when a site marks your IP, session, fingerprint, or account as untrusted and starts blocking, throttling, or challenging your requests. In scraping, this usually happens after repeated requests, bad proxy hygiene, or behavior that makes your traffic easy to detect. The problem is not just getting blocked once; it is keeping requests working without constantly rewriting your scraping setup.
Bounded randomness means adding variation to scraper behavior, but within sensible limits instead of using completely fixed timing or totally chaotic delays. In practice, it is how you avoid obviously bot-like patterns without making runs impossible to reason about, debug, or operate at scale.
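As a minimal sketch of the idea, the delay below is never fixed (so there is no obvious bot cadence) but always stays inside a known window (the `base` and `jitter` values here are illustrative, not recommendations):

```python
import random

def bounded_delay(base=2.0, jitter=0.5):
    """Return a delay near `base` seconds, varied within +/- `jitter`.

    Bounded randomness: no fixed cadence, but also nothing outside a
    predictable window, so runs stay debuggable and schedulable.
    """
    low, high = base - jitter, base + jitter
    return random.uniform(low, high)

# Every delay falls inside [1.5, 2.5] seconds, but almost never repeats.
delays = [bounded_delay() for _ in range(1000)]
```

Because the window is explicit, you can still reason about throughput: 1,000 requests at roughly 2 seconds each is roughly 33 minutes, give or take the jitter.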
Canvas fingerprinting is a browser identification technique that uses the HTML5 canvas API to draw hidden text or images, then reads back the rendered result to help identify a device or browser. In scraping, it matters because anti-bot systems use it as one signal to tell apart real browsers, headless browsers, and poorly configured automation.
A CAPTCHA is a challenge a site shows to decide whether the visitor is probably human or automated, usually after it sees behavior it does not like. In scraping, it is less a standalone problem than a signal: your request path, IP quality, browser fingerprint, or request pattern is getting flagged.
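Treating the CAPTCHA as a signal usually starts with detecting it in the response. A crude heuristic sketch (the marker strings below are common widget names, but real detection depends on the target site):

```python
# Hypothetical marker list; tune it per target.
CAPTCHA_MARKERS = (
    "g-recaptcha",    # Google reCAPTCHA widget class
    "h-captcha",      # hCaptcha widget class
    "cf-challenge",   # Cloudflare challenge page marker
)

def looks_like_captcha(html: str) -> bool:
    """Flag a response if any known challenge marker appears in the body."""
    body = html.lower()
    return any(marker in body for marker in CAPTCHA_MARKERS)

page = '<div class="g-recaptcha" data-sitekey="..."></div>'
flagged = looks_like_captcha(page)  # True: rotate IP or session, do not retry blindly
```

The useful part is what you do with the flag: retrying the same request from the same IP usually just reinforces whatever got you challenged in the first place.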
CDP is the protocol Chrome and other Chromium-based browsers expose for remote control and inspection, usually over a WebSocket connection. In scraping, it gives you low-level access to things like page navigation, JavaScript execution, network events, cookies, and screenshots without going through a higher-level automation library.
City-level routing means sending a scraping request through an IP located in a specific city, not just a country or region. The point is to see what the target site shows to users in that local market, which matters for things like localized pricing, availability, maps, ads, and search results.
CORS, or Cross-Origin Resource Sharing, is a browser security mechanism that controls whether JavaScript running on one origin can make requests to another origin. The important part for scraping is that this is mostly a browser problem, not a server-to-server problem, which is why a request blocked in frontend JavaScript may work fine from a backend scraper.
CSS means Cascading Style Sheets, the language browsers use to control how HTML looks on the page. In scraping, people also say “CSS” as shorthand for CSS selectors, which are patterns used to find elements in the DOM without writing XPath.
A datacenter proxy routes your requests through IPs that come from cloud or hosting providers, not real household devices. They are usually faster, cheaper, and easier to scale than residential proxies, but they are also easier for targets to detect and block.
DOM stands for Document Object Model. It is the tree-like structure a browser builds from an HTML or XML document, where each element, attribute, and text node can be inspected, selected, or changed. In scraping, this is usually what you query when you use CSS selectors or XPath.
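A small stdlib sketch of querying that tree (this uses `xml.etree`'s limited XPath subset on a well-formed snippet; real HTML usually needs a forgiving parser like lxml or BeautifulSoup):

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet standing in for a fetched page.
html = """
<html>
  <body>
    <div class="product"><span class="price">19.99</span></div>
    <div class="product"><span class="price">24.50</span></div>
  </body>
</html>
"""

root = ET.fromstring(html)
# Find every span whose class attribute is "price", anywhere in the tree.
prices = [span.text for span in root.findall(".//span[@class='price']")]
# prices == ["19.99", "24.50"]
```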
Dwell time usually means how long a user stays on a page after clicking a search result before going back to the search results or leaving. In scraping and analytics, it is basically a time-on-page signal, but the exact definition depends on how you measure the session and what counts as an exit or return.
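The measurement itself is just a subtraction; the definitional work is in deciding which events count. A toy sketch, assuming you already have timestamped click and return events:

```python
from datetime import datetime

def dwell_seconds(clicked_at: str, returned_at: str) -> float:
    """Dwell time as the gap between clicking a result and returning.

    What counts as a "return" (back button, new search, tab close)
    is a measurement choice; this function only does the arithmetic.
    """
    fmt = "%Y-%m-%dT%H:%M:%S"
    start = datetime.strptime(clicked_at, fmt)
    end = datetime.strptime(returned_at, fmt)
    return (end - start).total_seconds()

dwell = dwell_seconds("2024-01-01T12:00:00", "2024-01-01T12:01:30")  # 90.0
```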
Geo-targeting means making a request appear to come from a specific country, region, or city so the target site returns the local version of the page. In scraping, this matters because pricing, search results, inventory, consent flows, and even whether the page is accessible can change based on location.
Greylisting IP means a site does not fully block your IP, but it treats it as suspicious and degrades access. In practice, that usually means more CAPTCHAs, slower responses, intermittent 403s, empty pages, or requests that work sometimes and fail under higher volume.
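Because greylisting fails intermittently rather than hard, a short backoff sometimes recovers requests that a naive client would give up on. A sketch of that pattern (the delays and attempt count are illustrative):

```python
import time

def fetch_with_backoff(fetch, max_attempts=4, base_delay=0.1):
    """Retry a flaky fetch with exponential backoff.

    `fetch` is any callable returning (status_code, body). A 403 or an
    empty body both count as failure, since greylisted IPs produce both.
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200 and body:
            return body
        time.sleep(base_delay * (2 ** attempt))  # back off a bit longer each round
    raise RuntimeError("still failing after backoff; rotate IP or session")

# Simulate a greylisted target: a 403, then an empty 200, then a real page.
responses = iter([(403, ""), (200, ""), (200, "<html>ok</html>")])
body = fetch_with_backoff(lambda: next(responses))
```

If backoff alone does not recover, that is usually the signal to rotate the IP rather than keep hammering the same one.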
A headful browser is a browser running with a visible user interface, like a normal desktop Chrome or Firefox session. In scraping, teams use headful mode when a target behaves differently under automation, needs full interaction, or is harder to get through in headless mode. It usually costs more CPU, memory, and time, so it is not something you want to use everywhere by default.
A headless browser is a real browser running without a visible UI, usually controlled by code through tools like Playwright, Puppeteer, or Selenium. In scraping, you use it when a plain HTTP request is not enough because the page depends on JavaScript, browser APIs, or client-side rendering.
A honeypot is a trap a website sets to catch bots or scrapers, usually by adding links, form fields, or elements that normal users never interact with. If your scraper clicks, submits, or follows them, you make yourself easy to detect and often end up blocked.
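The defensive move is to skip anything a real user could not see. A crude regex sketch that flags links inside `display: none` containers (a real crawler would inspect computed styles, not inline attributes):

```python
import re

def find_honeypot_links(html: str) -> list:
    """Flag links inside elements hidden via an inline display:none style.

    Heuristic only: honeypots can also use zero-size elements,
    off-screen positioning, or hidden form fields.
    """
    pattern = re.compile(
        r'<[^>]*style="[^"]*display:\s*none[^"]*"[^>]*>.*?href="([^"]+)"',
        re.DOTALL,
    )
    return pattern.findall(html)

page = (
    '<a href="/products">Products</a>'
    '<div style="display: none"><a href="/trap">free stuff</a></div>'
)
traps = find_honeypot_links(page)  # ["/trap"]  -- never follow these
```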
HTTP, or Hypertext Transfer Protocol, is the basic request-response protocol browsers, APIs, and scrapers use to talk to web servers. In practice, it is the layer where you send a request like GET or POST, get back a response with headers, status codes, and a body, and then find out whether the target gives you the page, blocks you, redirects you, or rate limits you.
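A useful habit is inspecting what a request will look like before sending it. A stdlib sketch (the URL and User-Agent string here are placeholders):

```python
from urllib.request import Request

# Build a GET request without sending it, so we can inspect exactly
# what method, host, and headers the target server would receive.
req = Request(
    "https://example.com/page",                # placeholder target URL
    headers={"User-Agent": "my-scraper/1.0"},  # hypothetical UA string
)

method = req.get_method()              # "GET" -- no body data means GET
host = req.host                        # "example.com"
ua = req.get_header("User-agent")      # urllib capitalizes header keys
```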
An iframe is an HTML element that loads another page inside the current page. For scraping, the main issue is that the data you want may not be in the parent HTML at all, but inside the iframe’s separate document, which often means a different request path, different DOM, and sometimes cross-origin restrictions.
In scraping, ISP usually refers to an ISP proxy: an IP address announced through an internet service provider but often hosted on dedicated infrastructure. In practice, teams use them when datacenter IPs get blocked too easily but residential traffic is too expensive or slow.
Proof of Work (PoW) is a gate a server puts in front of a request, forcing the client to spend some CPU time solving a small challenge before it gets through. The point is not strong authentication. The point is making large-scale automated traffic more expensive, so cheap scraping and abuse stop being cheap.
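The classic shape of such a challenge is hashcash-style: find a nonce whose hash meets a difficulty target. A minimal sketch (the challenge string and difficulty are made up; real schemes vary):

```python
import hashlib

def solve_pow(challenge: str, difficulty: int = 3) -> int:
    """Find a nonce where SHA-256(challenge + nonce) starts with
    `difficulty` hex zeros.

    Cheap for one request; at millions of requests, the CPU bill is
    the point.
    """
    nonce = 0
    target = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

nonce = solve_pow("server-issued-challenge")
proof = hashlib.sha256(f"server-issued-challenge{nonce}".encode()).hexdigest()
# `proof` starts with "000"; the server verifies this in one hash.
```

Note the asymmetry: the client does thousands of hashes, the server verifies with one. That asymmetry is what makes PoW practical as a gate.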
A proxy is an intermediate server that sends a request on your behalf, so the target site sees the proxy IP instead of yours. In scraping, proxies are mainly used to spread requests across different IPs, reduce blocking, and reach sites from specific geographies or network types.
A residential proxy sends your requests through real consumer IP addresses assigned by internet service providers, which makes the traffic look more like normal user traffic than datacenter IPs. They are usually better for targets with stricter bot detection, but they cost more and add more variability in speed, quality, and session stability.
REST, short for Representational State Transfer, is a common way to design web APIs around resources, standard HTTP methods, and predictable URLs. In practice, it usually means you make normal HTTP requests like GET or POST and get structured responses back, often JSON.
Rotating proxies are proxy networks that change the IP address used for outgoing requests over time, per request, or by session. The point is not just hiding one IP — it is spreading traffic so you hit fewer rate limits, bans, and detection rules when scraping at scale.
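The simplest rotation policy is round-robin over a pool. A sketch with hypothetical proxy endpoints (a real pool comes from your provider, and real rotation also skips dead or burned IPs):

```python
from itertools import cycle

# Hypothetical proxy endpoints for illustration.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

rotation = cycle(PROXIES)

def next_proxy() -> str:
    """Per-request rotation: each call hands back the next exit IP."""
    return next(rotation)

used = [next_proxy() for _ in range(5)]
# Wraps around the pool: .1, .2, .3, then .1, .2 again.
```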
A scraping API is a web scraping service that routes each request through the scraper and proxy setup that fits the target, then returns one normalized response. It is built for production scraping, where the real problem is not fetching one page once, but keeping many targets working without filling your codebase with routing, rendering, retry, and provider-specific logic.
An SDK, or software development kit, is a packaged set of tools, code, and documentation that makes it easier to integrate with a service. In scraping, an SDK usually wraps the raw API so you do not have to hand-build every request, auth header, retry, or response parser yourself.
Shadow DOM is a browser feature that lets a component keep part of its HTML, CSS, and JavaScript isolated from the main page DOM. For scraping, the problem is simple: elements inside a shadow root usually do not show up in normal selectors, so code that works on regular pages can fail even though the content is visibly there.
SNI, short for Server Name Indication, is a TLS extension where the client tells the server which hostname it wants during the TLS handshake. In practice, this matters because many sites share one IP across multiple domains, and if SNI is missing or wrong, you often get the wrong certificate, handshake failures, or a request that looks suspicious.
Tarpitting is a defensive technique where a server intentionally slows down, stalls, or misleads a client instead of blocking it outright. In scraping, it is used to waste bot time, tie up connections, increase costs, or feed low-value responses, which means a request can look "successful" while still being a failure in practice.
TCP, short for Transmission Control Protocol, is the transport protocol that makes sure data sent between a client and server arrives reliably and in order. In web scraping, it sits underneath HTTP and HTTPS, so every request starts with a TCP connection before any page data is transferred.
TLS, short for Transport Layer Security, is the protocol that secures HTTPS connections by encrypting traffic between a client and a server. In scraping, TLS is not just about encryption anymore; how a client negotiates TLS can also affect whether a request looks like a real browser or gets flagged as automation.
A TLS handshake is the first part of an HTTPS connection, where the client and server agree on how to encrypt traffic and establish session keys. In scraping, it matters because sites can inspect handshake details before any HTTP request is processed, which means your client can look suspicious even if your headers and cookies look fine.
WAF stands for Web Application Firewall. In scraping, it usually means the layer sitting in front of a site that inspects requests and blocks traffic that looks automated, abusive, or just unusual. The annoying part is that a request can look fine at the HTTP level and still get challenged or blocked by the WAF.
WebGL is a browser API for rendering 2D and 3D graphics through the GPU, usually inside an HTML canvas, without a plugin. For scraping, the important part is not the graphics API itself but the fact that WebGL support and GPU behavior are often used in bot detection and browser fingerprinting.
WebRTC is a browser technology for real-time peer-to-peer communication, usually used for audio, video, and direct data transfer between clients. For scraping, the part that matters is less the media stack and more the fact that WebRTC can expose network details like local or public IPs, which can break anonymity if you are using proxies or browser automation.
Whitelisting means explicitly allowing a known IP, range, API key, or client to access a system that would otherwise block or challenge it. In scraping, this usually comes up when a target, proxy provider, or internal API only accepts traffic from approved sources.
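An allowlist check is usually a membership test against known IPs or CIDR ranges. A stdlib sketch (the addresses below are documentation-reserved examples, standing in for your real allowlist):

```python
import ipaddress

# Hypothetical allowlist: one single IP plus one CIDR range.
ALLOWED = [
    ipaddress.ip_network("203.0.113.7/32"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_whitelisted(client_ip: str) -> bool:
    """True if the client IP falls inside any allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED)

ok = is_whitelisted("198.51.100.42")   # True: inside the /24
blocked = is_whitelisted("192.0.2.1")  # False: not on the list
```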