100 terms
An API, or application programming interface, is a defined way for one piece of software to talk to another. In practice, it usually means you send a request to a service and get structured data back, instead of poking at a website or system by hand.
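In practice that looks like one HTTP call and a structured response. A minimal Python sketch, with a made-up endpoint standing in for a real API:

```python
import requests

# Hypothetical endpoint for illustration; any JSON API works the same way.
response = requests.get(
    "https://api.example.com/v1/products",
    params={"page": 1},
    timeout=10,
)
response.raise_for_status()  # surface 4xx/5xx instead of parsing junk
data = response.json()       # structured data, no HTML parsing needed
```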
ASN stands for Autonomous System Number: an identifier for a network operated by an ISP, cloud provider, or large organization on the internet. In scraping, ASN matters because many sites score traffic by network identity, not just by IP, so requests from a known data center ASN often get challenged faster than traffic from residential or ISP ASNs.
A backconnect proxy is a proxy endpoint that stays the same on your side while the provider rotates the exit IP behind it. In practice, it is a convenience layer for large proxy pools, usually used in scraping when you need rotation without constantly updating proxy lists yourself.
Backoff is the practice of waiting before retrying a request after a failure, block, or rate limit. In scraping, it helps you avoid hammering a site when it is already telling you to slow down, which improves stability and lowers the chance of getting banned.
Bandwidth is the amount of data your scraper sends and receives over the network. In scraping, it directly affects cost, speed, and how noisy your crawler looks to the target site, especially when you're pulling full pages, images, scripts, and retries you didn't actually need.
Base64 is a way to encode binary data as plain text using a limited set of ASCII characters. You see it all over the web in things like image blobs, tokens, API payloads, and sometimes scraped responses where the useful data is wrapped in an extra decoding step.
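A quick round-trip with Python's standard library shows the idea:

```python
import base64

payload = b"raw bytes from a scraped response"
encoded = base64.b64encode(payload)  # ASCII-safe text, e.g. inside tokens or blobs
decoded = base64.b64decode(encoded)  # decodes back to the original bytes
assert decoded == payload
```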
A bearer token is an access token sent in the Authorization header to prove the client is allowed to use an API. Whoever has the token can use it, so in practice it works like a password for API requests and needs to be handled with the same care.
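Sending one is just a header. A minimal sketch, with a placeholder token and a hypothetical endpoint:

```python
import requests

TOKEN = "eyJhbGciOi..."  # placeholder; real tokens come from a login or OAuth flow
response = requests.get(
    "https://api.example.com/v1/me",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
```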
Behavioral detection is a class of anti-bot checks that looks at how a visitor behaves, not just what IP, headers, or fingerprint they show up with. It flags automation when timing, scrolling, clicks, navigation flow, and page interaction patterns look too clean, too fast, or too mechanically consistent to be a real user.
Blacklisting is when a site marks your IP, session, account, or request pattern as untrusted and starts blocking, throttling, or challenging you. In scraping, it usually happens because your traffic looks automated, too aggressive, or just too repetitive over time.
Bounded randomness means adding variation to scraper behavior inside sensible limits instead of using fixed, perfectly repeatable timing. In practice, that usually means random delays, dwell times, and request spacing that stay within a defined range, so you look less bot-like without turning the job into chaos.
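In code it can be as small as one line, with the bounds as your tuning knobs:

```python
import random
import time

# Pause somewhere between 2 and 5 seconds: varied, but inside sane limits.
time.sleep(random.uniform(2.0, 5.0))
```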
A browser context is an isolated browser session inside a single browser process. It gives you separate cookies, storage, and session state without paying the cost of launching a whole new browser every time, which matters a lot once you stop scraping one page at a time and start running real workloads.
A browser profile is the saved state a browser carries between sessions: cookies, local storage, cache, login state, preferences, and sometimes fingerprint-related settings. In scraping and automation, profiles matter when you need a session to keep behaving like the same user instead of starting from zero on every run.
A burst limit is the maximum number of requests you can send in a short spike before rate limiting kicks in. It matters because many systems allow brief bursts above the steady request rate, but they still block you if that spike is too large or happens too often.
A canonical URL is the preferred URL for a page when the same content is available at multiple URLs. It tells search engines which version should be treated as the main one, so ranking signals do not get split across duplicates, parameterized URLs, or near-identical versions.
Canvas fingerprinting is a browser fingerprinting technique where a site uses the HTML5 canvas API to draw hidden text or images, then hashes the rendered result to help identify a browser. It matters in scraping because those rendering differences can be used as a tracking signal, especially when your browser setup is inconsistent, headless, or obviously automated.
A CAPTCHA is a challenge a site shows to figure out whether the visitor is a human or an automated script. In scraping, it usually means the target thinks your traffic looks suspicious, often because of bad IPs, broken browser fingerprints, or bot-like request patterns.
CDATA is an XML section that tells the parser to treat the contents as raw text instead of markup. It’s mainly there so characters like < and & can appear without being escaped, which comes up a lot in RSS feeds, XML APIs, and embedded HTML or JavaScript.
A CDN, or content delivery network, is a distributed layer of servers that caches and serves website assets closer to the user. In scraping, CDNs matter because they change how content is delivered, cached, rate-limited, and blocked, especially when providers like Cloudflare or CloudFront sit in front of the origin.
CDP, short for Chrome DevTools Protocol, is the low-level protocol Chrome and other Chromium-based browsers expose for remote control, usually over a WebSocket connection. It lets you do the same kinds of things DevTools does: inspect pages, run JavaScript, intercept network traffic, read cookies, and capture screenshots. In scraping, people use it because it gives more direct browser control than higher-level automation libraries.
City-level routing means sending a scraping request through an IP located in a specific city instead of just picking a country or region. You use it when a site changes results, pricing, inventory, or anti-bot behavior based on the user’s apparent location, and country-level targeting is too blunt to be useful.
Concurrency is sending or processing multiple scraping tasks at the same time instead of waiting for each one to finish before starting the next. In scraping, that mostly means keeping many requests in flight at once so jobs finish faster, but pushing it too hard gets you rate-limited or blocked, or just creates a new reliability problem.
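A simple way to cap it is a worker pool. A sketch with Python's standard library and placeholder URLs:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

urls = [f"https://example.com/page/{i}" for i in range(1, 21)]  # placeholders

def fetch(url: str) -> int:
    return requests.get(url, timeout=10).status_code

# max_workers caps how many requests are in flight at once.
with ThreadPoolExecutor(max_workers=5) as pool:
    statuses = list(pool.map(fetch, urls))
```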
Content-Type is an HTTP header that tells you what kind of data is in the request or response body, like HTML, JSON, XML, or an image. In scraping, it matters because the body might not be what you expected, and treating JSON like HTML or a PDF like text is how parsers break in production.
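Checking the header before parsing is cheap insurance. A sketch with a placeholder URL:

```python
import requests

response = requests.get("https://example.com/data", timeout=10)
content_type = response.headers.get("Content-Type", "")

# Branch on what actually came back instead of assuming.
if "application/json" in content_type:
    payload = response.json()
elif "text/html" in content_type:
    html = response.text
else:
    raw = response.content  # bytes; could be a PDF, an image, anything
```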
Cookies are small pieces of data a website stores and sends back on later requests to keep track of sessions, logins, preferences, and basic state. In scraping, they matter because a lot of sites stop working the moment you ignore them, reuse them badly, or lose them between requests.
CORS, short for Cross-Origin Resource Sharing, is a browser security mechanism that controls whether JavaScript running on one origin can make requests to another. It matters a lot if you're scraping from frontend code, but it does not apply the same way to server-side scrapers, which is why people often hit it in the browser and then overcomplicate the fix.
Crawling is the process of discovering pages by starting from one or more URLs, fetching them, extracting links, and following those links across a site or across the web. It is about finding what exists and what changed; scraping is the separate step where you extract the data you actually care about.
CSS usually means Cascading Style Sheets, the language browsers use to control how HTML looks on the page. In scraping, though, people often mean CSS selectors: the pattern syntax used to find elements like buttons, links, product titles, or price blocks inside a document.
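For example, pulling a price out of markup with a CSS selector, assuming the beautifulsoup4 package is installed:

```python
from bs4 import BeautifulSoup

html = '<div class="product"><span class="price">$19.99</span></div>'
soup = BeautifulSoup(html, "html.parser")

# CSS selector: a .price span inside a .product div.
price = soup.select_one("div.product span.price")
print(price.text)  # $19.99
```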
A datacenter proxy is an IP address served from a cloud or hosting provider, not a real household or mobile network. They’re fast, cheap, and great for high-volume scraping on easier targets, but they also get flagged more often because sites know datacenter IP ranges and block them aggressively.
DNS, or the Domain Name System, translates domain names like example.com into IP addresses that machines can actually connect to. In scraping, it is one of those layers people forget about until requests start failing, resolving slowly, or hitting the wrong infrastructure after a target changes providers or protection.
Docker packages an application and its runtime into a container so it runs the same way on your laptop, CI, and production. For scraping, that matters because browsers, system libraries, fonts, and anti-bot workarounds tend to break in slightly different ways on every machine if you do not pin the environment.
DOM stands for Document Object Model. It’s the tree-like structure a browser builds from a page’s HTML, where elements, attributes, and text become nodes you can inspect, query, and manipulate with JavaScript or a scraper.
An ETag is an HTTP response header that identifies a specific version of a resource. Browsers, CDNs, and bots use it for conditional requests, so the server can return 304 Not Modified instead of sending the full response again when nothing changed.
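A conditional request looks like this, with a placeholder URL:

```python
import requests

url = "https://example.com/resource"
first = requests.get(url, timeout=10)
etag = first.headers.get("ETag")

if etag:
    # Ask the server for the body only if this version changed.
    second = requests.get(url, headers={"If-None-Match": etag}, timeout=10)
    if second.status_code == 304:
        print("Not modified; reuse the cached copy")
```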
ETL stands for extract, transform, load: pull data from a source, clean or reshape it, then write it somewhere useful like a database, warehouse, or queue. In scraping, this is usually the part after the request succeeds, where raw HTML or API responses get turned into structured data that downstream systems can actually use.
An event loop is the part of a runtime that keeps track of async work and decides what runs next without blocking the whole program. In scraping, it matters because network requests, browser automation, waits, retries, and timeouts all pile up fast, and if you misuse the loop you get slow crawlers, stuck tasks, or weird concurrency bugs.
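A toy example of the loop interleaving work, with asyncio.sleep standing in for real network I/O:

```python
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for an HTTP call or browser wait
    return url

async def main() -> None:
    urls = [f"https://example.com/{i}" for i in range(10)]
    # The event loop interleaves all ten tasks instead of running them serially.
    results = await asyncio.gather(*(fetch(u) for u in urls))
    print(len(results))

asyncio.run(main())
```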
Exponential backoff is a retry strategy where you wait longer after each failed request, typically doubling the delay each time. It helps scrapers recover from temporary failures like rate limits, timeouts, and overloaded targets without hammering the site and making the problem worse.
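A minimal retry loop with doubling delays might look like this; the retry statuses and limits are illustrative:

```python
import time
import requests

def get_with_backoff(url: str, retries: int = 5) -> requests.Response:
    delay = 1.0
    for attempt in range(retries):
        response = requests.get(url, timeout=10)
        if response.status_code not in (429, 503):
            return response
        time.sleep(delay)
        delay *= 2  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"still failing after {retries} attempts")
```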
Font fingerprinting is a browser identification technique that checks which fonts are available on a device and how text renders with them. Anti-bot systems use it as one signal in a larger fingerprint because the font set and rendering behavior often differ across operating systems, browsers, and automation setups.
A forward proxy is a server that sits between your app and the target website, sending requests on your behalf so the site sees the proxy IP instead of your origin IP. In scraping, this is the normal kind of proxy people mean: you route outbound traffic through it for IP rotation, geo-targeting, access control, or just to avoid getting blocked immediately.
Geolocation is the location context a request appears to come from, usually at the country, region, or city level. In scraping, it matters because many sites change content, pricing, availability, or blocking behavior based on where they think the visitor is coming from.
Geo-targeting means sending requests from a specific country, region, or city so a website returns the version of the page that real users in that location would see. In scraping, this matters because prices, availability, search results, and even entire pages often change by geography, and if you ignore that, your data is wrong before you even start cleaning it.
GraphQL is an API query language that lets a client ask for exactly the fields it wants instead of taking a fixed response shape from a REST endpoint. In scraping, it matters because many modern sites load data through GraphQL behind the frontend, which is often cleaner and more stable to work with than parsing constantly changing HTML.
Greylisting is when a site does not fully ban your IP, but quietly degrades or limits it because it looks suspicious. In scraping, this usually shows up as intermittent 403s, slower responses, CAPTCHA pages, empty results, or requests that work in a browser but fail from your scraper.
A headful browser is a browser running with a visible UI, like a normal desktop Chrome or Firefox session. In scraping, people use it when sites behave differently in headless mode, when debugging is easier with a real window, or when they need browser behavior that looks more like an actual user session.
A headless browser is a real browser running without a visible UI, usually controlled by code. In scraping, you use it when a site needs JavaScript execution, real rendering, or browser-like behavior that plain HTTP requests won’t handle reliably.
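A minimal sketch with Playwright, assuming it is installed along with a Chromium build:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # no visible window
    page = browser.new_page()
    page.goto("https://example.com")
    html = page.content()  # HTML after JavaScript has run
    browser.close()
```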
A honeypot is a trap used to catch bots or suspicious automation by exposing something a real user usually would not touch, like a hidden link, invisible form field, or fake endpoint. In scraping, hitting a honeypot is a fast way to get flagged because it tells the site you are parsing the page mechanically instead of behaving like a normal browser session.
HTTP, short for Hypertext Transfer Protocol, is the basic request-response protocol browsers, APIs, and scrapers use to talk to web servers. In practice, it’s the layer where you send a request like GET or POST and get back a response with status codes, headers, and a body, which is why scraping usually starts here before the real mess begins.
Hydration is the step where client-side JavaScript takes server-rendered HTML and turns it into a live app by attaching state, event handlers, and component logic. For scraping, it matters because a lot of modern sites ship useful data in the page before hydration finishes, and that data is often easier to extract than waiting for the fully rendered UI.
An IFrame is an HTML element that embeds one web page inside another. In scraping, this matters because the data you want often is not in the main page HTML at all, but loaded from the iframe's src as a separate document with its own requests, cookies, and sometimes its own anti-bot problems.
Infinite scroll is a page pattern where more content loads automatically as you scroll instead of exposing numbered pages or a visible Next button. For scrapers, that means the data is often fetched by JavaScript in batches, so grabbing the first HTML response is not enough.
IP reputation is the trust score websites implicitly assign to the IPs sending requests. In scraping, it decides whether your traffic gets clean responses, soft blocks, captchas, throttling, or silent junk even when your code is fine.
In scraping, ISP usually refers to an ISP proxy: an IP address announced by an internet service provider but hosted on fast server infrastructure. It sits in the middle between datacenter and residential proxies: cleaner reputation than datacenter IPs in many cases, and cheaper and more stable than true residential traffic.
JA3 and JA4 are TLS client fingerprints used to identify patterns in how a client starts HTTPS connections. In scraping, they matter because many bot defenses use them to spot traffic from default Python HTTP stacks, headless tooling, or other non-browser clients before your request even gets to the page.
Jitter is a small random delay added to retries, request timing, or backoff so your traffic does not line up in neat bursts. In scraping, it matters because synchronized retries are a great way to hit the same rate limit again, especially when you run many workers, sessions, or accounts at once.
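One common variant is full jitter: instead of sleeping the whole backoff delay, sleep a random slice of it. A sketch, with the request itself elided:

```python
import random
import time

delay = 1.0
for attempt in range(5):
    # ... make the request here and break on success ...
    # Full jitter: parallel workers retry at different moments, not in lockstep.
    time.sleep(random.uniform(0, delay))
    delay *= 2
```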
JSON stands for JavaScript Object Notation. It’s a plain text format for structured data, built from key-value pairs and arrays, and it’s what most scraping APIs return because machines can work with it without the usual HTML cleanup mess.
JSON-LD is structured data embedded in a page, usually inside a <script type="application/ld+json"> tag. For scraping, it matters because sites often put clean entity data there: product details, article metadata, breadcrumbs, ratings, offers, and other fields that are much easier to parse than the visible HTML.
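Extracting it is usually two steps: find the script tag, then parse its contents as JSON. A sketch assuming beautifulsoup4, with a tiny made-up product snippet:

```python
import json
from bs4 import BeautifulSoup

html = """
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"price": "19.99"}}
</script>
"""
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", type="application/ld+json")
data = json.loads(tag.string)
print(data["name"], data["offers"]["price"])  # Widget 19.99
```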
JSON Schema is a way to define the shape of JSON data: which fields exist, what types they are, and what counts as valid. In scraping, it gives you a contract for the output so you get structured data you can actually rely on instead of vaguely shaped JSON that breaks downstream.
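A small validation sketch using the jsonschema package, with a made-up product shape:

```python
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name", "price"],
}

try:
    validate(instance={"name": "Widget", "price": 19.99}, schema=schema)
except ValidationError as err:
    print("scraped record failed validation:", err.message)
```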
JWT stands for JSON Web Token: a compact token format used to send claims like user identity, expiration, and permissions between a client and a server. In scraping, you mostly run into JWTs when an API expects a Bearer token after login, especially on SPAs and mobile app backends.
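The payload is just base64url-encoded JSON, so you can inspect (not verify) it without any JWT library. The token below is a fabricated example:

```python
import base64
import json

token = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjMifQ.signature"  # made-up token
payload_segment = token.split(".")[1]
# JWT segments drop base64 padding; add it back before decoding.
padded = payload_segment + "=" * (-len(payload_segment) % 4)
claims = json.loads(base64.urlsafe_b64decode(padded))
print(claims)  # {'sub': '123'}
```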
OAuth 2.0 is an authorization framework that lets an app get limited access to a user’s account or data without handling the user’s password directly. In practice, it’s the thing behind "Sign in with Google" and a lot of API access flows, and it matters in scraping because authenticated sessions often depend on short-lived tokens, redirects, scopes, and refresh logic.
OCR, or optical character recognition, turns text inside images, screenshots, or PDFs into machine-readable text. In scraping, you use it when the data is visible on the page but not actually present in the HTML, which is common with scanned documents, captcha-like image text, and screenshot-based workflows.
OpenGraph is a set of HTML meta tags that tells platforms like Facebook, LinkedIn, Slack, and Discord how a page should look when someone shares its URL. It controls things like the title, description, image, and canonical URL used in link previews, which makes it a common target when scraping page metadata.
Proof of Work (PoW) is a system where a client has to spend some real compute effort before a request is accepted. On the web, it’s often used as bot friction: a browser or scraper must solve a small CPU or cryptographic challenge first, which is cheap for one human visit but expensive when you’re firing thousands of requests.
A proxy is an intermediary server that sends requests on your behalf, so the target site sees the proxy IP instead of yours. In scraping, proxies are mainly used to reduce IP-based blocking, spread traffic, and make requests appear from specific networks or countries, but proxies alone do not solve rendering, fingerprinting, or rate-limit problems.
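Routing requests through one is usually a config change, not a code change. A sketch with a placeholder proxy address:

```python
import requests

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",  # placeholder credentials
    "https": "http://user:pass@proxy.example.com:8080",
}
# The target sees the proxy's IP, not yours.
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
```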
RDFa is a way to embed structured data directly into HTML using attributes on existing elements. In scraping, it matters when a page exposes metadata like product details, authorship, or schema markup in the DOM instead of putting it in JSON-LD.
A residential proxy routes your requests through IP addresses assigned by consumer internet providers, so the traffic looks like it is coming from a normal home user instead of a data center. In scraping, people use them because they get blocked less often on sites that score IP reputation aggressively, but they cost more and add another thing that can fail.
REST, short for Representational State Transfer, is a common way to design web APIs around resources, HTTP methods, and standard status codes. In practice, it means you make predictable requests like GET, POST, PUT, and DELETE to URLs and get structured responses back, usually JSON.
Robots.txt is a text file on a website, usually served from /robots.txt, that tells crawlers which paths they are allowed or asked not to crawl. It is a crawler-facing policy file, not an enforcement mechanism, so decent bots read it and bad ones ignore it.
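Python ships a parser for it in the standard library:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetches and parses the file

# Check a path for a given user agent before crawling it.
print(parser.can_fetch("MyCrawler", "https://example.com/private/page"))
```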
Rotating proxies are proxy networks that change the IP address used for outgoing requests, either on every request or on a defined schedule. In scraping, they help reduce bans, rate limits, and captchas, but they do not magically fix bad request patterns, broken sessions, or sloppy scraper behavior.
RSS is an XML-based feed format websites use to publish new content in a structured, machine-readable way. For scraping, it’s the easy path when a site gives you one: fewer moving parts, less breakage, and no need to render pages just to detect what changed.
A sandbox is a website or environment built for testing scraping code without the usual production mess. It gives you stable pages, predictable structure, and explicit permission to scrape, which makes it useful for learning, debugging selectors, and checking whether your tooling works before you point it at real sites.
A scraping API is a web scraping service that routes each request through the scraper and proxy setup that fit the target, then returns one normalized response. It is built for production scraping, where the real problem is not fetching one page once, but keeping many targets working without filling your codebase with routing, rendering, retry, and provider-specific logic.
SDK stands for software development kit: a packaged set of code, helpers, and documentation that makes it easier to use an API or platform from your language of choice. In scraping, an SDK usually saves you from hand-rolling request signing, retries, headers, and response parsing every time.
SERP stands for search engine results page: the page a search engine returns after someone searches for a query. In scraping, it usually means collecting structured data from Google, Bing, or other search result pages without manually parsing a mess of ads, maps, snippets, and ranking changes every week.
A session is the state a site keeps across multiple requests so it can treat them as coming from the same user flow. In scraping, that usually means cookies, auth state, cart state, CSRF tokens, or other bits that need to persist, otherwise things work for one request and then quietly break on the next.
Shadow DOM is a browser feature that lets a component keep its HTML and CSS inside an isolated subtree, so normal selectors often can’t see or reach it. For scraping, that usually means the element exists on the page but your parser or selector still comes back empty unless you explicitly traverse into the shadow root.
Sitemap.xml is an XML file that lists URLs a site wants crawlers to find, often with metadata like last modified date, update frequency, or priority. For scraping, it is one of the simplest ways to discover pages at scale without clicking through navigation, category trees, or endless pagination.
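Pulling every URL out of one takes a few lines, since entries live in the standard sitemaps.org namespace:

```python
import requests
import xml.etree.ElementTree as ET

xml_bytes = requests.get("https://example.com/sitemap.xml", timeout=10).content  # placeholder URL
root = ET.fromstring(xml_bytes)

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
```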
SNI stands for Server Name Indication, a TLS extension that lets a client say which hostname it wants before the HTTPS connection is fully set up. In practice, that matters because many sites share the same IP, and if the SNI value is wrong, missing, or blocked, the request can fail before scraping even gets to HTTP.
A SOCKS5 proxy uses SOCKS5, a low-level proxy protocol that forwards network traffic without rewriting it the way HTTP proxies do. It works with more kinds of traffic, including HTTPS, WebSockets, and non-HTTP connections, which makes it useful in scraping setups where HTTP proxies start breaking in annoying ways.
SSR stands for server-side rendering: the server builds the HTML before sending it to the browser. For scraping, that matters because SSR pages often expose the data you need directly in the initial response, so you can skip the whole browser-rendering mess.
Tarpitting is a defensive trick where a server deliberately slows down or traps a client instead of blocking it cleanly. The point is to waste the scraper, spammer, or scanner’s time and resources, which matters because a slow failure can be more expensive than a fast one in production.
TCP, or Transmission Control Protocol, is the transport layer protocol that makes sure data arrives reliably and in order between a client and server. In scraping, it sits underneath HTTP and HTTPS, so when requests fail before you even get a response, the problem is often down at the TCP level, not in your parser or request code.
TLS, short for Transport Layer Security, is the protocol that secures HTTPS connections by encrypting traffic between a client and a server. In scraping, it matters for more than encryption: sites also look at how your client performs the TLS handshake, and that fingerprint can be enough to get you blocked even if your requests are otherwise correct.
A TLS handshake is the first part of an HTTPS connection where the client and server agree on how to talk securely and exchange the keys used for encryption. For scraping, it matters because sites can inspect handshake details before any HTTP request is processed, which means a bad or unusual client fingerprint can get you flagged early.
TTFB stands for Time to First Byte: the time between making a request and receiving the first byte of the response. It tells you how long the network, server, TLS handshake, redirects, and backend work took before anything actually started coming back, so when it is bad, everything after it starts late too.
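A rough way to see it from Python: with stream=True, requests returns once headers arrive, so elapsed approximates TTFB rather than full download time:

```python
import requests

response = requests.get("https://example.com", stream=True, timeout=10)
print(response.elapsed.total_seconds())  # roughly time until headers arrive
response.close()
```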
A virtual DOM is an in-memory representation of a page’s DOM that frameworks like React or Vue use to figure out what changed before updating the real browser DOM. It exists to make UI updates easier to manage, but for scraping the important part is simpler: sometimes the HTML you want only shows up after JavaScript builds it.
A VPN, or virtual private network, routes your traffic through another server and gives you a different outward-facing IP address. For scraping, that can help with basic location testing or low-volume requests, but it is not a real replacement for scraping proxies because you usually do not get reliable rotation, concurrency, or control.
A WAF, or Web Application Firewall, sits in front of a site and filters requests before they reach the application. In scraping, it is one of the main things blocking you in production: rate limits, CAPTCHAs, bot checks, fingerprinting, and silent 403s often come from the WAF, not the site itself.
WebDriver is the browser automation interface tools like Selenium use to control a real browser. In scraping, it matters when the page only renders data after JavaScript runs or when you need to click, scroll, type, or wait for elements like an actual user session.
WebGL is the browser API that lets websites render GPU-accelerated 2D and 3D graphics inside a page, usually through a canvas element, without plugins. For scraping, it matters less because of the graphics themselves and more because WebGL exposes hardware and rendering details that anti-bot systems use for fingerprinting.
WebRTC is a browser technology for real-time peer-to-peer communication, usually used for audio, video, and direct data transfer between clients. In scraping, it matters less because you need to scrape WebRTC itself and more because it can leak your real IP, bypass proxy assumptions, and make browser automation behave differently than a plain HTTP client.
Whitelisting means explicitly allowing a specific IP, API key, domain, or account to access something that would otherwise be blocked or rate-limited. In scraping, it usually comes up when a target, proxy provider, or internal system only accepts traffic from approved sources, which is fine until your IPs change and things quietly break.
XHR stands for XMLHttpRequest, the browser API used by JavaScript to make HTTP requests in the background without reloading the page. In scraping, people often say “watch the XHRs” because those background requests tend to return the actual structured data, which is a lot cleaner than fighting brittle HTML.
XPath is a query language for selecting nodes inside XML or HTML documents using path-like expressions. In scraping, it’s mainly used to target elements precisely without looping through the whole DOM by hand, though brittle XPath selectors can break fast when a site’s structure shifts.
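For example, the same kind of price lookup shown under the CSS entry, expressed as XPath with the lxml package:

```python
from lxml import html

doc = html.fromstring('<div class="product"><span class="price">$19.99</span></div>')
# Select the text of any span whose class attribute is "price".
prices = doc.xpath('//span[@class="price"]/text()')
print(prices)  # ['$19.99']
```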