Glossary

GraphQL

GraphQL is an API query language that lets a client ask for exactly the fields it wants instead of taking a fixed response shape from a REST endpoint. In scraping, it matters because many modern sites load data through GraphQL behind the frontend, which is often cleaner and more stable to work with than parsing constantly changing HTML.

Examples

A lot of React and Next.js sites don't render product data directly into the HTML. The page loads, then the frontend sends a GraphQL request in the background and gets structured JSON back. If you can reproduce that request, you can often skip a lot of brittle DOM scraping.

import requests

url = "https://example.com/graphql"
payload = {
    "operationName": "ProductPageQuery",
    "variables": {
        "slug": "running-shoe-123"
    },
    "query": """
    query ProductPageQuery($slug: String!) {
      product(slug: $slug) {
        id
        name
        price {
          amount
          currency
        }
        inStock
      }
    }
    """
}

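# A timeout keeps the script from hanging on a stalled connection.
resp = requests.post(url, json=payload, headers={
    "content-type": "application/json"
}, timeout=30)
resp.raise_for_status()  # surface HTTP-level failures (4xx/5xx) early

print(resp.json())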

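You can also inspect GraphQL traffic in the browser network tab. Look for POST requests to paths like /graphql or /api/graphql, or for any request whose JSON body carries query, operationName, and variables fields. Most browsers let you copy such a request as a cURL command, which looks roughly like this: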

curl 'https://example.com/graphql' \
  -H 'content-type: application/json' \
  --data-raw '{
    "operationName":"ProductPageQuery",
    "variables":{"slug":"running-shoe-123"},
    "query":"query ProductPageQuery($slug: String!) { product(slug: $slug) { id name inStock } }"
  }'
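
When a page fires dozens of requests, picking out the GraphQL ones by eye gets tedious. A minimal sketch, assuming you have exported the network tab as a HAR file and loaded it with json.load: the function below scans the entries for POST requests whose body looks like a GraphQL operation. The name find_graphql_requests is hypothetical, and the "looks like GraphQL" check (a query or operationName key) is a heuristic, not a guarantee.

```python
import json


def find_graphql_requests(har: dict) -> list[dict]:
    """Return POST requests in a HAR export whose JSON body looks like GraphQL."""
    matches = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        if request.get("method") != "POST":
            continue
        body_text = (request.get("postData") or {}).get("text", "")
        try:
            body = json.loads(body_text)
        except (TypeError, ValueError):
            continue  # body is not JSON, so not a standard GraphQL POST
        # Batched requests arrive as a JSON array of operations.
        operations = body if isinstance(body, list) else [body]
        for op in operations:
            # Heuristic: GraphQL bodies carry "query" and/or "operationName".
            if isinstance(op, dict) and ("query" in op or "operationName" in op):
                matches.append({
                    "url": request.get("url"),
                    "operationName": op.get("operationName"),
                    "variables": op.get("variables"),
                })
    return matches
```

Each match gives you the endpoint, operation name, and variables to start from when rebuilding the request in code.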

Practical tips

  • Check the network first: if a site uses GraphQL, the useful data is often already in a structured API response. That's a better starting point than trying to reverse-engineer messy frontend HTML.
  • Copy the full request shape: GraphQL requests often depend on more than the query itself, including headers, cookies, persisted query hashes, auth tokens, and CSRF tokens.
  • Watch for persisted queries: some apps don't send the full query text, only a hash and variables. Replaying the hash works until the site rotates hashes or ships a new client bundle.
  • Don't assume it's public: GraphQL is just an API layer. It can still be rate limited, blocked, geo-restricted, or tied to a logged-in session.
  • Expect schema drift: field names, nesting, and operation names change. It's still often more stable than scraping HTML, but it's not magic.
  • Use browser rendering when needed: if the request depends on tokens generated in-session or anti-bot checks, reproducing GraphQL directly may be annoying. In those cases, rendering the page and capturing the final network flow is often faster. That's the practical reason router layers exist.
  • Validate nulls and partial responses: GraphQL can return data and errors in the same response, so don't treat every 200 response as success. A failed lookup can look like this:
{
  "data": {
    "product": null
  },
  "errors": [
    {
      "message": "Product not found"
    }
  ]
}
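
A response like the one above can be checked before any field is read. The following is a minimal sketch, not a complete error-handling strategy: unwrap_field and GraphQLResponseError are hypothetical names, and the rule "a null top-level field counts as failure" is an assumption that may not suit every query.

```python
class GraphQLResponseError(Exception):
    """Raised when the field we need is missing or null in a GraphQL response."""


def unwrap_field(body: dict, field: str):
    """Return (value, errors) for a top-level data field, or raise if it is null.

    `errors` is returned alongside the value because GraphQL allows partial
    success: the field we want may be present while other fields failed.
    """
    errors = body.get("errors") or []
    data = body.get("data") or {}
    value = data.get(field)
    if value is None:
        messages = "; ".join(e.get("message", "unknown error") for e in errors)
        raise GraphQLResponseError(messages or f"'{field}' is null or missing")
    return value, errors
```

With the earlier requests example, this would be called as unwrap_field(resp.json(), "product"). Logging the returned errors list, even when a value comes back, makes partial failures visible instead of silently dropping fields.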

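For the persisted-query tip above, it helps to see what such a request body looks like. This sketch assumes the site follows the Apollo-style automatic persisted queries convention, where the body carries an extensions.persistedQuery object with a SHA-256 hash of the query text and the server reports an unknown hash with a PersistedQueryNotFound error. Many sites use their own scheme, so treat the shape, and the hypothetical helper names build_persisted_payload and is_persisted_query_miss, as assumptions to verify against captured traffic.

```python
import hashlib


def build_persisted_payload(operation_name: str, variables: dict, query_text: str) -> dict:
    """Build a hash-only request body (assumed Apollo-style persisted query)."""
    sha = hashlib.sha256(query_text.encode("utf-8")).hexdigest()
    return {
        "operationName": operation_name,
        "variables": variables,
        # No "query" key: the server is expected to look the text up by hash.
        "extensions": {"persistedQuery": {"version": 1, "sha256Hash": sha}},
    }


def is_persisted_query_miss(body: dict) -> bool:
    """True if the response says the server does not know the hash."""
    for error in body.get("errors") or []:
        code = (error.get("extensions") or {}).get("code")
        if error.get("message") == "PersistedQueryNotFound" or code == "PERSISTED_QUERY_NOT_FOUND":
            return True
    return False
```

In practice the hash is usually copied from a captured request rather than computed, because it depends on the exact query text in the client bundle, whitespace included. A miss is the signal to re-capture the request from the live site.
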
Use cases

  • Ecommerce scraping: pull product title, price, stock status, variants, and reviews from the GraphQL calls the frontend already makes.
  • Search and listing extraction: many category pages fetch paginated results through GraphQL rather than embedding everything in HTML.
  • Authenticated dashboards: internal tools, seller panels, and account pages often use GraphQL heavily once you're inside a logged-in session.
  • Dynamic single-page apps: if the page shell is mostly useless HTML and the real content arrives through API calls, GraphQL is often the cleanest path.
  • Lower-maintenance scrapers: when HTML classes churn every other week, a stable backend query can save a lot of pointless maintenance time.
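
For the listing use case, paginated GraphQL results are commonly exposed as a cursor-based connection. A minimal sketch, assuming a Relay-style shape (edges, node, and pageInfo with hasNextPage and endCursor) under a hypothetical top-level products field: the actual field names on a given site will differ and should be read from its captured responses. fetch_page is a caller-supplied function that sends the request for a given cursor and returns the parsed JSON body.

```python
from typing import Callable, Iterator, Optional


def iter_products(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 50,
) -> Iterator[dict]:
    """Yield nodes from a cursor-paginated listing, one page at a time."""
    cursor = None  # None requests the first page
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        body = fetch_page(cursor)
        connection = (body.get("data") or {}).get("products") or {}
        for edge in connection.get("edges", []):
            yield edge["node"]
        page_info = connection.get("pageInfo", {})
        if not page_info.get("hasNextPage"):
            break
        cursor = page_info.get("endCursor")
```

Passing the fetch step in as a function keeps the paging logic separate from headers, cookies, and retries, which tend to be the site-specific part.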

Related terms

API Scraping, REST API, XHR, JSON, Browser Rendering, Headless Browser, Anti-Bot