Examples
A lot of React and Next.js sites don't render product data directly into the HTML. The page loads, then the frontend sends a GraphQL request in the background and gets structured JSON back. If you can reproduce that request, you often skip a lot of brittle DOM scraping.
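import requests

# GraphQL endpoint (placeholder URL)
url = "https://example.com/graphql"

payload = {
    "operationName": "ProductPageQuery",
    "variables": {
        "slug": "running-shoe-123"
    },
    "query": """
        query ProductPageQuery($slug: String!) {
          product(slug: $slug) {
            id
            name
            price {
              amount
              currency
            }
            inStock
          }
        }
    """
}

# json= serializes the payload; the explicit header mirrors what the browser sends
resp = requests.post(
    url,
    json=payload,
    headers={"content-type": "application/json"},
    timeout=30,  # avoid hanging forever on a stalled connection
)
resp.raise_for_status()  # fail on HTTP-level errors before parsing the body
print(resp.json())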
You can also inspect GraphQL traffic in the browser network tab. Look for POST requests to paths like /graphql or /api/graphql, or for request bodies carrying query, operationName, and variables fields.
curl 'https://example.com/graphql' \
  -H 'content-type: application/json' \
  --data-raw '{
    "operationName":"ProductPageQuery",
    "variables":{"slug":"running-shoe-123"},
    "query":"query ProductPageQuery($slug: String!) { product(slug: $slug) { id name inStock } }"
  }'
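When the network tab is crowded, one option is to export the session as a HAR file and filter it offline. The sketch below is illustrative only: find_graphql_requests is a hypothetical helper, and it assumes the standard HAR layout (log.entries[].request.postData.text) and JSON request bodies. It flags POST requests whose body carries a query or operationName field, and it treats a JSON array as a batch of operations.

```python
import json
from typing import Dict, List


def find_graphql_requests(har: dict) -> List[Dict]:
    """Return URL, operation name and variables for each GraphQL-looking POST."""
    found = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        if request.get("method") != "POST":
            continue
        text = (request.get("postData") or {}).get("text", "")
        try:
            body = json.loads(text)
        except (TypeError, ValueError):
            continue  # body is not JSON, so not a plain GraphQL POST
        # Some clients batch several operations into one JSON array.
        operations = body if isinstance(body, list) else [body]
        for op in operations:
            if isinstance(op, dict) and ("query" in op or "operationName" in op):
                found.append({
                    "url": request.get("url"),
                    "operationName": op.get("operationName"),
                    "variables": op.get("variables", {}),
                })
    return found


if __name__ == "__main__":
    # Tiny hand-written HAR fragment standing in for a real export.
    sample = {"log": {"entries": [
        {"request": {"method": "GET", "url": "https://example.com/"}},
        {"request": {
            "method": "POST",
            "url": "https://example.com/graphql",
            "postData": {"text": json.dumps({
                "operationName": "ProductPageQuery",
                "variables": {"slug": "running-shoe-123"},
                "query": "query ProductPageQuery($slug: String!) { product(slug: $slug) { id } }",
            })},
        }},
    ]}}
    for hit in find_graphql_requests(sample):
        print(hit["url"], hit["operationName"], hit["variables"])
```

With a real export, load the file with json.load and pass the resulting dict to the helper.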
Practical tips
- Check the network first: if a site uses GraphQL, the useful data is often already in a structured API response. That's a better starting point than trying to reverse-engineer messy frontend HTML.
- Copy the full request shape: GraphQL requests often depend on more than the query itself. Headers, cookies, persisted query hashes, auth tokens, and CSRF tokens can all be required.
- Watch for persisted queries: some apps don't send the full query text, only a hash and variables. That works fine until the site rotates hashes or changes its client bundle.
- Don't assume it's public: GraphQL is just an API layer. It can still be rate limited, blocked, geo-restricted, or tied to a logged-in session.
- Expect schema drift: field names, nesting, and operation names change. It's still often more stable than scraping HTML, but it's not magic.
- Use browser rendering when needed: if the request depends on tokens generated in-session or anti-bot checks, reproducing GraphQL directly may be annoying. In those cases, rendering the page and capturing the final network flow is often faster. That's the practical reason router layers exist.
- Validate nulls and partial responses: GraphQL can return data and errors in the same response. Don't treat every 200 response as success.
{
  "data": {
    "product": null
  },
  "errors": [
    {
      "message": "Product not found"
    }
  ]
}
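A response like the one above can be checked with a small helper before the data reaches the rest of the scraper. This is a minimal sketch, not a standard API: unwrap and GraphQLResponseError are hypothetical names, and it assumes the body has already been parsed from JSON into a dict. It raises when the requested top-level field is null and otherwise returns the value together with any error messages, so partial responses stay visible.

```python
from typing import Any, List, Tuple


class GraphQLResponseError(Exception):
    """Raised when the requested field is missing or null in the response."""


def unwrap(body: dict, field: str) -> Tuple[Any, List[str]]:
    """Return (data[field], error messages); raise if the field is null."""
    messages = [
        err.get("message", "unknown error")
        for err in body.get("errors") or []
        if isinstance(err, dict)
    ]
    data = body.get("data") or {}
    value = data.get(field)
    if value is None:
        detail = "; ".join(messages) or "no errors reported"
        raise GraphQLResponseError(f"{field} is null: {detail}")
    # A non-null value alongside errors is a partial response: usable,
    # but the caller should log or inspect the messages.
    return value, messages


if __name__ == "__main__":
    failed = {"data": {"product": None}, "errors": [{"message": "Product not found"}]}
    try:
        unwrap(failed, "product")
    except GraphQLResponseError as exc:
        print("rejected:", exc)
```

Calling it right after resp.json() keeps the HTTP status check and the GraphQL-level check as two separate steps.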
Use cases
- Ecommerce scraping: pull product title, price, stock status, variants, and reviews from the GraphQL calls the frontend already makes.
- Search and listing extraction: many category pages fetch paginated results through GraphQL rather than embedding everything in HTML.
- Authenticated dashboards: internal tools, seller panels, and account pages often use GraphQL heavily once you're inside a logged-in session.
- Dynamic single-page apps: if the page shell is mostly useless HTML and the real content arrives through API calls, GraphQL is often the cleanest path.
- Lower-maintenance scrapers: when HTML classes churn every other week, a stable backend query can save a lot of pointless maintenance time.
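For the search and listing case, paginated GraphQL results are often walked with a cursor. The sketch below assumes a Relay-style connection shape (edges, node, pageInfo.hasNextPage, pageInfo.endCursor), which many but not all APIs use, so check the real response before relying on it. iter_nodes and fetch_page are hypothetical names; fetch_page stands in for whatever function sends the query with the given cursor and returns the connection object.

```python
from typing import Callable, Iterator, Optional


def iter_nodes(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield every node from a cursor-paginated connection, page by page."""
    cursor = None  # None requests the first page
    for _ in range(max_pages):  # hard cap guards against endless loops
        connection = fetch_page(cursor)
        for edge in connection.get("edges", []):
            yield edge["node"]
        page_info = connection.get("pageInfo", {})
        if not page_info.get("hasNextPage"):
            return
        cursor = page_info.get("endCursor")


if __name__ == "__main__":
    # Fake two-page backend keyed by cursor, standing in for real requests.
    pages = {
        None: {"edges": [{"node": {"id": "1"}}, {"node": {"id": "2"}}],
               "pageInfo": {"hasNextPage": True, "endCursor": "c2"}},
        "c2": {"edges": [{"node": {"id": "3"}}],
               "pageInfo": {"hasNextPage": False, "endCursor": None}},
    }
    for node in iter_nodes(lambda cursor: pages[cursor]):
        print(node["id"])
```

In a real scraper, fetch_page would post the listing query with the cursor as a variable and return the connection field from the parsed response.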