Examples
A common case is Next.js. The page HTML often includes serialized app data in script tags, and you can pull that directly instead of driving a browser around and hoping selectors stay stable.
```python
import json

import requests
from bs4 import BeautifulSoup

url = "https://example.com/product/123"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Next.js serializes the page's data into this script tag.
script = soup.find("script", id="__NEXT_DATA__")
if script:
    data = json.loads(script.string)
    product = data["props"]["pageProps"].get("product")
    print(product)
```
If the site only becomes usable after hydration, you may need a browser-capable scraper.
```bash
curl -X POST "https://www.scraperouter.com/api/v1/scrape/" \
  -H "Authorization: Api-Key $api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/app",
    "render": true
  }'
```
Practical tips
- Check the raw HTML before reaching for browser automation: look for `__NEXT_DATA__`, `window.__INITIAL_STATE__`, JSON blobs, and inline script tags.
- If the data is in the HTML before hydration, extract that instead of scraping rendered DOM text: it is faster, cheaper, and usually less fragile.
- If hydration fetches data from XHR or GraphQL after load, inspect network calls: scraping the API response is often cleaner than scraping the UI.
- Don’t confuse hydrated UI with data availability: the page can look empty in the browser for a moment while the raw response already contains what you need.
- In production, hydration patterns change during frontend rewrites: selectors break, embedded JSON keys move, script tag IDs change. Monitor for that; don't assume it stays fixed.
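The first tip above can be automated. Here is a minimal sketch of a marker check that runs against raw HTML before you decide whether a browser is needed; the marker list and the sample HTML are illustrative, not exhaustive:

```python
# Strings that suggest data is embedded in the raw HTML before hydration.
# Illustrative list: real sites may use other patterns.
HYDRATION_MARKERS = [
    "__NEXT_DATA__",             # Next.js serialized page data
    "window.__INITIAL_STATE__",  # common Redux/SSR pattern
    "__NUXT__",                  # Nuxt.js equivalent
]

def find_hydration_markers(html: str) -> list[str]:
    """Return the hydration markers present in the raw HTML, if any."""
    return [m for m in HYDRATION_MARKERS if m in html]

# Sample raw response: if this returns a non-empty list, try extracting
# the embedded JSON before paying for full browser rendering.
html = '<script id="__NEXT_DATA__" type="application/json">{"props": {}}</script>'
print(find_hydration_markers(html))  # ['__NEXT_DATA__']
```

A non-empty result does not guarantee the data you want is in the payload, but it is a cheap first check before escalating to rendering.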
Use cases
- Scraping Next.js sites: parse `__NEXT_DATA__` instead of waiting for cards, tables, or product widgets to appear.
- Reducing browser usage: skip full rendering when the important data is already embedded in the initial HTML.
- Stabilizing extraction: pull structured hydration data rather than relying on brittle CSS selectors tied to the visual layer.
- Debugging missing data: compare raw HTML vs rendered page to see whether the issue is hydration, delayed API calls, or anti-bot behavior.
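The last use case can be sketched with a small helper (hypothetical, assuming a `__NEXT_DATA__` script tag with the `id` attribute written first) that searches the serialized payload for a value you expected to see in the rendered UI. If the search finds it, the problem is extraction, not hydration or anti-bot behavior:

```python
import json
import re

def find_in_next_data(html: str, needle: str):
    """Search the serialized __NEXT_DATA__ payload for a value and
    return the JSON paths where it appears, or None if no payload exists."""
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.S)
    if not m:
        return None
    data = json.loads(m.group(1))
    hits = []

    def walk(node, path):
        if isinstance(node, dict):
            for k, v in node.items():
                walk(v, path + [k])
        elif isinstance(node, list):
            for i, v in enumerate(node):
                walk(v, path + [str(i)])
        elif needle in str(node):  # leaf value: check for the needle
            hits.append(".".join(path))

    walk(data, [])
    return hits

# Sample payload; a real page's structure will differ.
html = ('<script id="__NEXT_DATA__" type="application/json">'
        '{"props": {"pageProps": {"product": {"name": "Widget"}}}}</script>')
print(find_in_next_data(html, "Widget"))  # ['props.pageProps.product.name']
```

`None` means there is no embedded payload to mine, which points toward delayed API calls or blocking rather than a parsing mistake.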