Examples
Many pages expose the data you actually want in JSON-LD, so you don't have to reverse-engineer messy HTML.
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Mechanical Keyboard",
  "sku": "MK-104",
  "brand": {
    "@type": "Brand",
    "name": "KeyCo"
  },
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "129.99",
    "availability": "https://schema.org/InStock"
  }
}
</script>
```
In Python, extracting it is often simpler than scraping the rendered page structure:
```python
import json
from bs4 import BeautifulSoup

with open("page.html", "r", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "html.parser")
blocks = soup.select('script[type="application/ld+json"]')

for block in blocks:
    try:
        data = json.loads(block.get_text(strip=True))
        print(data)
    except json.JSONDecodeError:
        pass  # skip blocks with broken JSON
```
If you're fetching the page through ScrapeRouter first, the extraction step stays the same. You just swap out the network layer:
```python
import json

import requests
from bs4 import BeautifulSoup

resp = requests.post(
    "https://www.scraperouter.com/api/v1/scrape/",
    headers={"Authorization": "Api-Key $api_key"},
    json={"url": "https://example.com/product/123"},
)
resp.raise_for_status()
html = resp.text

soup = BeautifulSoup(html, "html.parser")
for block in soup.select('script[type="application/ld+json"]'):
    try:
        print(json.loads(block.get_text(strip=True)))
    except json.JSONDecodeError:
        continue
```
Practical tips
- Check JSON-LD before building a fragile HTML parser. A surprising number of ecommerce, article, recipe, job, and event pages already expose the fields you want.
- Expect multiple blocks on one page: breadcrumbs, organization info, product data, FAQ data. Don't assume the first script tag is the one you need.
- Handle both shapes: a single object, and arrays or `@graph` containers with multiple entities.
- Validate important fields against the rendered page when accuracy matters. JSON-LD is cleaner, but it can still be stale, incomplete, or generated for SEO rather than for users.
- Keep a fallback parser. In production, some pages remove structured data, break the JSON, or only include partial fields.
- Watch for malformed JSON: trailing commas, bad escaping, HTML entities, or multiple objects jammed together. This happens more than people like to admit.
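Some of those malformed cases can be rescued with a lenient preprocessing pass before giving up on a block. A minimal sketch; the cleanup rules here (entity decoding, trailing-comma stripping) are assumptions to tune per site, not an exhaustive repair strategy:

```python
import html
import json
import re

def parse_jsonld_lenient(raw):
    """Try strict parsing first, then retry after a few common cleanups."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    cleaned = html.unescape(raw)                      # decode entities like &quot;
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)  # drop trailing commas
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # genuinely broken; fall back to HTML parsing

print(parse_jsonld_lenient('{"name": "Widget",}'))  # {'name': 'Widget'}
```

Returning `None` instead of raising keeps the caller's loop simple: skip the block and move on.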
A simple extractor that handles common shapes:
```python
import json
from bs4 import BeautifulSoup

def extract_jsonld(html):
    """Return a flat list of JSON-LD entities found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.select('script[type="application/ld+json"]'):
        raw = block.get_text(strip=True)
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(data, list):
            items.extend(data)
        elif isinstance(data, dict) and isinstance(data.get("@graph"), list):
            items.extend(data["@graph"])
        else:
            items.append(data)
    return items
```
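Once you have a flat list of entities, selecting the one you care about is a type filter. One wrinkle worth handling: `@type` can be a string or a list of strings, and both appear in the wild. A small usage sketch (the sample `items` list is illustrative):

```python
def find_by_type(items, wanted):
    """Return entities whose @type matches, accepting string or list forms."""
    matches = []
    for item in items:
        t = item.get("@type") if isinstance(item, dict) else None
        types = t if isinstance(t, list) else [t]
        if wanted in types:
            matches.append(item)
    return matches

# Typical page: several unrelated entities side by side.
items = [
    {"@type": "BreadcrumbList", "itemListElement": []},
    {"@type": ["Product", "IndividualProduct"], "name": "Mechanical Keyboard"},
]
print(find_by_type(items, "Product")[0]["name"])  # Mechanical Keyboard
```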
Use cases
- Product scraping: name, SKU, brand, price, currency, availability, aggregate rating.
- Article extraction: headline, author, publish date, modified date, image, article section.
- Local business data: address, opening hours, phone number, geo coordinates.
- SERP and SEO monitoring: compare what a site declares in structured data versus what actually renders.
- Stability-first scrapers: use JSON-LD as the primary source, HTML selectors as fallback. That's often the cheapest setup to maintain over time.
- Entity extraction at scale: when you're crawling thousands of pages, pulling structured data is a lot less painful than maintaining dozens of brittle DOM parsers.
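The stability-first pattern above can be sketched as: prefer the JSON-LD value, and invoke the HTML-selector path only when the structured data is missing or incomplete. The function and field names here are illustrative assumptions, not a fixed API:

```python
def product_name(jsonld_items, html_fallback):
    """Prefer the JSON-LD Product name; call the HTML extractor as fallback.

    html_fallback is any zero-argument callable wrapping your selector logic.
    Returns (value, source) so callers can monitor how often the fallback fires.
    """
    for item in jsonld_items:
        types = item.get("@type")
        types = types if isinstance(types, list) else [types]
        if "Product" in types and item.get("name"):
            return item["name"], "jsonld"
    return html_fallback(), "html"

# Structured data present -> cheap path wins.
print(product_name([{"@type": "Product", "name": "Mechanical Keyboard"}],
                   lambda: "Fallback Title"))  # ('Mechanical Keyboard', 'jsonld')
# Structured data absent -> selector path fires.
print(product_name([], lambda: "Fallback Title"))  # ('Fallback Title', 'html')
```

Tracking the `source` tag in production tells you when a site drops its structured data, before the fallback selectors start breaking too.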