Glossary

JSON-LD

JSON-LD (JSON for Linking Data) is structured data embedded in a page, usually inside a <script type="application/ld+json"> tag. For scraping, it matters because sites often put clean entity data there: product details, article metadata, breadcrumbs, ratings, offers, and other fields that are much easier to parse than the visible HTML.

Examples

A lot of pages expose the useful stuff in JSON-LD instead of making you reverse-engineer messy HTML.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Mechanical Keyboard",
  "sku": "MK-104",
  "brand": {
    "@type": "Brand",
    "name": "KeyCo"
  },
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "129.99",
    "availability": "https://schema.org/InStock"
  }
}
</script>

In Python, extracting it is often simpler than scraping the rendered page structure:

import json
from bs4 import BeautifulSoup

with open("page.html", "r", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "html.parser")

# A page can carry several JSON-LD blocks; walk them all.
blocks = soup.select('script[type="application/ld+json"]')
for block in blocks:
    try:
        data = json.loads(block.get_text(strip=True))
        print(data)
    except json.JSONDecodeError:
        pass

If you're fetching the page through ScrapeRouter first, the extraction step stays the same. You just swap out the network layer:

import requests
import json
from bs4 import BeautifulSoup

resp = requests.post(
    "https://www.scraperouter.com/api/v1/scrape/",
    headers={"Authorization": "Api-Key $api_key"},
    json={"url": "https://example.com/product/123"}
)
resp.raise_for_status()

html = resp.text
soup = BeautifulSoup(html, "html.parser")

for block in soup.select('script[type="application/ld+json"]'):
    try:
        print(json.loads(block.get_text(strip=True)))
    except json.JSONDecodeError:
        continue

Practical tips

  • Check JSON-LD before building a fragile HTML parser. A surprising number of ecommerce, article, recipe, job, and event pages already expose the fields you want.
  • Expect multiple blocks on one page: breadcrumbs, organization info, product data, FAQ data. Don't assume the first script tag is the one you need.
  • Handle every shape you'll meet: a single object, a top-level array, and @graph containers holding multiple entities.
  • Validate important fields against the rendered page when accuracy matters. JSON-LD is cleaner, but it can still be stale, incomplete, or generated for SEO rather than users.
  • Keep a fallback parser. In production, some pages remove structured data, break the JSON, or only include partial fields.
  • Watch for malformed JSON: trailing commas, bad escaping, HTML entities, or multiple objects jammed together. This happens more than people like to admit.
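
When a block fails strict parsing, a little cleanup before giving up recovers many of these cases. A minimal sketch, assuming two repairs are enough for your pages — note the trailing-comma regex is a heuristic, not a real JSON parser, and can misfire on strings that happen to contain ",}":

```python
import html
import json
import re


def loads_lenient(raw):
    """Try strict parsing first, then a couple of common repairs."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Unescape HTML entities (&quot; and friends) that some CMSes leave behind.
    cleaned = html.unescape(raw)
    # Strip trailing commas before } or ] -- a heuristic, not a real parser.
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None
```

Returning None instead of raising keeps the caller's loop simple: skip the block and move on, the same way the extractors above do.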

A simple extractor that handles common shapes:

import json
from bs4 import BeautifulSoup


def extract_jsonld(html):
    soup = BeautifulSoup(html, "html.parser")
    items = []

    for block in soup.select('script[type="application/ld+json"]'):
        raw = block.get_text(strip=True)
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue

        if isinstance(data, list):
            items.extend(data)
        elif isinstance(data, dict) and "@graph" in data and isinstance(data["@graph"], list):
            items.extend(data["@graph"])
        else:
            items.append(data)

    return items
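
Once you have a flat list of entities, picking out the one you care about is just a filter on @type. A self-contained usage sketch, with invented markup standing in for a real page:

```python
import json
from bs4 import BeautifulSoup

# Invented markup for illustration: one @graph block holding two entities.
html_doc = """
<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@graph": [
  {"@type": "BreadcrumbList", "itemListElement": []},
  {"@type": "Product", "name": "Mechanical Keyboard", "sku": "MK-104"}
]}
</script></head><body></body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
items = []
for block in soup.select('script[type="application/ld+json"]'):
    data = json.loads(block.get_text(strip=True))
    # Flatten @graph containers into a single entity list.
    items.extend(data["@graph"] if isinstance(data, dict) and "@graph" in data else [data])

products = [i for i in items if i.get("@type") == "Product"]
print(products[0]["name"])  # Mechanical Keyboard
```

Filtering after flattening means the same two lines work whether the site ships one block per entity or everything in a single @graph.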

Use cases

  • Product scraping: name, SKU, brand, price, currency, availability, aggregate rating.
  • Article extraction: headline, author, publish date, modified date, image, article section.
  • Local business data: address, opening hours, phone number, geo coordinates.
  • SERP and SEO monitoring: compare what a site declares in structured data versus what actually renders.
  • Stability-first scrapers: use JSON-LD as the primary source, HTML selectors as fallback. That's often the cheapest setup to maintain over time.
  • Entity extraction at scale: when you're crawling thousands of pages, pulling structured data is a lot less painful than maintaining dozens of brittle DOM parsers.
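
The stability-first pattern above can be sketched as one small helper: read JSON-LD first, and only reach for a CSS selector when it's missing. The h1.product-title selector here is a made-up placeholder, not any real site's markup:

```python
import json
from bs4 import BeautifulSoup


def product_name(html):
    """Prefer JSON-LD; fall back to a CSS selector if it is missing."""
    soup = BeautifulSoup(html, "html.parser")

    # Primary source: a JSON-LD Product block.
    for block in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(block.get_text(strip=True))
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data.get("name")

    # Fallback: a hypothetical selector for the rendered title.
    node = soup.select_one("h1.product-title")
    return node.get_text(strip=True) if node else None
```

When the site drops or breaks its structured data, the scraper degrades to the selector instead of returning nothing, which is usually the cheaest failure mode to live with.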

Related terms

  • Structured Data
  • Schema.org
  • Microdata
  • HTML Parsing
  • CSS Selectors
  • XPath
  • Web Scraping