Examples
A page can expose structured data with RDFa attributes right in the HTML:
```html
<div vocab="https://schema.org/" typeof="Product">
  <span property="name">Noise Cancelling Headphones</span>
  <span property="brand">Acme Audio</span>
  <span property="offers" typeof="Offer">
    <meta property="priceCurrency" content="USD" />
    <meta property="price" content="199.99" />
  </span>
</div>
```
If you're scraping, you have two options: parse the RDFa properly, or fall back to raw DOM selection when you only need a few fields:
```python
from bs4 import BeautifulSoup

html = """
<div vocab="https://schema.org/" typeof="Product">
<span property="name">Noise Cancelling Headphones</span>
<span property="brand">Acme Audio</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns None when the attribute is missing, so guard the call
def prop(name):
    el = soup.select_one(f'[property="{name}"]')
    return el.get_text(strip=True) if el else None

print({"name": prop("name"), "brand": prop("brand")})
```
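The DOM fallback above works for a couple of flat fields but ignores nesting. The "parse it properly" path can be sketched with just the standard library: a small `HTMLParser` subclass that tracks `typeof` scopes and collects `property` values. The `RDFaExtractor` class and its dict output are my own illustrative shapes, not a standard API — a real RDFa parser also handles prefixes, `resource`/`about`, datatypes, and messier markup.

```python
# Minimal RDFa sketch: tracks nested typeof scopes by element depth and
# collects property values from element text or <meta content="...">.
# Assumes well-formed markup with self-closed void elements, as in the
# snippet above; this is an illustration, not a spec-complete parser.
from html.parser import HTMLParser

class RDFaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []    # top-level typeof items found in the page
        self._stack = []   # open typeof scopes as (depth, item) pairs
        self._depth = 0
        self._prop = None  # property currently collecting text
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self._depth += 1
        prop = attrs.get("property")
        if "typeof" in attrs:
            item = {"@type": attrs["typeof"]}
            if self._stack and prop:
                # property + typeof on one element: nest the new item
                self._stack[-1][1][prop] = item
            else:
                self.items.append(item)
            self._stack.append((self._depth, item))
        elif prop and self._stack:
            if tag == "meta":
                # <meta property=... content=...> carries its value inline
                self._stack[-1][1][prop] = attrs.get("content", "")
            else:
                self._prop, self._buf = prop, []

    def handle_data(self, data):
        if self._prop is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None and tag != "meta":
            self._stack[-1][1][self._prop] = "".join(self._buf).strip()
            self._prop = None
        if self._stack and self._stack[-1][0] == self._depth:
            self._stack.pop()
        self._depth -= 1

html = """
<div vocab="https://schema.org/" typeof="Product">
<span property="name">Noise Cancelling Headphones</span>
<span property="brand">Acme Audio</span>
<span property="offers" typeof="Offer">
<meta property="priceCurrency" content="USD" />
<meta property="price" content="199.99" />
</span>
</div>
"""

parser = RDFaExtractor()
parser.feed(html)
print(parser.items)
```

Run against the product markup above, this yields one `Product` item with a nested `Offer` under its `offers` key.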
If the page is JS-rendered, you may need a browser-based scrape first so the RDFa is actually present in the final DOM:
```bash
curl -X POST https://www.scraperouter.com/api/v1/scrape/ \
  -H "Authorization: Api-Key $api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/123",
    "render": true
  }'
```
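The same call from Python, using only the standard library. The endpoint, `Api-Key` header scheme, and `render` flag are taken from the curl snippet; check the provider's docs for the exact contract. The actual network call is left commented out so the sketch stays side-effect-free:

```python
# Build the render-first scrape request with urllib; everything about the
# API shape here mirrors the curl example above.
import json
import urllib.request

def build_scrape_request(api_key, target_url, render=True):
    payload = json.dumps({"url": target_url, "render": render}).encode("utf-8")
    return urllib.request.Request(
        "https://www.scraperouter.com/api/v1/scrape/",
        data=payload,
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_scrape_request("YOUR_API_KEY", "https://example.com/product/123")
# html = urllib.request.urlopen(req).read().decode()  # then parse the rendered DOM
```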
Practical tips
- Check for RDFa when JSON-LD is missing: a lot of teams assume structured data means a script tag, then miss data sitting in normal HTML attributes.
- Look for the common RDFa attributes: `property`, `typeof`, `vocab`, `resource`, and `about`.
- Treat RDFa as one source, not truth from heaven: publishers break markup all the time, mix vocabularies, or leave stale values in the page.
- In production, support multiple extraction paths: JSON-LD, RDFa, Microdata, then plain CSS/XPath fallback.
- If the site hydrates client-side, make sure you're parsing the rendered DOM, not just the initial response.
- Don't over-engineer it if you only need one field: direct DOM selectors are often cheaper than a full RDFa parser.
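The multi-path idea above reduces to an ordered chain of attempts: JSON-LD first, then RDFa, then a site-specific selector. A minimal sketch with BeautifulSoup — the `extract_name` function, the field name, and the `h1.product-title` fallback selector are all illustrative choices, not a fixed schema:

```python
# Fallback chain: JSON-LD -> RDFa -> plain CSS selector. Returns the
# value and which path produced it, so callers can log coverage.
import json
from bs4 import BeautifulSoup

def extract_name(html):
    soup = BeautifulSoup(html, "html.parser")

    # 1. JSON-LD: <script type="application/ld+json"> blocks
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # broken publisher JSON is common; keep going
        if isinstance(data, dict) and "name" in data:
            return data["name"], "json-ld"

    # 2. RDFa: a property attribute in regular markup
    el = soup.select_one('[property="name"]')
    if el:
        return el.get_text(strip=True), "rdfa"

    # 3. Last resort: a site-specific selector (hypothetical class name)
    el = soup.select_one("h1.product-title")
    if el:
        return el.get_text(strip=True), "css"

    return None, None

name, source = extract_name(
    '<div typeof="Product"><span property="name">Acme Headphones</span></div>'
)
print(name, source)
```

Ordering matters: JSON-LD is usually the cleanest when present, so it goes first; the CSS fallback goes last because it breaks on every frontend redesign.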
Use cases
- SEO and schema auditing: extracting product, article, review, breadcrumb, or organization markup from pages.
- Metadata recovery in scraping pipelines: pulling fields from pages where there is no clean API and no JSON-LD.
- Monitoring publisher changes: catching when structured data disappears, moves, or gets partially broken after frontend updates.
- Content aggregation: extracting author, headline, publish date, and entity metadata from article pages.