Examples
A page can expose structured data with RDFa attributes right in the HTML:
```html
<div vocab="https://schema.org/" typeof="Product">
  <span property="name">Noise Cancelling Headphones</span>
  <span property="brand">Acme Audio</span>
  <span property="offers" typeof="Offer">
    <meta property="priceCurrency" content="USD" />
    <meta property="price" content="199.99" />
  </span>
</div>
```
If you're scraping, you have two options: parse the RDFa properly, or fall back to raw DOM selection when you only need a few fields:
```python
from bs4 import BeautifulSoup

html = """
<div vocab="https://schema.org/" typeof="Product">
<span property="name">Noise Cancelling Headphones</span>
<span property="brand">Acme Audio</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns None when the attribute is missing, so guard the call
def prop(name):
    el = soup.select_one(f'[property="{name}"]')
    return el.get_text(strip=True) if el else None

print({"name": prop("name"), "brand": prop("brand")})
```
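The DOM fallback above works for a couple of flat fields but ignores nesting. The "parse it properly" path can be sketched with just the standard library: a small `HTMLParser` subclass that tracks `typeof` scopes and collects `property` values. The `RDFaExtractor` class and its dict output are my own illustrative shapes, not a standard API — a real RDFa parser also handles prefixes, `resource`/`about`, datatypes, and messier markup.

```python
# Minimal RDFa sketch: tracks nested typeof scopes by element depth and
# collects property values from element text or <meta content="...">.
# Assumes well-formed markup with self-closed void elements, as in the
# snippet above; this is an illustration, not a spec-complete parser.
from html.parser import HTMLParser

class RDFaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []    # top-level typeof items found in the page
        self._stack = []   # open typeof scopes as (depth, item) pairs
        self._depth = 0
        self._prop = None  # property currently collecting text
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self._depth += 1
        prop = attrs.get("property")
        if "typeof" in attrs:
            item = {"@type": attrs["typeof"]}
            if self._stack and prop:
                # property + typeof on one element: nest the new item
                self._stack[-1][1][prop] = item
            else:
                self.items.append(item)
            self._stack.append((self._depth, item))
        elif prop and self._stack:
            if tag == "meta":
                # <meta property=... content=...> carries its value inline
                self._stack[-1][1][prop] = attrs.get("content", "")
            else:
                self._prop, self._buf = prop, []

    def handle_data(self, data):
        if self._prop is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None and tag != "meta":
            self._stack[-1][1][self._prop] = "".join(self._buf).strip()
            self._prop = None
        if self._stack and self._stack[-1][0] == self._depth:
            self._stack.pop()
        self._depth -= 1

html = """
<div vocab="https://schema.org/" typeof="Product">
<span property="name">Noise Cancelling Headphones</span>
<span property="brand">Acme Audio</span>
<span property="offers" typeof="Offer">
<meta property="priceCurrency" content="USD" />
<meta property="price" content="199.99" />
</span>
</div>
"""

parser = RDFaExtractor()
parser.feed(html)
print(parser.items)
```

Run against the product markup above, this yields one `Product` item with a nested `Offer` under its `offers` key.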
If the page is JS-rendered, you may need a browser-based scrape first so the RDFa is actually present in the final DOM:
```bash
curl -X POST https://www.scraperouter.com/api/v1/scrape/ \
  -H "Authorization: Api-Key $api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/123",
    "render": true
  }'
```
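The same call from Python, using only the standard library. The endpoint, `Api-Key` header scheme, and `render` flag are taken from the curl snippet; check the provider's docs for the exact contract. The actual network call is left commented out so the sketch stays side-effect-free:

```python
# Build the render-first scrape request with urllib; everything about the
# API shape here mirrors the curl example above.
import json
import urllib.request

def build_scrape_request(api_key, target_url, render=True):
    payload = json.dumps({"url": target_url, "render": render}).encode("utf-8")
    return urllib.request.Request(
        "https://www.scraperouter.com/api/v1/scrape/",
        data=payload,
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_scrape_request("YOUR_API_KEY", "https://example.com/product/123")
# html = urllib.request.urlopen(req).read().decode()  # then parse the rendered DOM
```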
Practical tips
- Check for RDFa when JSON-LD is missing: a lot of teams assume structured data means a script tag, then miss data sitting in normal HTML attributes.
- Look for the common RDFa attributes: `property`, `typeof`, `vocab`, `resource`, and `about`.
- Treat RDFa as one source, not truth from heaven: publishers break markup all the time, mix vocabularies, or leave stale values in the page.
- In production, support multiple extraction paths: JSON-LD, RDFa, Microdata, then plain CSS/XPath fallback.
- If the site hydrates client-side, make sure you're parsing the rendered DOM, not just the initial response.
- Don't over-engineer it if you only need one field: direct DOM selectors are often cheaper than a full RDFa parser.
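The multi-path idea above reduces to an ordered chain of attempts: JSON-LD first, then RDFa, then a site-specific selector. A minimal sketch with BeautifulSoup — the `extract_name` function, the field name, and the `h1.product-title` fallback selector are all illustrative choices, not a fixed schema:

```python
# Fallback chain: JSON-LD -> RDFa -> plain CSS selector. Returns the
# value and which path produced it, so callers can log coverage.
import json
from bs4 import BeautifulSoup

def extract_name(html):
    soup = BeautifulSoup(html, "html.parser")

    # 1. JSON-LD: <script type="application/ld+json"> blocks
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # broken publisher JSON is common; keep going
        if isinstance(data, dict) and "name" in data:
            return data["name"], "json-ld"

    # 2. RDFa: a property attribute in regular markup
    el = soup.select_one('[property="name"]')
    if el:
        return el.get_text(strip=True), "rdfa"

    # 3. Last resort: a site-specific selector (hypothetical class name)
    el = soup.select_one("h1.product-title")
    if el:
        return el.get_text(strip=True), "css"

    return None, None

name, source = extract_name(
    '<div typeof="Product"><span property="name">Acme Headphones</span></div>'
)
print(name, source)
```

Ordering matters: JSON-LD is usually the cleanest when present, so it goes first; the CSS fallback goes last because it breaks on every frontend redesign.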
Use cases
- SEO and schema auditing: extracting product, article, review, breadcrumb, or organization markup from pages.
- Metadata recovery in scraping pipelines: pulling fields from pages where there is no clean API and no JSON-LD.
- Monitoring publisher changes: catching when structured data disappears, moves, or gets partially broken after frontend updates.
- Content aggregation: extracting author, headline, publish date, and entity metadata from article pages.