Examples
A scraper usually reads the DOM after the page loads, then selects the parts it cares about.
```python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div class="product">
      <h2>Running Shoes</h2>
      <span class="price">$89</span>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
name = soup.select_one(".product h2").get_text(strip=True)
price = soup.select_one(".price").get_text(strip=True)
print({"name": name, "price": price})
```
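The same pattern extends to multiple elements: `select` returns every match, so one pass can build a list of records. A minimal sketch reusing the snippet above (the second product card is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2>Running Shoes</h2><span class="price">$89</span>
</div>
<div class="product">
  <h2>Trail Shoes</h2><span class="price">$119</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
products = [
    {
        # Scope each lookup to the card so names and prices stay paired.
        "name": card.select_one("h2").get_text(strip=True),
        "price": card.select_one(".price").get_text(strip=True),
    }
    for card in soup.select(".product")  # one dict per product card
]
print(products)
```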
On JavaScript-heavy pages, the DOM you get after rendering can be very different from the raw HTML response.
```python
# Playwright example: wait for the rendered DOM
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    page.wait_for_selector(".product")  # wait for the data, not just page load
    products = page.locator(".product").all_text_contents()
    print(products)
    browser.close()
```
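A quick way to see the raw-vs-rendered gap without launching a browser is to run the same selector against both the initial HTML and the final DOM. A contrived sketch, where both snippets are made-up stand-ins for a real HTTP response and a real rendered page:

```python
from bs4 import BeautifulSoup

# What an HTTP client might receive: an empty shell plus a script tag.
raw_html = '<div id="app"></div><script src="/bundle.js"></script>'

# What the browser's DOM might look like after the script runs.
rendered_html = '<div id="app"><span class="price">$89</span></div>'

raw_price = BeautifulSoup(raw_html, "html.parser").select_one(".price")
rendered_price = BeautifulSoup(rendered_html, "html.parser").select_one(".price")

print(raw_price)  # None: the data was never in the HTML response
print(rendered_price.get_text(strip=True))  # "$89": exists only after rendering
```

If the selector already matches in the raw response, plain HTTP fetching is enough; if it only matches in the rendered DOM, you need a browser (or the underlying API call).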
Practical tips
- Raw HTML and DOM are not always the same: if a site renders content with JavaScript, your HTTP client may miss data that appears fine in the browser.
- Selectors break: classes, nesting, and generated IDs change all the time, so DOM-based scraping works until it suddenly doesn't.
- Prefer stable attributes when possible: use things like data-testid, semantic tags, URLs, or visible text before grabbing random hashed class names.
- Wait for the right state: in browser automation, don't just wait for page load; wait for the specific DOM element that means the data is actually there.
- Keep extraction logic small: the more your scraper depends on deep DOM structure, the more maintenance you're signing up for.
- Use a rendered scraper only when needed: if the data is already in the initial response or an API call, parsing the DOM is usually slower and more expensive.
- If you're using ScrapeRouter, this is the kind of decision that matters in production: some pages need simple HTTP fetching, others need full rendering, and routing each request cleanly saves a lot of wasted retries and browser time.
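The "prefer stable attributes" tip is easy to see side by side. In this sketch, the hashed class name is a made-up example of framework-generated CSS:

```python
from bs4 import BeautifulSoup

# "css-k2j9x1" stands in for an auto-generated class that changes per build.
html = '<span class="css-k2j9x1" data-testid="product-price">$89</span>'
soup = BeautifulSoup(html, "html.parser")

# Fragile: tied to a class name the build tool will regenerate.
fragile = soup.select_one(".css-k2j9x1")

# Sturdier: data-testid is meant for tooling and tends to survive redesigns.
stable = soup.select_one('[data-testid="product-price"]')

print(stable.get_text(strip=True))
```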
Use cases
- Extracting product names, prices, ratings, and links from e-commerce pages.
- Pulling article titles, author names, and publish dates from news or blog pages.
- Reading tables, listings, and pagination controls from server-rendered pages.
- Scraping JavaScript-rendered content after the browser builds the final DOM.
- Validating whether the content you want is present in raw HTML or only appears after client-side rendering.
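For the server-rendered cases above, tables and pagination controls are usually plain markup, so extraction stays simple. A minimal sketch (the listing HTML is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<table id="listings">
  <tr><th>City</th><th>Rent</th></tr>
  <tr><td>Berlin</td><td>1200</td></tr>
  <tr><td>Lisbon</td><td>950</td></tr>
</table>
<a class="next" href="/listings?page=2">Next</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Skip the header row, then read each cell's text.
rows = [
    [td.get_text(strip=True) for td in tr.select("td")]
    for tr in soup.select("#listings tr")[1:]
]
next_url = soup.select_one("a.next")["href"]  # follow this link to paginate

print(rows)      # [['Berlin', '1200'], ['Lisbon', '950']]
print(next_url)  # /listings?page=2
```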