Examples
A scraper usually reads the DOM after the page loads, then selects the parts it cares about.
```python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div class="product">
      <h2>Running Shoes</h2>
      <span class="price">$89</span>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
name = soup.select_one(".product h2").get_text(strip=True)
price = soup.select_one(".price").get_text(strip=True)
print({"name": name, "price": price})
```
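The same pattern extends to multiple elements: `select` returns every match, so one pass can build a list of records. A minimal sketch reusing the snippet above (the second product card is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2>Running Shoes</h2><span class="price">$89</span>
</div>
<div class="product">
  <h2>Trail Shoes</h2><span class="price">$119</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
products = [
    {
        # Scope each lookup to the card so names and prices stay paired.
        "name": card.select_one("h2").get_text(strip=True),
        "price": card.select_one(".price").get_text(strip=True),
    }
    for card in soup.select(".product")  # one dict per product card
]
print(products)
```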
On JavaScript-heavy pages, the DOM you get after rendering can be very different from the raw HTML response.
```python
# Playwright example: wait for the rendered DOM
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    page.wait_for_selector(".product")  # wait for the data, not just page load
    products = page.locator(".product").all_text_contents()
    print(products)
    browser.close()
```
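A quick way to see the raw-vs-rendered gap without launching a browser is to run the same selector against both the initial HTML and the final DOM. A contrived sketch, where both snippets are made-up stand-ins for a real HTTP response and a real rendered page:

```python
from bs4 import BeautifulSoup

# What an HTTP client might receive: an empty shell plus a script tag.
raw_html = '<div id="app"></div><script src="/bundle.js"></script>'

# What the browser's DOM might look like after the script runs.
rendered_html = '<div id="app"><span class="price">$89</span></div>'

raw_price = BeautifulSoup(raw_html, "html.parser").select_one(".price")
rendered_price = BeautifulSoup(rendered_html, "html.parser").select_one(".price")

print(raw_price)  # None: the data was never in the HTML response
print(rendered_price.get_text(strip=True))  # "$89": exists only after rendering
```

If the selector already matches in the raw response, plain HTTP fetching is enough; if it only matches in the rendered DOM, you need a browser (or the underlying API call).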
Practical tips
- Raw HTML and DOM are not always the same: if a site renders content with JavaScript, your HTTP client may miss data that appears fine in the browser.
- Selectors break: classes, nesting, and generated IDs change all the time, so DOM-based scraping works until it suddenly doesn't.
- Prefer stable attributes when possible: use things like data-testid, semantic tags, URLs, or visible text before grabbing random hashed class names.
- Wait for the right state: in browser automation, don't just wait for page load; wait for the specific DOM element that means the data is actually there.
- Keep extraction logic small: the more your scraper depends on deep DOM structure, the more maintenance you're signing up for.
- Use a rendered scraper only when needed: if the data is already in the initial response or an API call, parsing the DOM is usually slower and more expensive.
- If you're using ScrapeRouter, this is the kind of decision that matters in production: some pages need simple HTTP fetching, others need full rendering, and routing each request cleanly saves a lot of wasted retries and browser time.
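The "prefer stable attributes" tip is easy to see side by side. In this sketch, the hashed class name is a made-up example of framework-generated CSS:

```python
from bs4 import BeautifulSoup

# "css-k2j9x1" stands in for an auto-generated class that changes per build.
html = '<span class="css-k2j9x1" data-testid="product-price">$89</span>'
soup = BeautifulSoup(html, "html.parser")

# Fragile: tied to a class name the build tool will regenerate.
fragile = soup.select_one(".css-k2j9x1")

# Sturdier: data-testid is meant for tooling and tends to survive redesigns.
stable = soup.select_one('[data-testid="product-price"]')

print(stable.get_text(strip=True))
```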
Use cases
- Extracting product names, prices, ratings, and links from e-commerce pages.
- Pulling article titles, author names, and publish dates from news or blog pages.
- Reading tables, listings, and pagination controls from server-rendered pages.
- Scraping JavaScript-rendered content after the browser builds the final DOM.
- Validating whether the content you want is present in raw HTML or only appears after client-side rendering.
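For the server-rendered cases above, tables and pagination controls are usually plain markup, so extraction stays simple. A minimal sketch (the listing HTML is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<table id="listings">
  <tr><th>City</th><th>Rent</th></tr>
  <tr><td>Berlin</td><td>1200</td></tr>
  <tr><td>Lisbon</td><td>950</td></tr>
</table>
<a class="next" href="/listings?page=2">Next</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Skip the header row, then read each cell's text.
rows = [
    [td.get_text(strip=True) for td in tr.select("td")]
    for tr in soup.select("#listings tr")[1:]
]
next_url = soup.select_one("a.next")["href"]  # follow this link to paginate

print(rows)      # [['Berlin', '1200'], ['Lisbon', '950']]
print(next_url)  # /listings?page=2
```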