XPath | ScrapeRouter

XPath is a query language for selecting nodes inside XML or HTML documents using path-like expressions. In scraping, it’s mainly used to target elements precisely without looping through the whole DOM by hand, though brittle XPath selectors can break fast when a site’s structure shifts.

Examples

XPath is useful when CSS selectors stop being enough, especially when you need to match text, move up the DOM, or select by position.

from lxml import html

page = html.fromstring("""
<html>
  <body>
    <div class="product">
      <h2>Red Shoes</h2>
      <span class="price">$49</span>
    </div>
    <div class="product">
      <h2>Blue Shoes</h2>
      <span class="price">$59</span>
    </div>
  </body>
</html>
""")

names = page.xpath('//div[@class="product"]/h2/text()')
prices = page.xpath('//div[@class="product"]/span[@class="price"]/text()')

print(names)   # ['Red Shoes', 'Blue Shoes']
print(prices)  # ['$49', '$59']

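# Select the h2 inside the first matching product (XPath positions start at 1, not 0)
page.xpath('(//div[@class="product"])[1]//h2/text()')

# Select by text content. "." tests the element's full text, while text()
# inside contains() only looks at the first text node.
# The sample page has no button, so this returns [] here.
page.xpath('//button[contains(., "Load more")]')

# Select an attribute value (also [] here, since the sample page has no link)
page.xpath('//a[@class="next"]/@href')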
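
The examples above cover text matching and position but not moving up or sideways in the DOM. Here is a minimal sketch of parent, sibling, and ancestor traversal with lxml, reusing the same product markup as an illustrative sample:

```python
from lxml import html

# Illustrative sample markup, shaped like the product cards above.
page = html.fromstring("""
<html>
  <body>
    <div class="product">
      <h2>Red Shoes</h2>
      <span class="price">$49</span>
    </div>
    <div class="product">
      <h2>Blue Shoes</h2>
      <span class="price">$59</span>
    </div>
  </body>
</html>
""")

# Move up: ".." selects the parent of the matched h2.
parent = page.xpath('//h2[text()="Blue Shoes"]/..')[0]

# Move sideways: the price that follows a given product name.
blue_price = page.xpath(
    '//h2[text()="Blue Shoes"]/following-sibling::span[@class="price"]/text()'
)

# Start from the price, climb to the product card, then read its name.
red_name = page.xpath(
    '//span[@class="price" and text()="$49"]'
    '/ancestor::div[@class="product"]/h2/text()'
)

print(parent.get("class"))
print(blue_price)
print(red_name)
```

Starting from a value you can recognise and walking to the container tends to hold up better than counting wrappers from the top of the page.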

If you're scraping through a browser or a scraping API, the XPath itself is just one part. The annoying part is keeping it working when the page gets re-rendered, reordered, or A/B tested.
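
One way to soften that breakage is to try an ordered list of selectors and record which one matched. The sketch below assumes two hypothetical page variants (as in an A/B test); the helper name first_match and the selector list are made up for illustration and are not part of lxml or ScrapeRouter.

```python
from lxml import html


def first_match(tree, selectors):
    """Return (selector, results) for the first XPath that matches anything."""
    for selector in selectors:
        results = tree.xpath(selector)
        if results:
            return selector, results
    return None, []


# Hypothetical selectors, ordered from most specific to most permissive.
PRICE_SELECTORS = [
    '//span[@data-testid="price"]/text()',
    '//span[@class="price"]/text()',
    '//*[contains(@class, "price")]/text()',
]

# Two hypothetical variants of the same product page.
variant_a = html.fromstring(
    '<html><body><div class="product">'
    '<span class="price">$49</span></div></body></html>'
)
variant_b = html.fromstring(
    '<html><body><div class="product">'
    '<p class="price-tag">$49</p></div></body></html>'
)

used_a, prices_a = first_match(variant_a, PRICE_SELECTORS)
used_b, prices_b = first_match(variant_b, PRICE_SELECTORS)

print(used_a, prices_a)
print(used_b, prices_b)
```

Recording which selector matched is the useful part: if the first choice stops matching and a fallback takes over, you find out that the page changed before the data goes missing.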

Practical tips

Prefer stable attributes over deep full-path selectors: //div[3]/section[2]/ul/li[4] works until the site changes one wrapper.
Use XPath when CSS becomes awkward: text matching, parent traversal, positional selection.
Don’t overfit to today’s DOM: if a selector only works on one exact page shape, it will probably fail later.
Test selectors against multiple pages, not just one sample.
If the site is JavaScript-heavy, make sure the HTML is actually rendered before blaming the XPath.
In production, log selector failures separately from request failures. A bad XPath and a blocked request are different problems.
If you’re routing scraping jobs through ScrapeRouter, XPath still matters at the extraction layer: the router helps with fetch stability, but it can’t save a selector that was too fragile to begin with.
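
The tips about testing against several pages and logging selector failures separately could look roughly like this. It is a sketch under assumptions: the logger names, the extract_prices helper, and the sample responses are hypothetical, and the fetch step is replaced by hard-coded status codes and bodies so the example runs offline.

```python
import logging

from lxml import html

# Hypothetical logger names; the point is that they are separate channels.
fetch_log = logging.getLogger("scraper.fetch")
selector_log = logging.getLogger("scraper.selector")

PRICE_XPATH = '//div[@class="product"]/span[@class="price"]/text()'


def extract_prices(status_code, body, url):
    """Return a list of prices, or None when there was no usable response."""
    if status_code != 200 or not body.strip():
        # Request-level problem: blocked, timed out, empty body.
        fetch_log.warning("fetch failed: url=%s status=%s", url, status_code)
        return None
    tree = html.fromstring(body)
    prices = tree.xpath(PRICE_XPATH)
    if not prices:
        # The page arrived, but the selector no longer fits it.
        selector_log.warning(
            "selector matched nothing: url=%s xpath=%s", url, PRICE_XPATH
        )
    return prices


# Hypothetical responses standing in for real fetches: url -> (status, body).
responses = {
    "https://example.com/p/1": (
        200,
        '<html><body><div class="product">'
        '<span class="price">$49</span></div></body></html>',
    ),
    "https://example.com/p/2": (
        200,
        '<html><body><div class="card">'
        '<span class="amount">$49</span></div></body></html>',
    ),
    "https://example.com/p/3": (403, ""),
}

results = {
    url: extract_prices(status, body, url)
    for url, (status, body) in responses.items()
}
print(results)
```

Here None means the request failed and an empty list means the selector did. A block page served with status 200 would still show up as a selector failure, so treat this split as a starting point.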

Use cases

Extracting product names, prices, links, ratings, and stock labels from messy e-commerce pages.
Pulling data from HTML where CSS selectors aren’t enough: matching text, selecting siblings, walking up to parent nodes.
Parsing XML feeds, sitemaps, and structured exports where XPath is the native tool.
Browser automation and testing: locating elements in Selenium or similar tools when the DOM is complicated.
Recovering data from inconsistent markup where you need more control than simple class-based selection gives you.
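
For the XML feed and sitemap case, the usual stumbling block is namespaces. A minimal sketch with lxml, assuming a small hand-written sitemap (the URLs and dates are made up); the "sm" prefix is an arbitrary local alias for the sitemap namespace:

```python
from lxml import etree

# Hand-written sample sitemap; real ones are usually fetched as bytes.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/red-shoes</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/products/blue-shoes</loc>
    <lastmod>2024-02-01</lastmod>
  </url>
</urlset>
"""

root = etree.fromstring(SITEMAP)

# The prefix is a local choice; only the namespace URI has to match the document.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Without a prefix nothing matches, because every element is namespaced.
unprefixed = root.xpath("//loc/text()")

# With the prefix mapped, the same path works.
locs = root.xpath("//sm:url/sm:loc/text()", namespaces=NS)

# Filter on a sibling element: URLs last modified in February 2024.
feb = root.xpath(
    '//sm:url[starts-with(sm:lastmod, "2024-02")]/sm:loc/text()',
    namespaces=NS,
)

print(unprefixed)
print(locs)
print(feb)
```

An empty result from a selector that looks correct is often a missing namespace mapping, not a wrong path.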