Examples
Many sites expose a sitemap at a predictable path:
```shell
curl https://example.com/sitemap.xml
```
A sitemap can contain page URLs directly:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget-a</loc>
    <lastmod>2025-03-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/products/widget-b</loc>
    <lastmod>2025-03-02</lastmod>
  </url>
</urlset>
```
Or it can be a sitemap index that points to more sitemaps:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemaps/products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/blog.xml</loc>
  </sitemap>
</sitemapindex>
```
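The two shapes can be told apart by the root element name. A hedged sketch of that branching; `parse_sitemap` is a name introduced here for illustration, not a standard API:

```python
import xml.etree.ElementTree as ET

# Sitemap elements live in the sitemaps.org namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_bytes):
    """Return ("index", sitemap_urls) or ("urlset", page_urls)."""
    root = ET.fromstring(xml_bytes)
    # Namespaced tags look like "{http://...}sitemapindex".
    if root.tag.endswith("sitemapindex"):
        locs = root.findall("sm:sitemap/sm:loc", NS)
        return "index", [loc.text.strip() for loc in locs]
    locs = root.findall("sm:url/sm:loc", NS)
    return "urlset", [loc.text.strip() for loc in locs]
```

For an index, fetch each returned sitemap URL and feed it back through the same function until only page URLs remain.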
A simple Python example for extracting URLs from a sitemap:

```python
import requests
import xml.etree.ElementTree as ET

# Fetch the sitemap and fail loudly on HTTP errors.
resp = requests.get("https://example.com/sitemap.xml", timeout=30)
resp.raise_for_status()
root = ET.fromstring(resp.content)

# Sitemap elements live in the sitemaps.org namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [node.text for node in root.findall("sm:url/sm:loc", ns)]
print(urls[:10])
```
Practical tips
- Check `robots.txt` first: many sites declare sitemap locations there, often more reliably than guessing `/sitemap.xml`.
- Handle nested sitemap indexes: big sites split sitemaps by section, date, locale, or content type.
- Do not assume sitemap coverage is complete: some sites leave out older pages, faceted URLs, or pages they do not care about for search.
- Use `lastmod` when it exists: it is useful for incremental scraping, but treat it as a hint, not ground truth.
- Filter aggressively before scraping everything: product pages, articles, category pages, image sitemaps, and tag pages often all live in different sitemap files.
- Watch file size and compression: large sitemaps are commonly served as `.xml.gz`.
- If a site has a decent sitemap, use it: parsing one XML file is cheaper and less fragile than building a crawler that has to discover every internal link the hard way.
- In production, keep fallback discovery methods: sitemaps break, move, get partially updated, or silently miss sections.
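The compression tip can be handled with a small stdlib helper that checks for the gzip magic bytes before parsing; `maybe_gunzip` is a hypothetical name, and this is a sketch rather than a complete downloader:

```python
import gzip

def maybe_gunzip(data: bytes) -> bytes:
    """Decompress sitemap bytes if they are gzip-compressed (.xml.gz)."""
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        return gzip.decompress(data)
    return data
```

Drop this between the download and the XML parse; it is a no-op for plain `.xml` responses.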
A quick pattern for discovering sitemap URLs from robots.txt:
```python
import requests

robots = requests.get("https://example.com/robots.txt", timeout=30).text
sitemaps = []
for line in robots.splitlines():
    if line.lower().startswith("sitemap:"):
        # The value after "Sitemap:" is an absolute URL.
        sitemaps.append(line.split(":", 1)[1].strip())
print(sitemaps)
```
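When `robots.txt` lists nothing, it helps to fall back to a few conventional paths. A sketch of one discovery order; `candidate_sitemaps` and the fallback path list are assumptions for illustration, not a standard:

```python
def candidate_sitemaps(base_url, robots_sitemaps):
    """Order sitemap candidates: robots.txt entries first, then common paths."""
    fallbacks = [base_url.rstrip("/") + path
                 for path in ("/sitemap.xml", "/sitemap_index.xml")]
    seen, ordered = set(), []
    # Deduplicate while preserving priority order.
    for url in list(robots_sitemaps) + fallbacks:
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered
```

Try each candidate in order and stop at the first one that returns valid sitemap XML.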
Use cases
- Full-site URL discovery: useful for e-commerce catalogs, news archives, travel listings, directory sites.
- Incremental scraping: re-crawl URLs whose `lastmod` changed instead of re-scraping the whole site.
- Section-based scraping: target product, blog, location, or category sitemaps separately so jobs stay smaller and easier to debug.
- Pagination avoidance: skip brittle click-through crawling when the sitemap already exposes the final page URLs.
- Coverage checks: compare sitemap URLs against URLs you discovered from internal links to see what the site exposes to crawlers versus what exists in practice.
- Router-based scraping pipelines: use the sitemap as the discovery layer, then send the extracted URLs through ScrapeRouter for the actual page fetches when anti-bot protection, JS rendering, or proxy routing starts mattering.
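The incremental use case above can be sketched as a `lastmod` filter. This assumes W3C-style date strings (a `YYYY-MM-DD` prefix), so plain string comparison works; `changed_since` is a hypothetical helper, and URLs without a `lastmod` are kept so nothing is silently skipped:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def changed_since(xml_bytes, cutoff):
    """Return page URLs whose lastmod is on or after cutoff (YYYY-MM-DD)."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # No lastmod means we cannot tell, so re-scrape to be safe.
        if lastmod is None or lastmod[:10] >= cutoff:
            urls.append(loc)
    return urls
```

Treat the result as a candidate set, not ground truth: as noted above, `lastmod` is a hint, and some sites update it inconsistently.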