Glossary

Sitemap.xml

Sitemap.xml is an XML file that lists URLs a site wants crawlers to find, often with metadata like last modified date, update frequency, or priority. For scraping, it is one of the simplest ways to discover pages at scale without clicking through navigation, category trees, or endless pagination.

Examples

Many sites expose a sitemap at a predictable path:

curl https://example.com/sitemap.xml

A sitemap can contain page URLs directly:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget-a</loc>
    <lastmod>2025-03-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/products/widget-b</loc>
    <lastmod>2025-03-02</lastmod>
  </url>
</urlset>

Or it can be a sitemap index that points to more sitemaps:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemaps/products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/blog.xml</loc>
  </sitemap>
</sitemapindex>

A simple Python example for extracting URLs from a sitemap:

import requests
import xml.etree.ElementTree as ET

# Fetch and parse the sitemap.
xml = requests.get("https://example.com/sitemap.xml", timeout=30).text
root = ET.fromstring(xml)

# The sitemap namespace must be mapped for findall to match the elements.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

urls = [node.text for node in root.findall("sm:url/sm:loc", ns)]
print(urls[:10])
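The snippet above only handles a flat urlset. Sitemap indexes can be covered with a small recursive helper; the sketch below separates traversal from fetching by taking a fetch callable, so the HTTP client (and any retry or proxy logic) stays pluggable. The function name and structure are illustrative, not a standard API:

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for the sitemaps.org namespace.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(xml_text, fetch):
    """Return page URLs from a sitemap, following nested sitemap indexes.

    fetch is a callable(url) -> XML text, so the traversal logic stays
    independent of the HTTP client.
    """
    root = ET.fromstring(xml_text)
    urls = []
    if root.tag == f"{NS}sitemapindex":
        # Index file: recurse into each child sitemap.
        for loc in root.findall(f"{NS}sitemap/{NS}loc"):
            urls.extend(extract_urls(fetch(loc.text), fetch))
    else:
        # Regular urlset: collect the page URLs directly.
        urls.extend(loc.text for loc in root.findall(f"{NS}url/{NS}loc"))
    return urls
```

With requests, a fetcher is just `lambda u: requests.get(u, timeout=30).text`; in production you would also want depth limits and deduplication, since nothing in the spec stops an index from referencing itself.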

Practical tips

  • Check robots.txt first: many sites declare sitemap locations there, often more reliably than guessing /sitemap.xml.
  • Handle nested sitemap indexes: big sites split sitemaps by section, date, locale, or content type.
  • Do not assume sitemap coverage is complete: some sites leave out older pages, faceted URLs, or pages they do not care about for search.
  • Use lastmod when it exists: it is useful for incremental scraping, but treat it as a hint, not ground truth.
  • Filter aggressively before scraping everything: product pages, articles, category pages, image sitemaps, and tag pages often all live in different sitemap files.
  • Watch file size and compression: large sitemaps are commonly served as .xml.gz.
  • If a site has a decent sitemap, use it: parsing one XML file is cheaper and less fragile than building a crawler that has to discover every internal link the hard way.
  • In production, keep fallback discovery methods: sitemaps break, move, get partially updated, or silently miss sections.
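On the compression tip: HTTP clients like requests transparently decode Content-Encoding gzip, but a sitemap served as a literal .xml.gz file usually arrives as raw gzip bytes that you must decompress yourself. A minimal sketch (the function name is illustrative; the magic-byte check covers servers that serve gzip without a .gz extension):

```python
import gzip

def decode_sitemap(content: bytes, url: str) -> str:
    """Decompress .xml.gz sitemap payloads; pass plain XML through."""
    # gzip streams start with the magic bytes 0x1f 0x8b.
    if url.endswith(".gz") or content[:2] == b"\x1f\x8b":
        return gzip.decompress(content).decode("utf-8")
    return content.decode("utf-8")
```

Call it on `response.content` (the raw bytes) rather than `response.text`, so requests does not guess a text encoding for binary gzip data.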

A quick pattern for discovering sitemap URLs from robots.txt:

import requests

robots = requests.get("https://example.com/robots.txt", timeout=30).text
sitemaps = []
for line in robots.splitlines():
    if line.lower().startswith("sitemap:"):
        sitemaps.append(line.split(":", 1)[1].strip())

print(sitemaps)

Use cases

  • Full-site URL discovery: useful for e-commerce catalogs, news archives, travel listings, directory sites.
  • Incremental scraping: re-crawl URLs whose lastmod changed instead of re-scraping the whole site.
  • Section-based scraping: target product, blog, location, or category sitemaps separately so jobs stay smaller and easier to debug.
  • Pagination avoidance: skip brittle click-through crawling when the sitemap already exposes the final page URLs.
  • Coverage checks: compare sitemap URLs against URLs you discovered from internal links to see what the site exposes to crawlers versus what exists in practice.
  • Router-based scraping pipelines: use the sitemap as the discovery layer, then send the extracted URLs through ScrapeRouter for the actual page fetches when anti-bot protection, JS rendering, or proxy routing starts mattering.
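The incremental-scraping use case can be sketched as a lastmod filter. This is a sketch under two assumptions: lastmod may be a bare date or a full W3C datetime (hence the `[:10]` slice), and entries without lastmod are kept rather than silently skipped, since the field is optional and only a hint:

```python
import xml.etree.ElementTree as ET
from datetime import date

# Clark-notation prefix for the sitemaps.org namespace.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def changed_since(xml_text, cutoff):
    """Yield (url, lastmod) pairs with lastmod on or after cutoff.

    Entries without a lastmod are included, because omitting the tag
    says nothing about whether the page changed.
    """
    root = ET.fromstring(xml_text)
    for entry in root.findall(f"{NS}url"):
        loc = entry.findtext(f"{NS}loc")
        lastmod = entry.findtext(f"{NS}lastmod")
        # lastmod may be "2025-03-01" or "2025-03-01T12:00:00+00:00";
        # the first 10 characters are the date either way.
        if lastmod is None or date.fromisoformat(lastmod[:10]) >= cutoff:
            yield loc, lastmod
```

Feeding it the urlset example above with a cutoff of 2025-03-02 would keep only widget-b; in practice the cutoff is the timestamp of your last successful crawl.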

Related terms

robots.txt, Web Crawler, URL Discovery, Pagination, Internal Links, Canonical URL, XML Parsing