Glossary

RSS

RSS is an XML-based feed format websites use to publish new content in a structured, machine-readable way. For scraping, it’s the easy path when a site gives you one: fewer moving parts, less breakage, and no need to render pages just to detect what changed.

Examples

Many publishers, blogs, forums, and changelog pages expose an RSS feed that gives you titles, links, publish dates, and sometimes full content. If the feed has what you need, use it instead of scraping HTML.

curl https://example.com/feed.xml

You can also fetch and parse a feed through ScrapeRouter if you want one pipeline for both feeds and normal pages.

curl -X POST "https://www.scraperouter.com/api/v1/scrape/" \
  -H "Authorization: Api-Key $api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/feed.xml"
  }'
Or parse a feed directly with the Python standard library:

import requests
import xml.etree.ElementTree as ET

# Fetch the raw feed XML (assumes an RSS 2.0 feed with <item> elements).
resp = requests.get("https://example.com/feed.xml", timeout=30)
resp.raise_for_status()
root = ET.fromstring(resp.text)

for item in root.findall(".//item"):
    title = item.findtext("title")
    link = item.findtext("link")
    pub_date = item.findtext("pubDate")  # RFC 822 date string in RSS 2.0
    print(title, link, pub_date)
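Atom feeds put every element in a namespace, so the same `.//item` lookup finds nothing there. A minimal namespace-aware sketch, assuming a standard Atom feed (the `parse_atom` name is illustrative):

```python
import xml.etree.ElementTree as ET

# Atom elements live in this namespace; bare tag names won't match.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def parse_atom(xml_text):
    """Extract title/link/updated from each <entry> in an Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link_el = entry.find("atom:link", ATOM_NS)
        entries.append({
            "title": entry.findtext("atom:title", namespaces=ATOM_NS),
            # In Atom the URL is in the href attribute, not the element text.
            "link": link_el.get("href") if link_el is not None else None,
            "updated": entry.findtext("atom:updated", namespaces=ATOM_NS),
        })
    return entries
```

In practice you detect the root element (`<rss>` vs `<feed>`) and dispatch to the matching parser.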

Practical tips

  • Check for feeds before building a full scraper: /feed, /rss, /atom.xml, page source <link rel="alternate">, CMS defaults.
  • Don’t assume feed completeness: some feeds only include summaries, limited history, or delayed updates.
  • Treat feed items as change detection, not perfect ground truth: use GUIDs, links, and publish timestamps carefully.
  • Watch for format differences: RSS 2.0, Atom, namespaces, CDATA, malformed XML.
  • If a site has both HTML and RSS, use RSS for discovery and HTML scraping only when you need extra fields.
  • In production, feeds are cheaper to poll than full pages: less bandwidth, fewer parser failures, less browser usage.
  • Still handle failures: feeds disappear, get truncated, or silently stop updating.

Use cases

  • Content monitoring: track new blog posts, newsroom updates, job listings, podcast episodes.
  • Cheap change detection: poll feeds to find new URLs, then scrape only those pages.
  • Aggregation: combine updates from many publishers without maintaining a custom scraper per site.
  • Backfill-light workflows: ingest recent items fast when you don’t need the whole site history.
  • Hybrid pipelines: RSS for discovery, HTML scraping for author pages, full article text, metadata, or assets.
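The change-detection and hybrid use cases above reduce to one small dedup step: parse the feed, skip items you've already seen, and hand only the fresh URLs to the HTML scraper. A sketch assuming an RSS 2.0 feed; the `new_items` helper and field names are illustrative:

```python
import xml.etree.ElementTree as ET

def new_items(feed_xml, seen_ids):
    """Return feed items not seen in previous polls.

    Dedupes on <guid> when present, falling back to <link>; updates
    seen_ids in place so repeated polls only surface new entries.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.findall(".//item"):
        uid = item.findtext("guid") or item.findtext("link")
        if not uid or uid in seen_ids:
            continue
        seen_ids.add(uid)
        fresh.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pub_date": item.findtext("pubDate"),
        })
    return fresh
```

In a real pipeline `seen_ids` would be persisted (a database table or key-value store), and each fresh item's link becomes an input to the HTML scraper.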

Related terms

Atom · XML · HTML Parsing · Web Scraping · Crawler · Sitemap