Glossary

ETL

ETL stands for extract, transform, load: pull data from a source, clean or reshape it, then write it somewhere useful like a database, warehouse, or queue. In scraping, this is usually the part after the request succeeds, where raw HTML or API responses get turned into structured data that downstream systems can actually use.

Examples

A basic scraping ETL flow looks like this:

  • Extract: fetch product pages or API responses
  • Transform: parse fields, normalize prices, clean dates, dedupe records
  • Load: write the final rows into Postgres, S3, BigQuery, or a message queue

In Python, a minimal version of this flow looks like:

import requests
from bs4 import BeautifulSoup

url = "https://example.com/product/123"
html = requests.get(url, timeout=30).text

# extract + transform
soup = BeautifulSoup(html, "html.parser")
# note: select_one returns None if a selector stops matching; a real
# pipeline should handle that instead of crashing on .get_text
name = soup.select_one("h1").get_text(strip=True)
price_text = soup.select_one(".price").get_text(strip=True)
price = float(price_text.replace("$", "").replace(",", ""))

record = {
    "url": url,
    "name": name,
    "price": price,
}

# load
print(record)
# in real life: insert into a DB, send to S3, publish to Kafka, etc.

With scraping, the annoying part is that ETL often breaks in the transform step, not the extract step. You got the page, but the selector changed, the price format shifted, or the site started mixing junk into the response.
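One way to harden that transform step is to make parsers tolerant and explicit about failure. Below is a minimal sketch of a price parser that returns None instead of crashing; the formats it handles (dollar signs, thousands separators, European decimal commas) are illustrative assumptions, not a complete list:

```python
import re

def parse_price(text):
    """Return a float price, or None if the text is unparseable."""
    if not text:
        return None
    # strip currency symbols and whitespace, keep digits and separators
    cleaned = re.sub(r"[^\d.,]", "", text)
    if not cleaned:
        return None
    if re.search(r",\d{2}$", cleaned):
        # trailing ",dd" looks like a decimal comma (European style)
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Returning None instead of raising lets the pipeline log and skip one bad record rather than dying mid-run.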

Practical tips

  • Keep extract and transform separate. If parsing breaks, you want the raw response saved somewhere so you can reprocess it without hitting the site again.
  • Treat data cleaning as production code, not glue code: add validation, logging, and version your parsers.
  • Expect messy inputs: missing fields, duplicate pages, locale-specific numbers, weird timestamps, partial renders.
  • Load idempotently when you can: use stable IDs, upserts, or dedupe keys so reruns do not create garbage.
  • If you're scraping at scale, the pipeline usually looks like this: browser or HTTP fetch, parsing, normalization, storage, retry handling.
  • ScrapeRouter mostly helps on the extract side: getting the page reliably. You still need to decide how to structure, validate, and load the data after that.
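The idempotent-load tip above can be sketched with an upsert keyed on a stable ID. This example uses SQLite for self-containedness; the `products` table and its columns are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        product_id TEXT PRIMARY KEY,  -- stable ID doubles as the dedupe key
        name TEXT,
        price REAL
    )
""")

def load(record):
    # upsert: reruns overwrite the existing row instead of duplicating it
    conn.execute(
        """
        INSERT INTO products (product_id, name, price)
        VALUES (:product_id, :name, :price)
        ON CONFLICT (product_id) DO UPDATE SET
            name = excluded.name,
            price = excluded.price
        """,
        record,
    )

record = {"product_id": "123", "name": "Widget", "price": 19.99}
load(record)
load(record)  # rerun of the same job: still one row
```

The same idea carries over to Postgres (`ON CONFLICT`), BigQuery (`MERGE`), or a dedupe key on a message queue.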
A quick way to sanity check transformed output before loading is to eyeball it from the command line, for example by piping JSON output through jq:

# sanity check transformed output before loading
python etl_job.py | jq .

Inside the pipeline itself, validate each record before it reaches storage:

record = {
    "product_id": "123",
    "price": 19.99,
    "currency": "USD",
    "scraped_at": "2026-03-25T12:00:00Z"
}

required = ["product_id", "price", "currency"]
for field in required:
    if field not in record or record[field] in (None, ""):
        raise ValueError(f"Missing required field: {field}")

Use cases

  • Turning scraped product pages into clean catalog data for search, pricing, or inventory systems
  • Collecting job listings, normalizing fields like title, location, salary, and loading them into a database
  • Pulling article pages, extracting metadata and content, then loading them into an indexing pipeline
  • Ingesting data from multiple sources into one schema so downstream analytics are not dealing with raw site-specific mess
  • Reprocessing previously saved HTML when parsing logic changes, instead of re-scraping everything
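That last use case only works if the extract step persists raw responses somewhere. A minimal sketch of the pattern, where the temp directory, hash-based filenames, and regex-based parser are all illustrative assumptions (in real life the raw store might be a fixed directory or an S3 bucket, and the parser would be real parsing code):

```python
import hashlib
import pathlib
import re
import tempfile

RAW_DIR = pathlib.Path(tempfile.mkdtemp())  # stand-in for a durable raw store

def save_raw(url: str, html: str) -> pathlib.Path:
    # name the file by a hash of the URL so re-fetches overwrite, not duplicate
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    path = RAW_DIR / f"{key}.html"
    path.write_text(html, encoding="utf-8")
    return path

def parse(html: str) -> dict:
    # when selectors change, edit this function and re-run over RAW_DIR
    name = re.search(r"<h1>(.*?)</h1>", html).group(1)
    return {"name": name}

# extract once...
save_raw("https://example.com/product/123", "<h1>Widget</h1>")

# ...then reprocess everything later without touching the site again
records = [parse(p.read_text(encoding="utf-8")) for p in RAW_DIR.glob("*.html")]
```

Separating the fetch from the parse this way is what makes "fix the parser, re-run the transform" cheap.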

Related terms

  • Web Scraping
  • Data Pipeline
  • Parser
  • Structured Data
  • Data Extraction
  • JSON