Examples
A few places OCR shows up in scraping:
- Extracting product names or prices from image-based listings
- Reading text from scanned PDF reports
- Pulling labels from screenshots when a site renders data into a canvas instead of normal DOM text
For example, sending a scanned PDF to an OCR-capable scraping API might look like this:

```python
import requests

api_key = "YOUR_API_KEY"  # replace with your actual key

url = "https://www.scraperouter.com/api/v1/scrape/"
headers = {
    "Authorization": f"Api-Key {api_key}",
    "Content-Type": "application/json",
}
payload = {
    "url": "https://example.com/report.pdf",
    "format": "text",
}

response = requests.post(url, headers=headers, json=payload)
print(response.text)
```
If the page puts text in the HTML, don't use OCR. Just parse the DOM. OCR is slower, noisier, and easier to get wrong.
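As a sketch of that DOM-first approach, here is a minimal text extractor built on Python's standard-library `html.parser`. The class name and the sample HTML are illustrative, not from any specific site:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script/style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

html = '<div><span class="price">$19.99</span><script>var x=1;</script></div>'
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # → $19.99
```

In practice you would reach for BeautifulSoup or lxml, but the point stands: when the text is already in the markup, parsing it out directly is faster and far more reliable than OCR.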
Practical tips
- Use OCR only when you have to: scanned PDFs, screenshots, canvas-rendered text, image-heavy pages
- Expect errors: confusing 0 and O, 1 and I, broken table structure, dropped punctuation
- Clean the image first if you control the pipeline: crop tightly, increase contrast, remove noise
- Validate extracted text against known patterns: dates, SKUs, totals, currency formats
- Keep a fallback path: HTML parsing first, OCR second
- Watch cost and latency: OCR adds both, especially at volume
- For production workflows, save the original image or PDF so you can debug bad reads later
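The validation and error-cleanup tips above can be sketched with plain regular expressions. The patterns and the O/I digit normalization below are illustrative assumptions, not a complete ruleset:

```python
import re

# Illustrative patterns -- adjust to your own data formats.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")              # e.g. 2024-03-15
CURRENCY_RE = re.compile(r"^\$\d{1,3}(,\d{3})*\.\d{2}$")  # e.g. $1,299.00

def normalize_digits(token: str) -> str:
    """Fix common OCR confusions (O -> 0, I/l -> 1) in tokens
    that should be numeric, like totals or SKUs."""
    return token.translate(str.maketrans("OoIl", "0011"))

def validate(token: str, pattern: re.Pattern) -> bool:
    return bool(pattern.match(token))

raw_total = "$1,O45.0O"  # OCR misread two zeros as capital O
cleaned = normalize_digits(raw_total)
print(cleaned, validate(cleaned, CURRENCY_RE))  # → $1,045.00 True
```

Only apply this kind of normalization to fields you know are numeric; rewriting O to 0 inside free text would corrupt it.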
Use cases
- Invoice and receipt scraping: many documents are scanned, so there is no usable HTML or embedded text layer
- PDF data extraction: annual reports, shipping documents, government filings, and forms often need OCR before you can parse anything useful
- Canvas-based sites: some sites render visible text into images or canvas elements to make scraping more annoying
- Screenshot pipelines: if your workflow captures a rendered page as an image, OCR is how you get text back out
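The fallback idea that runs through these use cases (HTML parsing first, OCR only when the DOM has no usable text) can be sketched as a small dispatcher. Here `ocr_fn` is a stand-in for whatever OCR backend you use (pytesseract, a hosted API, etc.), and the 20-character threshold is an arbitrary assumption:

```python
import re

def extract_text(html: str, image_bytes: bytes, ocr_fn, min_chars: int = 20):
    """Try the DOM first; fall back to OCR only if the stripped
    HTML yields too little text to be useful."""
    # Crude tag strip for the sketch -- use a real parser in production.
    dom_text = re.sub(r"<[^>]+>", " ", html)
    dom_text = " ".join(dom_text.split())
    if len(dom_text) >= min_chars:
        return dom_text, "dom"
    # Not enough DOM text: the page likely renders into canvas/images.
    return ocr_fn(image_bytes), "ocr"

# Stub OCR backend for demonstration.
fake_ocr = lambda img: "Total: $42.00"

text, source = extract_text("<canvas></canvas>", b"...", fake_ocr)
print(source)  # → ocr
```

Injecting the OCR step as a function keeps the expensive path swappable and easy to stub out in tests.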
OCR is one of those things that demos well and gets messy fast in production. It works, but you need to treat it like a fallback tool, not a clean substitute for structured data.