Examples
A few places OCR shows up in scraping:
- Extracting product names or prices from image-based listings
- Reading text from scanned PDF reports
- Pulling labels from screenshots when a site renders data into a canvas instead of normal DOM text
For example, sending a scanned PDF to an OCR-capable scraping API might look like this:

```python
import requests

api_key = "YOUR_API_KEY"  # replace with your actual key

url = "https://www.scraperouter.com/api/v1/scrape/"
headers = {
    "Authorization": f"Api-Key {api_key}",
    "Content-Type": "application/json",
}
payload = {
    "url": "https://example.com/report.pdf",
    "format": "text",
}

response = requests.post(url, headers=headers, json=payload)
print(response.text)
```
If the page puts text in the HTML, don't use OCR. Just parse the DOM. OCR is slower, noisier, and easier to get wrong.
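As a sketch of that DOM-first approach, here is a minimal text extractor built on Python's standard-library `html.parser`. The class name and the sample HTML are illustrative, not from any specific site:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script/style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

html = '<div><span class="price">$19.99</span><script>var x=1;</script></div>'
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # → $19.99
```

In practice you would reach for BeautifulSoup or lxml, but the point stands: when the text is already in the markup, parsing it out directly is faster and far more reliable than OCR.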
Practical tips
- Use OCR only when you have to: scanned PDFs, screenshots, canvas-rendered text, image-heavy pages
- Expect errors: confusing 0 and O, 1 and I, broken table structure, dropped punctuation
- Clean the image first if you control the pipeline: crop tightly, increase contrast, remove noise
- Validate extracted text against known patterns: dates, SKUs, totals, currency formats
- Keep a fallback path: HTML parsing first, OCR second
- Watch cost and latency: OCR adds both, especially at volume
- For production workflows, save the original image or PDF so you can debug bad reads later
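The validation and error-cleanup tips above can be sketched with plain regular expressions. The patterns and the O/I digit normalization below are illustrative assumptions, not a complete ruleset:

```python
import re

# Illustrative patterns -- adjust to your own data formats.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")              # e.g. 2024-03-15
CURRENCY_RE = re.compile(r"^\$\d{1,3}(,\d{3})*\.\d{2}$")  # e.g. $1,299.00

def normalize_digits(token: str) -> str:
    """Fix common OCR confusions (O -> 0, I/l -> 1) in tokens
    that should be numeric, like totals or SKUs."""
    return token.translate(str.maketrans("OoIl", "0011"))

def validate(token: str, pattern: re.Pattern) -> bool:
    return bool(pattern.match(token))

raw_total = "$1,O45.0O"  # OCR misread two zeros as capital O
cleaned = normalize_digits(raw_total)
print(cleaned, validate(cleaned, CURRENCY_RE))  # → $1,045.00 True
```

Only apply this kind of normalization to fields you know are numeric; rewriting O to 0 inside free text would corrupt it.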
Use cases
- Invoice and receipt scraping: many documents are scanned, so there is no usable HTML or embedded text layer
- PDF data extraction: annual reports, shipping documents, government filings, and forms often need OCR before you can parse anything useful
- Canvas-based sites: some sites render visible text into images or canvas elements to make scraping more annoying
- Screenshot pipelines: if your workflow captures a rendered page as an image, OCR is how you get text back out
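The fallback idea that runs through these use cases (HTML parsing first, OCR only when the DOM has no usable text) can be sketched as a small dispatcher. Here `ocr_fn` is a stand-in for whatever OCR backend you use (pytesseract, a hosted API, etc.), and the 20-character threshold is an arbitrary assumption:

```python
import re

def extract_text(html: str, image_bytes: bytes, ocr_fn, min_chars: int = 20):
    """Try the DOM first; fall back to OCR only if the stripped
    HTML yields too little text to be useful."""
    # Crude tag strip for the sketch -- use a real parser in production.
    dom_text = re.sub(r"<[^>]+>", " ", html)
    dom_text = " ".join(dom_text.split())
    if len(dom_text) >= min_chars:
        return dom_text, "dom"
    # Not enough DOM text: the page likely renders into canvas/images.
    return ocr_fn(image_bytes), "ocr"

# Stub OCR backend for demonstration.
fake_ocr = lambda img: "Total: $42.00"

text, source = extract_text("<canvas></canvas>", b"...", fake_ocr)
print(source)  # → ocr
```

Injecting the OCR step as a function keeps the expensive path swappable and easy to stub out in tests.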
OCR is one of those things that demos well and gets messy fast in production. It works, but you need to treat it like a fallback tool, not a clean substitute for structured data.