Glossary

Docker

Docker packages an application and its runtime into a container so it runs the same way on your laptop, CI, and production. For scraping, that matters because browsers, system libraries, fonts, and anti-bot workarounds tend to break in slightly different ways on every machine if you do not pin the environment.

Examples

A basic scraping container might install Python, your scraper code, and the browser dependencies in one image:

FROM python:3.11-slim

RUN apt-get update && apt-get install -y --no-install-recommends \
    wget curl ca-certificates fonts-liberation libnss3 libatk-bridge2.0-0 \
    libxcomposite1 libxdamage1 libxrandr2 libgbm1 libasound2 libgtk-3-0 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python", "scrape.py"]

And then run it like this:

docker build -t my-scraper .
docker run --rm my-scraper

A lot of teams start here, then discover the real problem is not packaging the scraper. It is keeping headless browsers stable, memory use sane, and networking clean once the job volume goes up.

Practical tips

  • Keep images small: start from a slim base image, install only what the scraper actually needs, clean apt caches.
  • Expect browser pain: Chrome and Playwright inside containers are heavier than people think, and memory limits will absolutely bite you.
  • Pin versions: browser, driver, Python packages, system libs. "Works on my machine" is mostly version drift.
  • Separate concerns: scraper code in one layer, dependencies in another, secrets passed at runtime instead of baked into the image.
  • Watch resource limits: CPU throttling, low shared memory, and container memory caps cause weird browser crashes.
  • Use Docker when you need reproducibility: not because containers are trendy, but because scraping stacks are fragile.
  • Do not overdo it: for a tiny requests-only script running on one box, Docker may be useful but not essential.
  • If you are using ScrapeRouter, Docker becomes less critical for browser management itself because you are not the one shipping and maintaining the browser stack. You still might use it for your app, workers, and deployment pipeline.
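The "pin versions" and "secrets at runtime" tips can be sketched like this (the package names, versions, tag, and env file are illustrative, not recommendations):

```shell
# requirements.txt pins exact versions so rebuilds are reproducible, e.g.:
#   playwright==1.42.0
#   httpx==0.27.0

# Build once, tag with something traceable instead of just "latest":
docker build -t my-scraper:2024-05-01 .

# Secrets arrive at runtime via an env file, never baked into the image:
docker run --rm --env-file ./scraper.env my-scraper:2024-05-01
```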

A common production run command for browser-heavy jobs looks more like this:

docker run --rm \
  --shm-size=1g \
  --memory=2g \
  --cpus=1.5 \
  my-scraper
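The --shm-size flag matters because Chromium uses /dev/shm for shared memory, and the container default is often too small. The resource limits also interact with how the browser itself is launched. A minimal sketch, assuming Playwright drives Chromium inside the container; the flag list is a common starting point, not a universal recommendation:

```python
def chromium_container_args():
    """Chromium launch flags commonly needed when running headless in a container."""
    return [
        "--disable-dev-shm-usage",  # use /tmp for shared memory instead of the small /dev/shm
        "--no-sandbox",             # Chromium's sandbox usually needs privileges containers lack
        "--disable-gpu",            # containers rarely expose a usable GPU
    ]
```

With Playwright this would be passed as chromium.launch(args=chromium_container_args()). If you raise --shm-size at run time as shown above, --disable-dev-shm-usage becomes unnecessary.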

Use cases

  • Packaging a scraper so it runs the same in local development, CI, and production.
  • Running scheduled scraping workers in Kubernetes, ECS, Nomad, or plain Docker hosts.
  • Shipping browser automation with the right libraries, fonts, and certificates already installed.
  • Isolating multiple scraping jobs with different dependency stacks.
  • Reproducing bugs: if a target site only breaks in production, a container makes that environment easier to debug.
  • Standardizing team workflows: one command to build, one command to run, less "what version are you on?" nonsense.
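For the scheduled-worker use case on a plain Docker host, the run command differs depending on whether the job is one-shot or long-lived (names and schedule are illustrative):

```shell
# One-shot scheduled run, e.g. hourly from cron:
# 0 * * * * docker run --rm --env-file /etc/scraper.env my-scraper

# Long-running worker that restarts after browser crashes:
docker run -d --restart unless-stopped --shm-size=1g --memory=2g my-scraper
```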

Related terms

  • Headless Browser
  • Playwright
  • Proxy Rotation
  • Rate Limiting
  • Anti-Bot
  • Web Scraping API