Glossary

HAR

HAR stands for HTTP Archive, a JSON file format that records everything a browser loaded for a page: requests, responses, headers, timings, redirects, and more. In scraping, HAR files are useful because they show what the site actually does in the browser, which is often the fastest way to find hidden APIs, auth flows, and the request sequence you need to reproduce.

Examples

A HAR file is mostly useful during debugging and reverse engineering. You open a page in Chrome DevTools, export the network log, and inspect the requests that returned the data you actually care about.

{
  "log": {
    "entries": [
      {
        "request": {
          "method": "GET",
          "url": "https://example.com/api/products?page=1"
        },
        "response": {
          "status": 200,
          "content": {
            "mimeType": "application/json"
          }
        },
        "time": 184.32
      }
    ]
  }
}
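Exported HAR files follow this same log.entries shape, so a short script can filter a capture down to the requests that matter. A minimal sketch in Python (the filtering criterion, "response looks like JSON", is just one reasonable starting point):

```python
import json

def json_entries(har):
    """Yield (method, url, status) for HAR entries whose response is JSON."""
    for entry in har["log"]["entries"]:
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        if "json" in mime:
            req = entry["request"]
            yield req["method"], req["url"], entry["response"]["status"]

# With an export saved from DevTools (file name is illustrative):
# with open("capture.har") as f:
#     for method, url, status in json_entries(json.load(f)):
#         print(method, url, status)
```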

Once you spot the real data endpoint in the HAR, you can often replay it directly instead of scraping rendered HTML.

curl 'https://example.com/api/products?page=1' \
  -H 'accept: application/json' \
  -H 'x-requested-with: XMLHttpRequest' \
  -H 'cookie: session=abc123'
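The same replay can be sketched in Python with the standard library. The header and cookie values here are placeholders copied from the HAR entry, not real credentials; in practice you would substitute whatever your own capture shows:

```python
import urllib.request

# Headers copied from the HAR entry; cookie value is a placeholder.
req = urllib.request.Request(
    "https://example.com/api/products?page=1",
    headers={
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
        "Cookie": "session=abc123",
    },
)

# Uncomment to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     data = resp.read()
```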

Practical tips

  • Filter noise fast: look for fetch and XHR requests first, then GraphQL, then JSON responses. Ignore fonts, images, and analytics unless the site is doing something weird.
  • Pay attention to sequence: some requests only work after a config call, CSRF token fetch, or session cookie is set. The HAR shows that order, which matters more than people think.
  • Check headers and payloads: auth tokens, cursor params, locale settings, and client hints are often the difference between 200 and 403.
  • Watch for short-lived values: HAR files can include temporary cookies, bearer tokens, and signed URLs. Good for debugging, bad if you treat them like permanent inputs.
  • Don’t overfit to one capture: one HAR from one browser session is a clue, not the full system. Repeat the flow a few times and compare what changes.
  • Be careful with secrets: HAR exports can contain cookies, auth headers, and personal data. Don’t paste them into tickets or ship them around Slack like it’s nothing.
  • Use HAR to reduce browser usage: if the page data comes from a clean JSON endpoint, you may not need full browser automation in production at all. That saves money and removes a lot of failure modes.
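A quick way to apply the "short-lived values" and "don't overfit" tips together is to capture the same flow twice and diff the request headers of the matching entries: anything that changes between runs (cookies, bearer tokens, signatures) is a value you must obtain at runtime, not hard-code. A minimal sketch, assuming two HAR entries for the same URL:

```python
def changed_headers(entry_a, entry_b):
    """Return header names whose values differ between two captures of the
    same request; these are likely session-scoped or signed values."""
    a = {h["name"].lower(): h["value"] for h in entry_a["request"]["headers"]}
    b = {h["name"].lower(): h["value"] for h in entry_b["request"]["headers"]}
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
```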

Use cases

  • Finding hidden APIs: a page looks heavily rendered, but the actual data comes from /api/search or a GraphQL POST behind the scenes.
  • Reproducing browser requests: you need to copy the exact headers, cookies, and payload shape that made the request work.
  • Debugging blocked scrapers: the browser succeeds, your script gets blocked, and the HAR helps you compare what is missing.
  • Understanding pagination: visible page numbers are often cosmetic; the HAR shows the real cursor, offset, or continuation token the backend actually uses.
  • Reducing maintenance: instead of parsing unstable HTML, you move to the underlying JSON call the frontend already depends on.
  • Validating render necessity: if the HAR shows the data is fetched client-side after load, you can decide whether you need JavaScript rendering or just the backend request.
  • Troubleshooting scraping infrastructure: when using a scraper API or router layer, HAR-style inspection helps separate target-site issues from proxy, header, or session problems.
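The pagination use case can be made concrete: pull the URLs of successive data requests out of the HAR and see which query parameters actually vary between pages. Parameters whose values keep changing are usually the real pagination mechanism. A sketch (endpoint and parameter names are illustrative):

```python
from urllib.parse import urlparse, parse_qs

def varying_params(urls):
    """Given URLs of successive requests to the same endpoint, return the
    query parameters whose values change between requests."""
    seen = {}
    for url in urls:
        for name, values in parse_qs(urlparse(url).query).items():
            seen.setdefault(name, []).extend(values)
    return {name: vals for name, vals in seen.items() if len(set(vals)) > 1}
```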

Related terms

XHR, GraphQL, Headless Browser, CSRF Token, Session Cookie, API Endpoint, Request Headers, Web Scraping