OAuth 2.0 | ScrapeRouter

OAuth 2.0 is an authorization framework that lets an app get limited access to a user’s account or data without handling the user’s password directly. In practice, it’s the thing behind "Sign in with Google" and a lot of API access flows, and it matters in scraping because authenticated sessions often depend on short-lived tokens, redirects, scopes, and refresh logic.

Examples

A typical OAuth 2.0 flow in scraping-adjacent work looks like this:

User logs in with an identity provider like Google
Your app gets an authorization code through a redirect
Your backend exchanges that code for an access token
API requests use that token until it expires
A refresh token may be used to get a new access token without logging in again

curl -X GET "https://api.example.com/private-data" \
  -H "Authorization: Bearer access_token_here"

A token exchange usually happens server-side:

curl -X POST "https://oauth2.example.com/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=authorization_code" \
  -d "code=received_auth_code" \
  -d "client_id=your_client_id" \
  -d "client_secret=your_client_secret" \
  -d "redirect_uri=https://your-app.com/callback"

In Python, calling an API with an OAuth access token is just an authenticated HTTP request:

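import requests

# Placeholder value; in practice this comes from the token endpoint response.
access_token = "access_token_here"
resp = requests.get(
    "https://api.example.com/private-data",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)

# A 401 here usually means the token has expired or been revoked.
print(resp.status_code)
print(resp.text)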

The annoying production part is not the header. It is everything around it: redirects, token expiry, consent screens, refresh logic, CSRF state, and provider-specific quirks.
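To make the redirect and CSRF state part concrete, here is a minimal sketch of building the authorization URL and validating the callback. The authorize endpoint, the helper names `build_authorization_url` and `extract_code`, and the `read` scope are assumptions for illustration; real providers publish their own endpoints and may require extra parameters.

```python
import hmac
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint, for illustration only.
AUTHORIZE_URL = "https://oauth2.example.com/authorize"


def build_authorization_url(client_id, redirect_uri, scope):
    """Return (url, state). The caller stores state in the user's session."""
    state = secrets.token_urlsafe(32)  # unguessable value tied to this login attempt
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    }
    return f"{AUTHORIZE_URL}?{urlencode(params)}", state


def extract_code(callback_url, expected_state):
    """Validate the state from the redirect, then return the authorization code."""
    query = parse_qs(urlparse(callback_url).query)
    if "error" in query:
        raise ValueError(f"authorization failed: {query['error'][0]}")
    returned_state = query.get("state", [""])[0]
    # Constant-time comparison; a mismatch means this redirect did not
    # originate from a login attempt we started.
    if not hmac.compare_digest(returned_state.encode(), expected_state.encode()):
        raise ValueError("state mismatch, possible CSRF")
    if "code" not in query:
        raise ValueError("no authorization code in callback")
    return query["code"][0]
```

The returned code is what the server-side token exchange shown earlier sends as `code`. Rejecting the callback before touching the code is the point of the check: a code that arrives with the wrong state should never reach the token endpoint.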

Practical tips

Don’t treat OAuth 2.0 as a login shortcut. It is an authorization system, and providers bolt authentication on top of it, usually through OpenID Connect.
Keep the full token lifecycle in mind: access token expiry, refresh token storage, revoked consent, scope changes.
For scraping, check whether you actually need browser automation. If the site exposes an API behind OAuth, using the API is usually cheaper and more stable.
If the target uses OAuth only to create a browser session, the real work may still be session cookies, CSRF tokens, and anti-bot checks after login.
Never hardcode short-lived access tokens into jobs that run for hours. They will die mid-run.
Store refresh tokens carefully and server-side. If you leak them, you effectively leaked long-term access.
Expect provider differences: same spec, different behavior. Google, Microsoft, GitHub, and custom enterprise IdPs all have their own edges.
If you are scraping pages behind OAuth-based login, budget for maintenance. Login flows break more often than static extraction logic.
If you just need public pages rendered behind some app shell, a scraping API like ScrapeRouter can help with the page retrieval side, but it does not replace the OAuth consent and token handling, which remain your responsibility.

A simple refresh pattern looks like this:

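import requests

# Exchange a stored refresh token for a new access token.
resp = requests.post(
    "https://oauth2.example.com/token",
    data={
        "grant_type": "refresh_token",
        "refresh_token": "stored_refresh_token",
        "client_id": "your_client_id",
        "client_secret": "your_client_secret",
    },
    timeout=30,
)
# A 400 with "invalid_grant" usually means the refresh token is no longer valid.
resp.raise_for_status()

# The JSON body carries the new access_token and usually expires_in.
# Some providers also return a rotated refresh_token that replaces the stored one.
print(resp.json())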
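For long-running jobs, one way to avoid tokens dying mid-run is to wrap the refresh call in a small cache that renews shortly before expiry. The sketch below is an assumption-laden illustration, not a library API: `TokenCache`, `refresh_fn`, and the 60-second skew are hypothetical names and values, and `refresh_fn` stands in for whatever function performs the refresh request and returns its JSON as a dict.

```python
import time


class TokenCache:
    """Hands out a valid access token, refreshing shortly before expiry."""

    def __init__(self, refresh_fn, skew_seconds=60, clock=time.time):
        self._refresh_fn = refresh_fn  # callable returning the token response as a dict
        self._skew = skew_seconds      # refresh this many seconds before expiry
        self._clock = clock            # injectable so the logic can be tested
        self._access_token = None
        self._expires_at = 0.0

    def get(self):
        """Return a cached token, calling refresh_fn only when it is near expiry."""
        if self._access_token is None or self._clock() >= self._expires_at - self._skew:
            payload = self._refresh_fn()
            self._access_token = payload["access_token"]
            # expires_in is recommended but not required by the spec;
            # assume a short lifetime (300 s here) when it is absent.
            self._expires_at = self._clock() + float(payload.get("expires_in", 300))
        return self._access_token
```

A job would then call `cache.get()` before each request instead of holding a token string for its whole run. If the provider rotates refresh tokens, `refresh_fn` is also the place to persist the new one.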

Use cases

Accessing a user-authorized API instead of scraping HTML: Google APIs, Microsoft Graph, GitHub data.
Pulling data from internal dashboards where access is delegated through an identity provider.
Building a scraper for customer-owned accounts where the customer explicitly authorizes access.
Maintaining long-running sync jobs that need token refresh instead of repeated interactive logins.
Avoiding password handling entirely when a platform provides a supported OAuth integration.

Where people get burned:

They automate the login popup once, think they’re done, then the token expires in an hour.
They scrape browser pages when the same data is available through a cleaner OAuth-protected API.
They ignore scopes, then wonder why some endpoints work and others return 403.
They forget that consent can be revoked, which means production jobs silently start failing later.
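Several of these failures can be told apart from the response alone. The following is a rough sketch under stated assumptions: `classify_failure` and its return labels are hypothetical, and it relies on the common conventions that a 401 signals an expired or invalid access token, a 403 signals insufficient scope, and an `invalid_grant` error from the token endpoint signals a refresh token that is expired or revoked. Providers deviate from these conventions, so treat the mapping as a starting point.

```python
def classify_failure(status_code, body=None):
    """Map a failed OAuth-related response to a suggested next step."""
    error = (body or {}).get("error")
    if error == "invalid_grant":
        # The refresh token (or code) was rejected: consent was likely
        # revoked or the token expired, so the user must authorize again.
        return "reauthorize"
    if status_code == 401:
        # Access token expired or invalid: refresh once, then retry.
        return "refresh_and_retry"
    if status_code == 403:
        # Token is valid but lacks the scope this endpoint requires.
        return "check_scopes"
    return "inspect"
```

Logging the returned label next to the failing endpoint makes it much easier to see whether a job is dying from expiry, missing scopes, or revoked consent, rather than letting it fail silently.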