Examples
A typical OAuth 2.0 flow in scraping-adjacent work looks like this:
- User logs in with an identity provider like Google
- Your app gets an authorization code through a redirect
- Your backend exchanges that code for an access token
- API requests use that token until it expires
- A refresh token may be used to get a new access token without logging in again
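The redirect step above starts with sending the user to the provider's authorization endpoint. A minimal sketch of building that URL with a CSRF `state` value; the endpoint, client ID, and scope below are placeholders, not any real provider's values:

```python
import secrets
from urllib.parse import urlencode

# Placeholder authorization endpoint for illustration.
AUTHORIZE_URL = "https://oauth2.example.com/authorize"

# Random state ties the callback back to this login attempt (CSRF protection).
state = secrets.token_urlsafe(32)

params = {
    "response_type": "code",          # authorization code flow
    "client_id": "your_client_id",
    "redirect_uri": "https://your-app.com/callback",
    "scope": "read:profile",          # scope names are provider-specific
    "state": state,
}
login_url = f"{AUTHORIZE_URL}?{urlencode(params)}"
print(login_url)
```

Store `state` server-side (or in a signed cookie) so you can compare it against the value the provider sends back to your callback.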
Once you have an access token, an API call is just a bearer-token request:

```shell
curl -X GET "https://api.example.com/private-data" \
  -H "Authorization: Bearer access_token_here"
```
A token exchange usually happens server-side:
```shell
curl -X POST "https://oauth2.example.com/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=authorization_code" \
  -d "code=received_auth_code" \
  -d "client_id=your_client_id" \
  -d "client_secret=your_client_secret" \
  -d "redirect_uri=https://your-app.com/callback"
```
In Python, calling an API with an OAuth access token is just an authenticated HTTP request:
```python
import requests

access_token = "access_token_here"

resp = requests.get(
    "https://api.example.com/private-data",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)
print(resp.status_code)
print(resp.text)
```
The annoying production part is not the header. It is everything around it: redirects, token expiry, consent screens, refresh logic, CSRF state, and provider-specific quirks.
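One of those pieces, CSRF state, is cheap to handle and easy to forget. A sketch of checking it on the callback, assuming you stored the state generated at login time:

```python
import hmac


def validate_state(stored_state: str, returned_state: str) -> bool:
    """Compare the state from the callback against the one saved at login time.

    hmac.compare_digest avoids leaking information through timing differences.
    """
    return hmac.compare_digest(stored_state, returned_state)


# Usage: reject the callback outright if the state does not match.
assert validate_state("abc123", "abc123")
assert not validate_state("abc123", "evil")
```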
Practical tips
- Don’t treat OAuth 2.0 as a login shortcut. It is an authorization framework; providers bolt authentication on top of it (typically via OpenID Connect).
- Keep the full token lifecycle in mind: access token expiry, refresh token storage, revoked consent, scope changes.
- For scraping, check whether you actually need browser automation. If the site exposes an API behind OAuth, using the API is usually cheaper and more stable.
- If the target uses OAuth only to create a browser session, the real work may still be session cookies, CSRF tokens, and anti-bot checks after login.
- Never hardcode short-lived access tokens into jobs that run for hours. They will die mid-run.
- Store refresh tokens carefully and server-side. If you leak them, you effectively leaked long-term access.
- Expect provider differences: same spec, different behavior. Google, Microsoft, GitHub, and custom enterprise IdPs all have their own edges.
- If you are scraping pages behind OAuth-based login, budget for maintenance. Login flows break more often than static extraction logic.
- If you just need public pages rendered behind some app shell, a scraping API like ScrapeRouter can help with the page retrieval side, but it does not replace the OAuth consent and token handling you are still responsible for.
A simple refresh pattern looks like this:
```python
import requests

resp = requests.post(
    "https://oauth2.example.com/token",
    data={
        "grant_type": "refresh_token",
        "refresh_token": "stored_refresh_token",
        "client_id": "your_client_id",
        "client_secret": "your_client_secret",
    },
    timeout=30,
)
print(resp.json())
```
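Token responses usually include an `expires_in` field in seconds. A sketch of turning it into a proactive refresh deadline so long-running jobs refresh before, not after, expiry; the field name and default are assumptions, not guaranteed by every provider:

```python
import time


def refresh_deadline(token_response: dict, safety_margin: int = 60) -> float:
    """Return the wall-clock time at which to refresh, a margin before real expiry."""
    # expires_in is seconds from now; 3600 is an assumed fallback, not a spec value.
    expires_in = int(token_response.get("expires_in", 3600))
    return time.time() + expires_in - safety_margin


# Usage: refresh once time.time() passes this deadline.
deadline = refresh_deadline({"access_token": "abc", "expires_in": 3600})
assert deadline > time.time()
```

If the refresh response contains a new `refresh_token` (rotation), persist it immediately; the old one may be invalidated.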
Use cases
- Accessing a user-authorized API instead of scraping HTML: Google APIs, Microsoft Graph, GitHub data.
- Pulling data from internal dashboards where access is delegated through an identity provider.
- Building a scraper for customer-owned accounts where the customer explicitly authorizes access.
- Maintaining long-running sync jobs that need token refresh instead of repeated interactive logins.
- Avoiding password handling entirely when a platform provides a supported OAuth integration.
Where people get burned:
- They automate the login popup once, think they’re done, then the token expires in an hour.
- They scrape browser pages when the same data is available through a cleaner OAuth-protected API.
- They ignore scopes, then wonder why some endpoints work and others return 403.
- They forget that consent can be revoked, which means production jobs silently start failing later.
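The first of those failure modes, mid-run expiry, is usually handled with a retry-on-401 wrapper around the refresh call shown earlier. A sketch; `refresh_fn` is a stand-in for your own token logic, not a library API:

```python
import requests


def get_with_refresh(url: str, access_token: str, refresh_fn, timeout: int = 30):
    """GET with a bearer token; on 401, refresh once and retry.

    refresh_fn is a caller-supplied function that returns a fresh access token.
    """
    headers = {"Authorization": f"Bearer {access_token}"}
    resp = requests.get(url, headers=headers, timeout=timeout)
    if resp.status_code == 401:
        # Token likely expired mid-run: refresh and retry exactly once.
        new_token = refresh_fn()
        headers = {"Authorization": f"Bearer {new_token}"}
        resp = requests.get(url, headers=headers, timeout=timeout)
    return resp
```

A repeated 401 after refresh usually means revoked consent or a scope problem, not expiry; surface it instead of retrying forever.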