Glossary

Canonical URL

A canonical URL is the preferred URL for a page when the same content is available at multiple URLs. It tells search engines which version should be treated as the main one, so ranking signals do not get split across duplicates, parameterized URLs, or near-identical versions.

Examples

A few common cases:

  • Product pages with tracking params: /product/123?utm_source=x should point to /product/123
  • Same page on multiple paths: /blog/post and /blog/post/ should agree on one version
  • Filtered or sorted listings: /shoes?sort=price may canonically point to /shoes if the content is not meant to rank separately

HTML example:

<link rel="canonical" href="https://example.com/product/123" />

If you're scraping a page and want to inspect the canonical URL, you're usually looking for the rel="canonical" link in the head:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example.com/product/123?utm_source=ad"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# The canonical href may be relative, so resolve it against the fetched URL.
tag = soup.find("link", rel="canonical")
canonical = urljoin(url, tag["href"]) if tag and tag.get("href") else None
print(canonical)

With ScrapeRouter:

curl -X POST https://www.scraperouter.com/api/v1/scrape/ \
  -H "Authorization: Api-Key $api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/123?utm_source=ad"
  }'

Practical tips

  • Do not treat the canonical tag as ground truth: sites misconfigure this all the time, point everything to the homepage, or forget to update it after migrations.
  • Compare the fetched URL, final URL after redirects, and canonical URL: if all three disagree, that is usually a useful signal that the site has URL hygiene problems.
  • For scraping and deduplication, canonical URLs are helpful but not enough: pages with no canonical tag, broken tags, or cross-domain canonicals still happen a lot in production.
  • Normalize URLs before storing them: strip obvious tracking parameters, normalize trailing slashes where appropriate, and keep the raw source URL too.
  • If you crawl large sites, use canonical URLs to reduce duplicate work, but do not blindly drop non-canonical pages until you verify the site is consistent.
  • Watch for these failure modes: self-referencing canonicals missing, paginated pages all canonically pointing to page 1, faceted navigation collapsing to one URL, canonical tags generated differently on mobile vs desktop.
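The normalization tip above can be sketched as a small helper. The list of tracking parameters here is illustrative, not exhaustive — extend it for the sites you crawl, and keep the raw URL alongside the normalized one:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Common tracking parameters to strip (assumption: an illustrative starter
# list, not a complete one).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "gclid", "fbclid"}

def normalize_url(url: str) -> str:
    """Strip tracking params and the trailing slash; keep everything else."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    # Normalize the trailing slash, but keep a bare "/" for the root path.
    path = parts.path.rstrip("/") or "/"
    return urlunparse(parts._replace(path=path, query=urlencode(query)))

print(normalize_url("https://example.com/product/123/?utm_source=ad&sort=price"))
# https://example.com/product/123?sort=price
```

Note that this deliberately keeps non-tracking parameters like sort=price, since whether those identify distinct content is a per-site decision.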

Use cases

  • SEO debugging: figuring out why the page you expect is not the one Google seems to care about.
  • Scraper deduplication: grouping multiple URL variants under one preferred page so you do not store the same thing five times.
  • Crawl efficiency: cutting down wasted requests on parameter spam, sort orders, session IDs, and duplicate paths.
  • Site migrations: checking whether canonicals and redirects agree after a domain, path, or CMS change.
  • Marketplace and ecommerce monitoring: product URLs often show up with affiliate parameters, campaign tags, and internal tracking junk; canonical URLs help you collapse those variants into one record.
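The deduplication use case above amounts to grouping fetched records under one key: the declared canonical when present, the fetched URL otherwise. A minimal sketch, assuming records are dicts with hypothetical url and canonical keys:

```python
from collections import defaultdict

# Hypothetical scraped records: "url" is what was fetched, "canonical" is what
# the page's <link rel="canonical"> declared (None if absent or broken).
records = [
    {"url": "https://example.com/product/123?utm_source=ad",
     "canonical": "https://example.com/product/123"},
    {"url": "https://example.com/product/123?ref=email",
     "canonical": "https://example.com/product/123"},
    {"url": "https://example.com/product/456",
     "canonical": None},
]

groups = defaultdict(list)
for rec in records:
    # Prefer the declared canonical; fall back to the fetched URL so pages
    # without a canonical tag are still kept rather than dropped.
    key = rec["canonical"] or rec["url"]
    groups[key].append(rec)

for key, recs in groups.items():
    print(key, len(recs))
```

In production you would normalize the fallback URL first and, per the tips above, sanity-check the site's canonicals before trusting them as grouping keys.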

Related terms

  • Duplicate Content
  • Redirect
  • Meta Robots
  • Sitemap
  • URL Parameter
  • Crawl Budget
  • Robots.txt