Pinpoint Real Duplicates with Data, Not Guesswork

Start by quantifying, not hypothesizing. Crawl your site with an enterprise-grade crawler and cluster pages by similarity using word shingles or fuzzy hashing. Layer in Search Console data: check the Page indexing report for “Duplicate, Google chose different canonical than user” and “Duplicate without user-selected canonical.” Validate a sample with URL Inspection to compare the user-declared canonical with the Google-selected one, then check which signals (canonical tags, redirects, internal links) disagree.
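As a rough illustration of shingle-based clustering, the sketch below scores pairs of pages by Jaccard similarity over word shingles. The function names, the three-word shingle size, and the 0.8 threshold are assumptions chosen for the example, not settings from any particular crawler. The pairwise loop is fine for a sample; a full crawl would need something like MinHash to scale.

```python
import re
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles for a page's extracted body text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(pages: dict, threshold: float = 0.8, k: int = 3) -> list:
    """Compare every pair of pages and keep those at or above the threshold.

    `pages` maps URL -> body text (hypothetical input shape).
    """
    sets = {url: shingles(body, k) for url, body in pages.items()}
    pairs = []
    for u, v in combinations(sorted(sets), 2):
        score = jaccard(sets[u], sets[v])
        if score >= threshold:
            pairs.append((u, v, score))
    return pairs
```

Feed it body text with navigation and footer boilerplate already stripped; otherwise shared templates inflate every score.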

Go beyond HTML. Inventory parameterized URLs, case variations, trailing-slash variants, HTTP vs. HTTPS, and subdomain duplication (www vs. non-www, m-dot). Don’t forget non-HTML assets—PDFs, print pages, and feed endpoints can silently split equity. Cross-domain duplications from syndication, staging sites left open, or CDN test paths often fly under the radar; add them to your dataset.

Prioritize by impact. Combine crawl clusters with log files to see what bots actually fetch and with analytics to see where users land. High-impression clusters with mixed canonicals, paginated categories mistakenly consolidated, and product variant explosions (color/size) come first. Decide the “source of truth” per cluster before you touch a single tag.

Consolidate Authority: Canonicals That Convert

Choose one canonical per intent, then make every signal sing the same tune. Use a clean 200 URL with a self-referential <link rel="canonical"> in the head. Mirror the choice in XML sitemaps (include only canonicals), internal links (update nav, breadcrumbs, and related modules), and structured data (point URLs at the canonical). For non-HTML (PDF, feeds), set the Link: rel="canonical" HTTP header.

Keep canonicals honest. They’re a hint, not a mandate, so don’t mix them with conflicting directives. Avoid pairing rel=canonical with noindex on the same page. Never canonicalize paginated results to page 1; self-canonical each page instead, or point the series at a view-all page only if you have a performant one. For international twins, use self-canonical plus hreflang across locales; do not cross-canonical between countries or languages.

Handle special ecosystems with precision. If you still run an m-dot, use rel="alternate" media on the desktop page to point at the mobile URL and canonical the mobile page back to desktop (or go responsive). For syndicated content, require partners to point a cross-domain canonical to your original, or delay publication on their site until your original is live. For product variants, consolidate near-identical SKUs to a parent PDP, or differentiate variants with unique copy, images, and attributes before you break them out.

Merge, Redirect, or Rewrite: Keep Equity Intact

If content is truly duplicative and the destination is permanent, ship a 301 redirect—full stop. Map at the most specific level you can, avoid chains, and resolve to HTTPS on a single canonical host. Use 302/307 only when the move is temporary. For dead ends, 410 can be cleaner than 404, but only if there’s no close replacement; otherwise, consolidate to the best-fit canonical.
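To make "avoid chains" concrete, here is a minimal sketch that rewrites a redirect map so every source points straight at its final destination. The `flatten_redirects` name and the dict-of-paths input are assumptions for illustration; a real map would come from your server config or CMS export.

```python
def flatten_redirects(redirects: dict) -> dict:
    """Collapse chains so each source maps directly to its final target.

    `redirects` maps source path -> target path. Raises ValueError on loops.
    """
    flat = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow the chain until the target is no longer itself redirected.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat
```

For example, a map of /old to /interim and /interim to /new comes back with both sources pointing at /new, so bots and users take one hop instead of two.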

When duplication is partial, merge and enrich. Consolidate into the strongest page, fold in the unique value from weaker siblings, and redirect them to the winner. Preserve relevance by porting internal anchor text, on-page sections, FAQs, and schema. After merging, update internal links and, where feasible, ask top referring domains to switch their backlinks to the canonical target.

If neither redirect nor canonical fits—because the intent is genuinely different but copy overlaps—rewrite. Elevate uniqueness with specific data, proprietary images, demos, FAQs, and comparisons. Thin templates that cause mass dupe (e.g., 3-line location pages) need substantial unique value or a noindex policy. Aim for one page per intent; when in doubt, converge rather than proliferate.

Lock It Down: Prevent Future Dupes at Scale

Normalize URLs at the edge. Enforce one protocol and host, collapse case, standardize trailing slashes, and strip tracking parameters from indexable paths. Kill session IDs and print pages. Prefer clean URL paths over query-string facets; if facets must exist, restrict crawl with a combination of internal link restraint, meta robots noindex,follow, and, once deindexed, selective robots.txt disallows. The old Google URL Parameters tool is gone—own this logic yourself.
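The normalization rules above might look like this as a single function. It is a sketch under stated assumptions: the canonical host (www.example.com), the tracking-parameter list, and the no-trailing-slash policy are placeholders to swap for your own, and it is meant only for URLs on your own site. Lowercasing paths is safe only if your server treats paths case-insensitively.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical policy values; replace with your own.
CANONICAL_HOST = "www.example.com"
TRACKING_PARAMS = {"gclid", "fbclid", "sessionid"}
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Return the normalized form of a same-site URL."""
    parts = urlsplit(url)
    # Collapse case and drop the trailing slash (except on the root).
    path = parts.path.lower() or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"
    # Strip tracking and session parameters; sort the rest for stability.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    kept.sort()
    # Force one protocol and one host; drop the fragment.
    return urlunsplit(("https", CANONICAL_HOST, path, urlencode(kept), ""))
```

At the edge, the usual pattern is to compare the requested URL with its normalized form and issue a 301 when they differ.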

Bake guardrails into your CMS. Auto-insert a single, correct self-canonical on every template. Block publishing if title/heading/slug are near-duplicates of an existing canonical. Require unique meta titles and enforce minimum body length where appropriate. Generate sitemaps from the canonical index only and refresh lastmod values accurately to keep crawlers focused.
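A publish-time guardrail for near-duplicate titles can be as small as the sketch below, which uses the standard library's difflib. The function name and the 0.9 threshold are assumptions; how the check is wired into a publish workflow depends on your CMS.

```python
from difflib import SequenceMatcher


def _normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(title.lower().split())


def find_near_duplicate_title(candidate: str, existing: list, threshold: float = 0.9):
    """Return the first existing title that is too similar, or None."""
    cand = _normalize(candidate)
    for title in existing:
        ratio = SequenceMatcher(None, cand, _normalize(title)).ratio()
        if ratio >= threshold:
            return title
    return None
```

If the function returns a title, the CMS could block the publish and show the editor the conflicting page. The same comparison can be applied to headings and slugs.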

Monitor relentlessly. Set dashboards for canonical mismatches, duplicate clusters, stray 200s at non-canonical hosts, and redirect chains. Watch Search Console for “Google chose different canonical” drift, and validate samples monthly. Use log files to see bots hitting retired URLs; keep redirects alive long enough to pass equity, then consider sunsetting with 410 once traffic and links taper.
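One input to a canonical-mismatch dashboard is a per-page check of what the HTML actually declares. The sketch below uses only the standard library parser; the status labels and function names are invented for this example, and it assumes you already have each page's normalized URL and fetched HTML.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_status(page_url: str, html: str) -> str:
    """Classify a page as 'missing', 'conflicting', 'self', or 'points-elsewhere'."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "missing"
    if len(set(finder.canonicals)) > 1:
        return "conflicting"
    return "self" if finder.canonicals[0] == page_url else "points-elsewhere"
```

Pages flagged "conflicting" or "missing" are template bugs. Pages flagged "points-elsewhere" are worth comparing against the source of truth you chose for that cluster.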

Duplicate content is a solvable systems problem: identify the clusters, select a single source of truth, consolidate signals, and harden the platform so dupes can’t respawn. Do it with discipline and you won’t merely protect rankings—you’ll concentrate them, simplify crawling, and accelerate every future launch. Clean signals compound.
