How to Find All Existing and Archived URLs on a Website

There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you might want to:

Retrieve every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each of these scenarios, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
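If you'd rather skip the scraping plugin, the Wayback Machine also exposes a CDX API that returns captured URLs programmatically, without the web interface's 10,000-URL cap. Here's a minimal Python sketch; the domain is a placeholder:

```python
import requests

# Query the Wayback Machine CDX API for every URL captured on a domain.
# collapse=urlkey keeps one row per unique URL instead of one per capture.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",   # placeholder domain
        "matchType": "domain",  # include subdomains
        "fl": "original",       # return only the original URL field
        "collapse": "urlkey",
        "output": "json",
    },
    timeout=120,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the column header
print(f"{len(urls)} archived URLs found")
```

Expect plenty of resource files (images, scripts) in the results, so filter those out before merging this list with the others.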

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm if URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
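As a rough illustration of what you do with that export, here's a minimal Python sketch that pulls the unique target URLs out of a Moz Pro inbound-links CSV. The file name and the "Target URL" column header are assumptions; adjust them to match your actual export.

```python
import csv

# Collect unique target URLs from a Moz Pro inbound links export.
# "moz_inbound_links.csv" and the "Target URL" column are hypothetical;
# check the headers of your own export file.
targets = set()
with open("moz_inbound_links.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        url = row.get("Target URL", "").strip()
        if url:
            targets.add(url)

print(f"{len(targets)} unique target URLs on the site")
```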

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
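For reference, here's a minimal sketch of paging through the Search Analytics endpoint with the official Python client. It assumes a service account that has been granted access to the property; the site URL, date range, and file name are placeholders.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Service-account credentials; the account must be added as a user on the
# Search Console property (the JSON file name is a placeholder).
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    resp = service.searchanalytics().query(
        siteUrl="https://example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,  # per-request maximum
            "startRow": start_row,
        },
    ).execute()
    rows = resp.get("rows", [])
    pages.update(row["keys"][0] for row in rows)
    if len(rows) < 25000:  # last page reached
        break
    start_row += 25000

print(f"{len(pages)} pages with impressions")
```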

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to your report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they provide valuable insights.
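If you need more than the UI export allows, the GA4 Data API can pull the same page lists programmatically. Here's a minimal sketch using the official Python client; the property ID is a placeholder, and the /blog/ filter mirrors the segment from the steps above.

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # uses Application Default Credentials

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    # Mirror the "/blog/" segment from the steps above.
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
paths = {row.dimension_values[0].value for row in response.rows}
print(f"{len(paths)} blog paths")
```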

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be huge, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process (a short parsing sketch follows).
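As one example of that tooling, a few lines of Python can pull the unique paths out of an access log. This sketch assumes the common Apache/Nginx log format and a hypothetical file name; adjust the regex for your own format.

```python
import re
from urllib.parse import urlsplit

# Match the request line of a common/combined-format access log entry,
# e.g. "GET /blog/post-1?utm=x HTTP/1.1" (format is an assumption).
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        match = REQUEST_RE.search(line)
        if match:
            # Drop query strings so /page?a=1 and /page?a=2 count once.
            paths.add(urlsplit(match.group(1)).path)

print(f"{len(paths)} unique paths requested")
```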
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
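If you go the Jupyter route, a short pandas sketch covers the combine-and-deduplicate step. The file names are placeholders for the exports gathered above, and the normalization rules (trimming whitespace and trailing slashes) are just a starting point; pick rules that match how your site actually serves URLs.

```python
import pandas as pd

# Placeholder export files, assumed to hold one URL per line, no header.
files = ["archive_org.csv", "moz.csv", "gsc.csv", "ga4.csv", "logs.csv"]
frames = [pd.read_csv(name, header=None, names=["url"]) for name in files]
urls = pd.concat(frames, ignore_index=True)

# Normalize formatting: trim whitespace and trailing slashes so trivially
# different spellings of the same URL collapse together.
urls["url"] = urls["url"].str.strip().str.rstrip("/")

deduped = urls.drop_duplicates().sort_values("url")
deduped.to_csv("all_urls.csv", index=False)
print(f"{len(deduped)} unique URLs")
```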

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
