Sometimes you just want to know which pages of your website Google have in their huge index of URLs. Perhaps you need this information for a site migration, to ensure all those important redirects are handled correctly. Perhaps you need the data as a crucial part of a technical website audit, to check for signs of duplication. Or perhaps there’s another reason, or you’re just curious! Whatever the reason, this seemingly simple task of obtaining a list of URLs indexed by Google is surprisingly challenging.
You may think that crawling the website with spider software such as Xenu or Screaming Frog will give you a list of all available URLs, but this only provides a list of the links accessible from within the website itself, not a list of all pages indexed by Google.
You would have thought Google could just provide the list but for whatever reason they do not currently share this information. Google Webmaster Tools (and Bing Webmaster Tools for that matter!) contains a feature which allows the webmaster to see the number of pages indexed but does not provide an option to export the list. Hopefully one day they’ll add this feature, but in the meantime you’ll have to resort to other methods.
I’m going to show you how to extract a list of all your URLs from Google in six easy steps, without scraping Google SERPs with automated tools and without the mundane task of manually copying and pasting each URL from a ‘site:’ search.
Disclaimer: Some may argue that this tutorial is itself a method of scraping Google search results, which I guess it kind of is; but in my mind, scraping usually implies automated tools with malicious intent. What we’re going to do is not intended for malicious purposes. In fact, it’s quite the opposite: it’ll help you, the webmaster, to understand which pages are indexed by Google and act accordingly. Plus, if Google were to just provide this data, we wouldn’t have to resort to these techniques!
SERP Link Extraction
You’re going to need to use Google Chrome to do this as you’ll need to install a Chrome extension.
Ready? Let’s go!
- Perform your site search - for example site:highposition.com
- Modify the number of results returned for your site search query. By default, Google limit the number of search results to 10 per page, so depending on the size of your site we’re going to need to increase this to 100 per page. Arguably we could proceed with 10 results per page instead of 100, but that approach increases the query depth and thus the number of requests made to Google. To increase the number of results per page, click the gear icon within the search results page and click ‘Search Settings’. Scroll down to ‘Google Instant predictions’, check ‘Never show Instant results’ and increase the ‘Results per Page’ to 100. Make sure these settings are saved.
- Next we’re going to use a Chrome extension called gInfinity to remove the 100-results-per-page limit by seamlessly merging groups of SERPs into a single list. Head over to the Chrome Web Store, then download and install the extension.
- Go back to your list of the first 100 search results for your “site:” search query. Scroll down to the bottom of the page and watch as gInfinity automatically queries the next batch of results, displaying them all on a single page. Loop through this step until you have all URLs listed on a single page. NOTE: If you’ve got a large website of hundreds of thousands of URLs, you might want to be a little cautious here. We’re querying Google multiple times using gInfinity to render the data, and the more data you query, the more suspicious Google gets, so this isn’t exactly a Google-friendly way of obtaining the list. That said, the worst Google will do is check that you’re a human through CAPTCHA validation. Also, you don’t necessarily have to extract all the data in a single hit. You could query 100 pages, extract the data, do the next 100 at a later date and so forth, amalgamating the data as you go. I prefer the single-hit method because it’s quick, but the decision is yours!
- Now for the clever bit. Once you’ve got your list of URLs (however many that may be) you need to extract the data.
Drag and drop this bookmarklet into your ‘Bookmarks’ toolbar:
Make sure you have the Google SERPS in front of you, and click the bookmarklet:
A new window will then open listing all of the URLs and anchor texts:
- Now you can copy and paste the data and do with it what you wish.
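The original bookmarklet isn’t reproduced here, but the core idea behind it is straightforward: collect every anchor on the page and keep only the outbound result URLs with their anchor text. The snippet below is a simplified, hypothetical sketch of that technique — the real bookmarklet walks the live SERP DOM, whereas here a small sample HTML string and a regex stand in so the logic runs anywhere (the function name, sample markup, and the `http` filter are all assumptions for illustration):

```javascript
// Hypothetical sketch of SERP link extraction.
// A real bookmarklet would iterate over document.links on the live page;
// here we match anchors in a string of HTML so the example is self-contained.
function extractLinks(html) {
  const links = [];
  const anchorPattern = /<a[^>]+href="([^"]+)"[^>]*>([^<]*)<\/a>/g;
  let match;
  while ((match = anchorPattern.exec(html)) !== null) {
    // Keep absolute URLs only, skipping Google's own relative navigation links
    if (match[1].startsWith('http')) {
      links.push({ url: match[1], anchor: match[2] });
    }
  }
  return links;
}

// Invented sample markup standing in for a fragment of a SERP
const sampleSerp =
  '<a href="http://highposition.com/">High Position</a>' +
  '<a href="/preferences">Settings</a>' +
  '<a href="http://highposition.com/blog/">Blog</a>';

console.log(extractLinks(sampleSerp));
```

In a real bookmarklet you would loop over `document.links` instead of regex-matching a string, and write the resulting table, URL list, and anchor list into a new window via `window.open()`.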
That’s it! Six easy steps to obtain a list of all URLs indexed by Google. The bookmarklet includes three sets of data:
- Link and anchor table
- A complete URL list
- A complete anchor text list
You don’t necessarily have to use this for search results either. This tool can potentially extract link data from any given page as it’s essentially just a manipulation of a webpage’s source code.
For other useful bookmarklets, check out this list of bookmarklets by High Position’s Head of Search, Tom Jepson, published last year. There are some great tools there, so be sure to check them out and let us know if you have any suggestions.
Finally, it would be remiss of me not to give credit where credit is due: my bookmarklet is a modified version of original code by Liam Delahunty of Online Sales, customised to extract SERPs. Be sure to check out the other tools by Liam.
Update: 15th October 2014
I thought it would be worth mentioning that, with Google continually tweaking the way in which search results are displayed, this bookmarklet will often require updating to reflect the changes within Google’s code. I have just updated the bookmarklet to ignore the autogenerated inline sitelinks.
Extract URLs from Google Image Search
For those who found this tool useful I have recently (January 2015) published a new tool for extracting URLs from Google’s Image search. Enjoy!