Building eCommerce XML sitemaps with ScreamingFrog SEO Spider

This blog post is the first in a series of eCommerce SEO posts focusing on technical SEO, including: product 301s for eCommerce site migrations, writing meta data at scale, and combining website scraping with Google Analytics data.

One of the key battles in the eCommerce SEO arena is ensuring that your target search engines fully index your website. To make this easier, most search engines allow you to submit an XML sitemap: a list of all the URLs on your site that helps the search engine find your pages.
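
For reference, an XML sitemap follows the sitemaps.org protocol: a simple XML file with one <url> entry per page. The snippet below is purely illustrative; the example.com URL and the optional <lastmod>, <changefreq> and <priority> values are placeholders rather than anything taken from a real site:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/category/widgets/</loc>
        <lastmod>2013-09-30</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>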

Given the sheer number of sitemap tools on the internet, and the complexity of a typical eCommerce website, it’s worth picking your tool carefully. One tool that stands out for its ease of use is ScreamingFrog SEO Spider; its filters make it ideal for building an eCommerce XML sitemap.

So what are the steps required to build an eCommerce sitemap?

1: Configuring the crawler

Some eCommerce websites have tens or even hundreds of thousands of URLs and can take hours to crawl, so it’s important to make sure your sitemap tool is configured correctly, with any filters set up before you start.

To start your sitemap, make sure a few different things are filtered out in the settings, specifically images, CSS, JavaScript and external links, in addition to any pages listed in robots.txt; this ensures that your sitemap tool only crawls the pages required.

To filter out the elements listed above, open the ‘Spider’ configuration menu, under ‘Configuration’ > ‘Spider’ in the main menu, and un-tick all the options, leaving only ‘Check Links Outside of Start Folder’ ticked.

In addition to the above set-up, you may also wish to exclude additional pages from the website crawl. For instance (depending on your canonical set-up), you may wish to exclude filtered versions of product category pages, such as a ‘view all’ version of a page rather than the paginated version.

To do this, select ‘Configuration’ > ‘Exclude’ from the menu and enter your exclusion filter. This will exclude a group of pages sharing the same URL string, using the wildcard operator, as in the example below.
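
The exclude filter accepts regular expressions, with ‘.*’ acting as the wildcard. As a purely hypothetical example, if your ‘view all’ category pages were identified by a ‘?view=all’ query parameter (your own URL structure will differ), a pattern like the one below would keep them out of the crawl:

    .*\?view=all.*

A similar pattern, such as http://www.example.com/search/.* , would exclude everything under a given sub-folder.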

Finally, before you start running your spider, it’s important to remember that, depending on the size of your website, it could take several hours to crawl. With this in mind, it’s a good idea to make sure you’ve allocated the maximum amount of memory (RAM) possible, otherwise on large websites your crawl speed can slow down. There’s a guide on ScreamingFrog.co.uk to help you with this. Generally I allocate as much RAM as possible, leaving 1GB spare; e.g. if I have 4GB of RAM, I may allocate 3GB to SFSS.
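
As a rough sketch of what that change looks like (follow the ScreamingFrog guide for the exact file name and location for your version and operating system; the details below are an assumption based on the Windows install): SFSS is a Java application, so the allocation is set by editing the ‘-Xmx’ heap value in its launcher configuration file, e.g. ScreamingFrogSEOSpider.l4j.ini in the installation folder, replacing the default value with something like:

    -Xmx3g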

2: Start your website crawl

Once your configuration is complete, enter your domain name and start the website crawl. Be patient as it could take several hours.

In some instances, the process of crawling your website may be too intensive for your servers. This may result in some pages producing a ‘no response’ error. If this occurs, ask SFSS to crawl those pages again, ensuring that your sitemap includes as much of your website as possible.

To get SFSS to re-crawl your ‘no response’ URLs, follow the below steps:

  1. While the spider is running, or once it has finished, navigate to the ‘Response Codes’ tab.
  2. From the filter above the grid, select ‘No Response’.
  3. Select all entries in the list (this can be done by pressing Ctrl + A).
  4. Right-click and select ‘Re-Spider’ from the context menu.

Any URL which didn’t receive a response the first time around will then be re-crawled. Depending on the size of your website, you may need to repeat the above process more than once to ensure that every page on your website has been crawled.

3: Sitemap clean-up and checks

Before you export your sitemap, make certain that there aren’t any unwanted URLs in the crawl, such as 404s or pages with rel="canonical" tags.

Remove broken and redirected URLs

Navigate to the ‘Response Codes’ tab and select ‘Client Error 4XX’ from the filter menu. Delete all the URLs listed: click one URL, press ‘Ctrl + A’ to select them all, then press the delete key.

Now repeat the above process for the ‘Server Error 5XX’ and ‘Redirection 3XX’ filters within the ‘Response Codes’ tab.

Remove non-canonical and no-index pages

Navigate to the ‘Directives’ tab and select ‘Canonical’ from the filter menu, then delete all the URLs listed using the same process as above: click one URL, select them all with ‘Ctrl + A’, then press the delete key.

Repeat this process for the ‘No-Index’ filter within the ‘Directives’ tab.
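
For context, these filters pick up two standard on-page signals: a rel="canonical" link element pointing to a preferred URL, and a robots meta tag containing ‘noindex’. The illustrative markup below (example.com URL is a placeholder) shows the sort of tags involved; pages carrying them shouldn’t normally appear in your sitemap:

    <link rel="canonical" href="http://www.example.com/widgets/" />
    <meta name="robots" content="noindex, follow" />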

Remove all external URLs

Finally, navigate to the ‘External’ tab, select ‘All’ from the filter menu and delete all the URLs listed.

4: Key landing page check

As one final check through the URLs, make sure all of your top 10 landing pages are included.

To do this, run the landing page report for organic traffic within Google Analytics, then try to find each of those URLs in SFSS using the search bar, with the ‘Internal’ tab open. A scripted version of this check is sketched below.
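
If you have a lot of pages to check, the manual search can be scripted. The Python sketch below is a minimal illustration under assumed inputs: a CSV export of the GA landing page report and a CSV export of the SFSS ‘Internal’ tab, with hypothetical file names and column headings that you would adjust to match your own exports.

    import csv

    # Landing page paths exported from the Google Analytics organic landing page report.
    # The file name and 'Landing Page' column heading are assumptions -- match your export.
    with open("ga_landing_pages.csv", newline="", encoding="utf-8") as f:
        landing_paths = [row["Landing Page"] for row in csv.DictReader(f)]

    # Full URLs exported from the SFSS 'Internal' tab after the step 3 clean-up.
    # Again, the file name and 'Address' column heading are assumptions.
    with open("sfss_internal.csv", newline="", encoding="utf-8") as f:
        crawled_urls = {row["Address"] for row in csv.DictReader(f)}

    # GA reports paths (e.g. "/category/widgets/") while the crawl stores full URLs,
    # so compare by suffix.
    top_ten = landing_paths[:10]
    missing = [path for path in top_ten
               if not any(url.endswith(path) for url in crawled_urls)]

    print("Top landing pages missing from the crawl:", missing or "none")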

5: Export your sitemap

Once you’re satisfied that you’ve successfully crawled your website to include all relevant pages, and carried out the necessary clean-up, it’s time to export your sitemap.

Before exporting your sitemap, it’s always worth saving a copy of the current crawl file, via the file menu, in case you need to investigate it at a later date.

To export your XML sitemap, simply select ‘Advanced Export’ > ‘XML Sitemap’ from the file menu… and you’re done!
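
As an optional last step, you can sanity-check the exported file before submitting it. The short Python sketch below simply parses the sitemap and counts the URLs; ‘sitemap.xml’ stands in for whatever file name you chose when exporting.

    import xml.etree.ElementTree as ET

    # The sitemaps.org namespace used by standard XML sitemaps.
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    tree = ET.parse("sitemap.xml")
    urls = [loc.text for loc in tree.getroot().findall("sm:url/sm:loc", NS)]

    print(len(urls), "URLs found in the sitemap")
    print("Sample:", urls[:5])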

Look out for the second instalment in this blog series: Combining Excel and ScreamingFrog SEO Spider to build product 301s for eCommerce websites.

by Austin Waddecar on 30/09/2013
