SEO Tutorial – How To Improve Your SEO With Robots.txt and Canonical Headers

In this post, we explain how to improve your SEO with robots.txt and canonical headers.

Search engine crawlers (aka spiders or bots) scan your site and index whatever they can. This happens whether you like it or not, and you might not want sensitive or autogenerated files, such as internal search results, showing up on Google.


Fortunately, crawlers check for a robots.txt file at the root of the site. If it’s there, they’ll follow the crawl instructions inside, but otherwise they’ll assume the entire site can be indexed.

Here’s a simple robots.txt file:
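A file matching the description below might look like this:

```
User-agent: *
Allow: /wp-content/uploads/
Disallow: /
```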

  • The first line states which agent (crawler) the rules apply to. In this case, User-agent: * means the rules apply to every crawler.
  • The subsequent lines specify which paths can (or cannot) be indexed. Allow: /wp-content/uploads/ allows crawling of your uploads folder (images), and Disallow: / means no file or page should be indexed aside from what’s been allowed previously. You can have multiple rules for a given crawler.
  • The rules for different crawlers can be listed in sequence, in the same file.

Robots.txt Examples

This rule lets crawlers index everything. Because nothing is blocked, it’s like having no rules at all:
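A minimal version of such a rule (an empty Disallow blocks nothing):

```
User-agent: *
Disallow:
```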

This rule lets crawlers index everything under the “wp-content” folder, and nothing else:
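For example:

```
User-agent: *
Allow: /wp-content/
Disallow: /
```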

This lets a single crawler (Google) index everything, and blocks the site for everyone else:
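One way to write it, using Google’s crawler name (Googlebot):

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```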

Some hosts may have default entries that block system files (you don’t want bots kicking off CPU-intensive scripts):
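A typical example (the paths below are common defaults; your host may use different ones):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
```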

Block all crawlers from a specific file:
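For instance (/private.html is a placeholder path):

```
User-agent: *
Disallow: /private.html
```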

Block Google from indexing URLs with a query parameter (which is often a generated result, like a search):
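One way to express this, using the * wildcard that Googlebot understands (not every crawler supports wildcards):

```
User-agent: Googlebot
Disallow: /*?
```

This blocks any URL containing a “?”, which covers query-string URLs such as search results.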

Google’s Webmaster Tools (now Search Console) can help you check your robots.txt rules.
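You can also test rules locally. Python’s built-in urllib.robotparser evaluates a robots.txt against sample URLs; here is a sketch using the example rules from the start of the post (example.com is a placeholder):

```python
from urllib import robotparser

# The example rules from the beginning of the post
rules = """\
User-agent: *
Allow: /wp-content/uploads/
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Uploads are allowed; everything else is blocked
print(rp.can_fetch("*", "https://example.com/wp-content/uploads/logo.png"))
print(rp.can_fetch("*", "https://example.com/secret-page/"))
```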

Setting Up a “Crawler Friendly” CDN

Crawlers see your site the same way visitors do, including content loaded from CDNs. If your images are served from a CDN, Google fetches them from the CDN rather than from your origin server, since the origin no longer delivers the image files.

  • Interestingly, Google treats subdomains of your site used for static file delivery with more “respect” than third-party domains used for the same purpose. It is therefore highly recommended to set up a CNAME for your CDN files and add it to Google Webmaster Tools if you want to monitor the index rate for images.
  • To make sure your CDN treats crawlers appropriately, ensure that nothing but images is accessible to crawlers on the CDN servers – unless you are using a full-site caching method of delivery. Your origin server has its own robots.txt, available at the root of the site, and it probably allows every page and image to be indexed. On the CDN, change your custom robots.txt settings (under the “SEO” tab in the control panel) and make sure that only images are “open” for indexing (and/or any HTML pages to which you’ve added a canonical header):
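An image-only policy for the CDN host might look like this (a sketch; Googlebot understands the * and $ wildcards, and you should extend the list with any other image extensions you serve):

```
User-agent: *
Allow: /*.jpg$
Allow: /*.jpeg$
Allow: /*.png$
Allow: /*.gif$
Disallow: /
```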

Make sure this robots.txt rule goes on your CDN, not your origin server.

  • For WordPress sites running the Yoast SEO plugin, a short code snippet can filter the function where image links are generated:
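A sketch for the Yoast case, assuming its wpseo_xml_sitemap_img_src filter and placeholder hostnames (replace www.example.com and cdn.example.com with your own origin and CDN CNAME):

```php
// Rewrite sitemap image URLs to point at the CDN.
// Hostnames below are placeholders – use your origin and CDN CNAME.
add_filter( 'wpseo_xml_sitemap_img_src', function ( $src ) {
    return str_replace( 'https://www.example.com', 'https://cdn.example.com', $src );
} );
```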

Then regenerate your existing sitemaps, since the code above will produce different URLs for the images.

  • For any HTML page served from the CDN, it is good to add canonical headers. While Google ignores canonicals on images, it honours them on HTML files, so we can use the rel=”canonical” header to indicate the original source of a page (rel=”canonical” works both inside HTML tags and as a separate HTTP header). Crawlers that attempt to index a file from the CDN will see the canonical URL and store that instead, improving your SEO.

Here’s a sample .htaccess configuration for your origin server:
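One common pattern is a sketch combining mod_rewrite and mod_headers (www.example.com is a placeholder for your origin hostname):

```
<IfModule mod_rewrite.c>
  RewriteEngine On
  # Export the request URI so mod_headers can reference it
  RewriteRule .* - [E=CANONICAL:https://www.example.com%{REQUEST_URI}]
</IfModule>
<IfModule mod_headers.c>
  # Attach a canonical Link header to HTML pages
  <FilesMatch "\.(html|php)$">
    Header set Link "<%{CANONICAL}e>; rel=\"canonical\""
  </FilesMatch>
</IfModule>
```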

Finally, purge the CDN so the cached copies pick up the new headers. To check and confirm that canonicals are applied to CDN assets as well, use curl as follows:
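For example (cdn.example.com and the page path are placeholders):

```
curl -sI https://cdn.example.com/sample-page.html | grep -i '^link'
```

The response headers should include a Link entry with rel="canonical" pointing at the origin URL.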