Control What Gets Indexed – SEO Tip Week 19
Before reading this tip, make sure that you have read the tips from week 7 (Duplicate Content & URL Canonicalization) and 14 (Every Web Page Should Be Unique) to understand why unique content is valuable.
Under normal circumstances, you want every page to be unique, with one, and only one, URL displaying that content. However, there are situations where you might use multiple URLs for specific purposes, such as tracking your Pay Per Click (PPC) campaigns. Giving each PPC ad its own URL makes the ads easier to track in your website statistics or analytics packages.
Example Original URL: www.company.com/mypage.html
Example PPC Ad URL: www.company.com/mypage.html?src=ppc001
The problem with giving ads unique URLs is that the search engines, especially Yahoo, may index them as separate pages. The best solution is to 301 redirect each tracking URL to the original URL for that page, which passes along any importance the individual URLs may have accumulated. If you cannot use a 301 redirect for technical or other reasons, you can stop these URLs from being indexed with your robots.txt file.
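As a rough illustration of the redirect approach, here is a minimal Python sketch that works out where a tracking URL should be 301 redirected. The function name redirect_target and the TRACKING_PARAMS set are hypothetical, and the sketch assumes that "src" (from the example above) is the only tracking-only parameter. Your server or application framework would still have to record the visit and issue the actual 301 response.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "src" (from the PPC example) is the only tracking-only parameter.
TRACKING_PARAMS = {"src"}


def redirect_target(url):
    """Return the URL to 301 redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key not in TRACKING_PARAMS]
    if len(kept) == len(pairs):
        # No tracking parameter present, so the URL is already the original.
        return None
    # Rebuild the URL with the tracking parameters removed.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    print(redirect_target("http://www.company.com/mypage.html?src=ppc001"))
    print(redirect_target("http://www.company.com/mypage.html"))
```

Log the incoming URL before redirecting, otherwise the tracking information the ad URL was created for is lost.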
If you only have a few URLs that you want to disallow, you can list them individually:
User-agent: *
Disallow: /mypage.html?src=ppc001
Disallow: /mypage.html?src=ppc002
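If you want to check rules like these before publishing them, one option is Python's standard urllib.robotparser module, which applies plain prefix matching and does not understand wildcards. The sketch below is only a local sanity check under that assumption; the helper names build_parser and blocked are made up for this example, and real search engine crawlers may behave differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above.
RULES = """\
User-agent: *
Disallow: /mypage.html?src=ppc001
Disallow: /mypage.html?src=ppc002
"""


def build_parser(text):
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def blocked(parser, url):
    """True if a generic crawler ("*") is not allowed to fetch the URL."""
    return not parser.can_fetch("*", url)


if __name__ == "__main__":
    parser = build_parser(RULES)
    for url in (
        "http://www.company.com/mypage.html",
        "http://www.company.com/mypage.html?src=ppc001",
        "http://www.company.com/mypage.html?src=ppc003",
    ):
        print(url, "blocked" if blocked(parser, url) else "allowed")
```

Note that a new ad URL such as ?src=ppc003 is not covered until you add a line for it.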
Listing URLs individually becomes impractical if you are trying to disallow hundreds or even thousands of different URLs. This is where wildcards come in. While wildcards are not part of the robots.txt standard, Google, MSN, and Yahoo all support them. They can be extremely helpful in getting rid of some of those tricky content duplication problems.
The two supported wildcards are the asterisk "*", which matches any sequence of characters, and the dollar sign "$", which signifies the end of a URL string. A trailing asterisk is redundant, because a robots.txt rule already matches every URL that begins with the given path.
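To make the matching rules concrete, here is a small Python sketch that translates a Disallow path into a regular expression. It is an approximation of the behavior described above, not the search engines' actual implementation, and the function names rule_to_regex and is_disallowed are hypothetical.

```python
import re


def rule_to_regex(rule):
    """Translate a Disallow path using "*" and "$" into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each asterisk match any characters.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    # "$" pins the rule to the end of the URL; otherwise it is a prefix match.
    return re.compile(body + (r"\Z" if anchored else ""))


def is_disallowed(rule, path):
    """True if the path (including any query string) matches the rule."""
    # re.match only anchors at the start, which gives the prefix behavior.
    return rule_to_regex(rule).match(path) is not None


if __name__ == "__main__":
    print(is_disallowed("/mypage.html?src=", "/mypage.html?src=ppc001"))
    print(is_disallowed("/*.html$", "/mypage.html?src=ppc001"))
    print(is_disallowed("/folder*", "/folder/page.html"))
```

The last call shows why a trailing asterisk adds nothing: "/folder*" and "/folder" block exactly the same URLs.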
CompUSA Example
The CompUSA home page is indexed hundreds of times by Yahoo with many variations of referral and tracking parameters.
If they wanted to get rid of all URLs with parameters on their default.asp page, they could use:
User-agent: *
Disallow: /default.asp?
Additional Examples
If you have printable versions of all your HTML pages that contain "_print" in the URL, you would use:
User-agent: *
Disallow: /*_print*.html$
If you use a session ID parameter called "sessionid" for users who are logged in, you would use:
User-agent: *
Disallow: /*sessionid=
If you have a private folder called "private_memories" and you don’t want hackers to learn the full name of the folder simply by looking at your robots.txt file, you would use:
User-agent: *
Disallow: /private*/
As you can see, there are many uses for the robots.txt file now that all of the big 3 search engines support wildcards. Hopefully the official specification for robots.txt will support wildcards in the future, and all bots will understand them.






6 Responses to “Control What Gets Indexed – SEO Tip Week 19”
By Shell Harris on May 13, 2007 | Reply
Chris,
That is a great tip and one that can literally bring your site back into the rankings practically overnight.
By Orderer on May 16, 2007 | Reply
Wow that compusa stuff is really weird.
By Perfect Wealth Formula on May 17, 2007 | Reply
Can you use the noindex meta tag to accomplish this?
By Chris Alexander on May 18, 2007 | Reply
Yes, the noindex meta tag will also accomplish this. However, there are many instances where it’s difficult, or not possible, to use the noindex meta tag.
By Francisco Cheng on Jun 29, 2007 | Reply
Good post, this could be very useful information.
By Hello on Jun 10, 2008 | Reply
Using your robots.txt file is definitely the best way to go. You should also use the meta noindex tag on the individual pages.