Control What Gets Indexed – SEO Tip Week 19
Before reading this tip, make sure that you have read the tips from weeks 7 (Duplicate Content & URL Canonicalization) and 14 (Every Web Page Should Be Unique) to understand why unique content is valuable.
Under normal circumstances, you want every page to be unique, with one, and only one, URL displaying that content. However, there are situations where you might use multiple URLs for the same page, such as tracking your Pay Per Click (PPC) campaigns. Giving each PPC ad its own URL makes the ads easier to tell apart in your website statistics or analytics package.
Example Original URL: www.company.com/mypage.html
Example PPC Ad URL: www.company.com/mypage.html?src=ppc001
The problem with giving ads unique URLs is that those URLs may end up being indexed by the search engines, especially Yahoo, as duplicates of the original page. The best solution is to 301 redirect each tracking URL to the original URL for that page, which passes along any importance the individual URLs may have gained. If you cannot use a 301 redirect for technical or other reasons, you can stop these URLs from being indexed with your robots.txt file.
If you only have a few URLs that you want to disallow, you can list them individually:
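A minimal sketch of such a listing, assuming two additional hypothetical tracking codes (ppc002 and ppc003) alongside the example PPC URL above:

```
# Applies to all crawlers
User-agent: *
# One Disallow line per tracking URL (ppc002 and ppc003 are made-up examples)
Disallow: /mypage.html?src=ppc001
Disallow: /mypage.html?src=ppc002
Disallow: /mypage.html?src=ppc003
```

Each Disallow value is matched against the start of the URL path and query string, so the original www.company.com/mypage.html stays crawlable.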
Listing URLs one by one becomes impractical if you are trying to disallow hundreds or even thousands of different URLs. This is where wildcards come in. While wildcards are not part of the original robots.txt standard, Google, MSN, and Yahoo all support them. Wildcards can be extremely helpful in getting rid of some of those tricky content duplication problems.
The two supported wildcards are the asterisk "*", which matches any sequence of characters, and the dollar sign "$", which signifies the end of a URL string. A trailing asterisk is redundant, because a robots.txt rule already matches every URL that begins with the given string.
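To illustrate the two wildcards, here is a small sketch using hypothetical paths (/downloads/ and .pdf files are assumptions chosen for the example, not taken from a real site):

```
User-agent: *
# "*" matches any sequence of characters, "$" anchors the end of the URL:
# blocks every URL that ends in .pdf, but not /file.pdf?download=1
Disallow: /*.pdf$
# These two rules are equivalent, because rules already match by prefix,
# so the trailing asterisk adds nothing
Disallow: /downloads/*
Disallow: /downloads/
```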
The CompUSA home page is indexed hundreds of times by Yahoo with many variations of referral and tracking parameters.
If they wanted to get rid of all URLs with parameters on their default.asp page, they could use:
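A rule along these lines would likely do it, assuming every tracking variation is requested as default.asp followed by a query string:

```
User-agent: *
# Blocks /default.asp?anything while leaving plain /default.asp crawlable
Disallow: /default.asp?
```

No wildcard is needed here, since the rule already matches any URL that begins with "/default.asp?".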
If you have printable versions of all your HTML pages that contain "_print" in the URL, you would use:
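One possible rule, assuming the printable pages end in .html and that "_print" never appears in a URL you want indexed:

```
User-agent: *
# Blocks any URL containing "_print" followed somewhere later by ".html",
# for example /articles/widgets_print.html (a hypothetical page)
Disallow: /*_print*.html
```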
If you use a session ID parameter called "sessionid" for users who are logged in, you would use:
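A sketch of such a rule, assuming the parameter always appears in the URL as "sessionid=" followed by its value:

```
User-agent: *
# Blocks any URL that contains "sessionid=" anywhere in the path or query string
Disallow: /*sessionid=
```

Because the leading "*" matches any characters, the rule applies whether sessionid is the first parameter or a later one.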
If you have a private folder called "private_memories" and you don’t want hackers to learn the full name of the folder simply by looking at your robots.txt file, you would use:
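One way to write this, assuming no other top-level folder you want indexed has a name ending in "_memories":

```
User-agent: *
# Matches /private_memories/ without spelling out the full folder name
Disallow: /*_memories/
```

Which part of the name you replace with the wildcard is a judgment call: the less of the name you reveal, the more other folders the rule may block by accident.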
As you can see, there are many uses for the robots.txt file now that all three of the big search engines support wildcards. Hopefully the official specification for robots.txt will support wildcards in the future, and all bots will understand them.