Control What Gets Indexed – SEO Tip Week 19

Written on May 11, 2007 – 1:47 pm | by Shell Harris

Before reading this tip, make sure you have read the tips from week 7 (Duplicate Content & URL Canonicalization) and week 14 (Every Web Page Should Be Unique) to understand why unique content is valuable.

Under normal circumstances, you want every page to be unique and to have one, and only one, URL displaying that content. However, there are situations where you might use multiple URLs for a specific purpose, such as tracking your Pay Per Click (PPC) campaigns. Giving each PPC ad its own URL makes the ads easier to track in your website statistics or analytics package.

Example Original URL: www.company.com/mypage.html
Example PPC Ad URL: www.company.com/mypage.html?src=ppc001

The problem with giving ads unique URLs is that they may end up being indexed by the search engines, especially Yahoo. The best solution is to 301 redirect those URLs to the page's original URL, thereby consolidating any importance (such as link value) the individual URLs may have gained. If you cannot use a 301 redirect for technical or other reasons, you can stop these URLs from being indexed with your robots.txt file.
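Before moving on to robots.txt, here is what that 301 redirect might look like if your site runs on Apache with mod_rewrite enabled. This is only a sketch, and it assumes the tracking parameter is named "src" as in the example above; other servers and frameworks have their own equivalents:

RewriteEngine On
# Only touch requests whose query string carries the PPC tracking parameter
RewriteCond %{QUERY_STRING} ^src=ppc
# Issue a 301 to the same page with the query string stripped (the trailing ? drops it)
RewriteRule ^(.*)$ /$1? [R=301,L]

Your raw server logs will still record the tagged request before the redirect fires, although JavaScript-based analytics packages may only ever see the final URL.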

If you only have a few URLs that you want to disallow, you can list them individually:

User-agent: *
Disallow: /mypage.html?src=ppc001
Disallow: /mypage.html?src=ppc002

This becomes problematic if you are trying to disallow hundreds or even thousands of different URLs. This is where wildcards come in. While wildcards are not part of the official robots.txt standard, Google, MSN, and Yahoo all support them, which can be extremely helpful in cleaning up some of those tricky content duplication problems.

The two supported wildcards are the asterisk "*", which matches any sequence of characters, and the dollar sign "$", which marks the end of a URL string. A trailing asterisk is redundant, since matching everything that begins with the listed path is already the natural behavior of the robots.txt standard.
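To illustrate (using a hypothetical /images folder and some .pdf files purely as examples), the first two lines below block exactly the same set of URLs, while the last line blocks only URLs that end in .pdf:

User-agent: *
# These two lines are equivalent; the trailing asterisk adds nothing
Disallow: /images
Disallow: /images*
# The dollar sign anchors the match to the end of the URL
Disallow: /*.pdf$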

CompUSA Example

The CompUSA home page is indexed hundreds of times by Yahoo with many variations of referral and tracking parameters.

If they wanted to get rid of all URLs with parameters on their default.asp page, they could use:

User-agent: *
Disallow: /default.asp?
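To reuse the hypothetical tracking parameter from earlier, a URL such as www.compusa.com/default.asp?src=ppc001 would be blocked by this rule, while the plain www.compusa.com/default.asp would remain crawlable and indexable.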

Additional Examples

If you have printable versions of your HTML pages, and those versions contain "_print" in the URL, you would use:

User-agent: *
Disallow: /*_print*.html$
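With this rule, a hypothetical printable page such as /services_print.html would be blocked, while the standard /services.html version would remain indexable.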

If you use a session ID parameter called "sessionid" for users who are logged in, you would use:

User-agent: *
Disallow: /*sessionid=
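Here, a hypothetical URL such as /checkout.asp?sessionid=12345 would be blocked no matter where the parameter appears in the query string, while the same page without the session ID stays indexable.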

If you have a private folder called "private_memories" and you don’t want to reveal its full name to anyone who reads your robots.txt file, you would use:

User-agent: *
Disallow: /private*/
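Note that this pattern errs on the side of breadth: it keeps /private_memories/ out of the index without spelling out the full folder name, but it would also block any other folder whose name starts with "private", such as a hypothetical /private_downloads/ directory.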

As you can see, there are many uses for the robots.txt file now that all three of the big search engines support wildcards. Hopefully the official robots.txt specification will support wildcards in the future, so that all bots understand them.

Shell Harris co-founded Big Oak on January 1, 2004. In a previous career he was a print & web designer and often developed the sites he designed before focusing on his current passion for search engine optimization and Internet marketing. He is an avid researcher, SEO specialist, company mouthpiece and is always looking for the next big thing in Internet Marketing.


6 Responses to “Control What Gets Indexed – SEO Tip Week 19”

  1. By Shell Harris on May 13, 2007 | Reply

    Chris,
    That is a great tip and one that can literally bring your site back into the rankings practically overnight.

  2. By Orderer on May 16, 2007 | Reply

    Wow that compusa stuff is really weird.

  3. By Perfect Wealth Formula on May 17, 2007 | Reply

    Can you use the noindex meta tag to accomplish this?

  4. By Chris Alexander on May 18, 2007 | Reply

    Yes, the noindex meta tag will also accomplish this. However there are many instances where it’s difficult, or not possible, to use the noindex meta tag.

  5. By Francisco Cheng on Jun 29, 2007 | Reply

    Good post, this could be very useful information.

  6. By Hello on Jun 10, 2008 | Reply

    using your robots.txt file is definitely the best way to go. you should also use the meta noindex tag on the individual pages.

