Keeping Google Out with robots.txt

an article added by: Carlos Torres at 04302007



This article is about partnering with Google: getting into the index, improving your PageRank, advertising on Google, distributing other people’s Google ads on your site, and other ways of building your online business through Google. So a section about rebuffing Google might seem counterproductive. But in the interest of covering all bases, here it is. Sometimes even publicity-hungry Webmasters want to keep Google away from certain parts of their business. Private pages designed for friends and semiprivate pages created for select visitors shouldn’t be indexed for the world at large. Entire sites that are still under development while existing on the Web in a live state might best be excluded from Google. It’s fairly easy to prevent Google from indexing an entire site or selected pages of a site even if the spider crawls your URL.

You can also prevent Google from caching pages of your site, a process by which Google stores a copy of each indexed page on its servers. This section explains how to prevent Google from crawling and caching your site.

Deflecting the crawl

The key to deflecting Google’s spider is the robots.txt file, which implements the Robots Exclusion Protocol. Google’s spider understands and obeys this protocol. The robots.txt file is a short, simple text file that you place in the top-level directory (root directory) of your domain server. (If you lease your Web space from your ISP, not from a dedicated Web host, you probably need administrative help in placing the robots.txt file.)

Create the robots.txt file in Notepad or another text editor, and transfer it as an ASCII text file. It’s best not to use Microsoft Word or another word processor to create the robots.txt file. But if you do, remember to save it as a plain text file with the .txt file extension. Then make sure you transfer it to your server as an ASCII file, not as a binary file; check the transfer mode of your FTP (file transfer protocol) program before you upload.

The robots.txt file contains two instructions:

-  User-agent. This instruction specifies which search engine crawler must follow the robots.txt instructions. You may specify Google’s spider, multiple specific spiders, or all spiders. (The command works for all spiders that seek and acknowledge the robots.txt file.)

-  Disallow. This line specifies which directories (Web page folders) or specific pages at your site are off-limits to the search engine. You must include a separate Disallow line for each excluded directory.

The robots.txt resource site

The information in this article gives you everything you need to construct an effective robots.txt file. If you want to know more, such as a list of spider names and general information about crawlers, go to the Web Robots Page here:

www.robotstxt.org

The FAQ (frequently asked questions) section at this site is particularly useful: www.robotstxt.org/wc/faq.html

A sample robots.txt file looks like this:

User-agent: *
Disallow: /

This example is the most common and simplest robots.txt file. The asterisk after User-agent means all spiders are excluded. The forward slash after Disallow means all site directories are off-limits.

The name of Google’s spider is Googlebot. (I would have preferred Charlotte.) If you want to exclude only Google and no other search engines, use this robots.txt file:

User-agent: Googlebot
Disallow: /

You may identify certain directories as out-of-bounds, either to Google or all spiders.

For example:

User-agent: *
Disallow: /cgi-bin/
Disallow: /family/
Disallow: /photos/

Notice the forward slashes at both ends of the directory strings in the preceding example. Google understands that the first slash implies your domain address before it. So the first Disallow line, if it were found at the bradhill.com site, would be shorthand for http://www.bradhill.com/cgi-bin/ and Google would know to exclude that directory from the crawl. The second forward slash means you’re excluding an entire directory. To exclude individual pages, type the page address following the first forward slash, and leave off the second forward slash, like this:

User-agent: *
Disallow: /family/reunion-notes.htm
Disallow: /blog/archive00082.htm
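
If you want to check a robots.txt file before you upload it, a few lines of Python can do the job. What follows is a minimal sketch, not an official Google tool: it relies on the urllib.robotparser module from Python’s standard library, and the www.example.com domain, the is_allowed helper, and the sample paths are made up for illustration. This parser only approximates how a well-behaved spider reads the file, so Google’s own spider may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# Directory and page exclusions in the style of the examples above.
DIRECTORY_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /family/reunion-notes.htm
"""

# A file that shuts out only Google's spider.
GOOGLE_ONLY_RULES = """\
User-agent: Googlebot
Disallow: /
"""

SITE = "http://www.example.com"  # hypothetical domain, for illustration only


def is_allowed(robots_txt, agent, url):
    """Return True if the named spider may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The trailing slash blocks the whole directory; the page rule blocks one file.
for path in ["/cgi-bin/search.pl", "/family/reunion-notes.htm", "/family/index.htm"]:
    print(path, is_allowed(DIRECTORY_RULES, "Googlebot", SITE + path))

# Only Googlebot is turned away by the second file; "OtherBot" is a made-up name.
for agent in ["Googlebot", "OtherBot"]:
    print(agent, is_allowed(GOOGLE_ONLY_RULES, agent, SITE + "/"))
```

Running the script prints one line per check, so you can confirm that each path or spider is treated the way you intended before the file goes live.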

Each excluded directory and page must be listed on its own Disallow line. Do not group multiple items on one line. To exclude a certain type of file, use the asterisk followed by the file extension on the Disallow line, like this:

User-agent: *
Disallow: /family/*.jpg

This example tells spiders to exclude .jpg files (a certain type of picture file) in the /family/ directory from indexing. Googlebot understands the asterisk as a wildcard, but not every spider does, so don’t rely on it to deflect other crawlers. In Google’s case, this sort of command is apt because Google devotes an entire search engine to images (www.google.com/images). If you want to exclude all images on your site from the Google Images index, use a robots.txt file with the name of Google’s Image spider, which is Googlebot-Image:

User-agent: Googlebot-Image
Disallow: /

Remember that your graphic logos are also included in this broad exclusion, and therefore won’t turn up in Google’s image search. That omission is normally not a problem and doesn’t affect the display of your images when people visit your site. Use the asterisk-plus-extension technique to exclude any type of file from the crawl, such as .doc and .pdf files.

Effects of the robots.txt file are not immediate in many cases, especially when you’re trying to exclude a page that’s currently included. First, you must wait for the spider to crawl your site again, and your site’s crawl cycle could be daily, monthly, or somewhere in between, depending on its PageRank. Second, the page you want excluded, if previously included, will live on in Google’s cache for some time. (See the section on avoiding the cache, later in this article, for information about requesting removal from the cache and avoiding the cache from the start of a page’s life.)
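
To see which addresses an asterisk-plus-extension pattern would catch, you can translate the pattern into a regular expression. This is a rough sketch under two assumptions: the asterisk stands for any run of characters, and the rest of the pattern is matched from the start of the path. The function names and sample paths are hypothetical, and a real spider may apply additional rules.

```python
import re


def pattern_to_regex(pattern):
    """Turn a Disallow pattern containing * wildcards into a compiled regex."""
    # Escape each literal piece, then let every * match any run of characters.
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(pieces))


def is_blocked(path, disallow_patterns):
    """Return True if any non-empty pattern matches from the start of path."""
    return any(
        pattern_to_regex(pattern).match(path)
        for pattern in disallow_patterns
        if pattern  # an empty Disallow line excludes nothing
    )


patterns = ["/family/*.jpg", "/*.pdf"]
for path in ["/family/photos/aunt.jpg", "/family/index.htm", "/docs/report.pdf"]:
    print(path, is_blocked(path, patterns))
```

Because matching starts at the beginning of the path, /family/*.jpg leaves a picture stored in some other directory alone, which is the behavior the Disallow line above describes.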

You may adjust the robots.txt file as often as you’d like. It’s a good tool when building fresh pages that you don’t want indexed while still under construction. When they’re finished, take them out of the robots.txt file.

Excluding pages with the meta tag

In some situations, using a meta tag to deflect spiders is easier than constructing a robots.txt file. If you code your HTML by hand, as opposed to using graphic design programs such as Dreamweaver or FrontPage, throwing in the meta tag is a piece of cake. Also, if you want to exclude only one page, or the occasional page here and there, the meta tag option could be easier. Using both meta tags and the robots.txt file is fine. Not all spiders understand the meta tag described here, but Google does.

Note: See Article 3 for the effective use of other meta tags that are part of the site optimization process.

You place meta tags between the <head> and </head> tags at the top of an HTML document. (Note that meta tags can be uppercase or lowercase.) To dissuade the Google spider from indexing any individual page of your site, put this tag among your other meta tags in that page’s HTML:

<meta name="robots" content="noindex, nofollow">

Note the two commands, noindex and nofollow. The first prevents Google from indexing your page, and the second prevents Google from following links on that page. If you want the page to be excluded from the index but would like Google to follow its outgoing links, leave off the nofollow command, like this:

<meta name="robots" content="noindex">

Make your command Google-specific by using the name of Google’s spider, Googlebot:

<meta name="googlebot" content="noindex, nofollow">
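
If your site has many hand-coded pages, it helps to confirm which ones actually carry these tags. The sketch below uses the html.parser module from Python’s standard library to collect the commands from any robots or googlebot meta tag on a page. The class name, the helper function, and the sample page are all hypothetical, and the code simply pools both tag names as an approximation of what applies to Google’s spider.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect commands from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # tag name ("robots" or "googlebot") -> set of commands

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attributes.get("content") or ""
            commands = {c.strip().lower() for c in content.split(",") if c.strip()}
            self.directives.setdefault(name, set()).update(commands)


def commands_for_google(html):
    """Return every command addressed to all spiders or to Googlebot."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives.get("robots", set()) | parser.directives.get("googlebot", set())


# A made-up page for illustration.
page = '<html><head><META NAME="robots" CONTENT="noindex, nofollow"></head><body></body></html>'
print(sorted(commands_for_google(page)))
```

The parser lowercases tag and attribute names for you, which is why the uppercase tag in the sample page is still recognized.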

Avoiding the cache

Other meta commands prevent pages from being copied into Google’s cache. The cache is a storehouse of Web pages copied by Google. Clicking the Cached link on a search results page quickly brings up the page as it appeared when last crawled, which might be different from how it appears now, live on the Web.

This feature is great for Google’s consumer users. I used it recently after watching David Letterman complain about the CBS.com site, which hosted a photo of archrival Jay Leno. By the time Letterman’s rant aired, late at night, CBS had already changed the site by replacing Leno’s picture with Letterman’s. I wanted to see the original gaffe, so I hit the Cached link in Google, and there it was. Frequently crawled sites that make major updates daily, such as Slate.com, generally run about a day behind in the Google cache.

Site owners are not universally happy about the Google cache. For one thing, the cache treads upon a gray area of copyright infringement, since Google does not obtain authorization to make copies of the sites it crawls. (Google does remove cached links upon request.) Second, when Webmasters change a page, they want it changed! Often, as in the CBS example, the site’s owner does not want people like me dredging up old mistakes. Prevent any page from entering the Google archive with the following meta tag:

<meta name="googlebot" content="noarchive">

Extend the command to all spiders fluent in meta tag commands by replacing googlebot with robots:

<meta name="robots" content="noarchive">
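
If you build pages from templates, you might prefer to generate these tags instead of typing them. Here is a small sketch, assuming you want one tag per page that lists every command for a given spider name; the robots_meta_tag function is hypothetical, and the only library call it uses is escape from Python’s standard html module.

```python
from html import escape


def robots_meta_tag(commands, spider="robots"):
    """Build a meta tag that gives the listed commands to the named spider."""
    content = ", ".join(commands)
    # escape() guards the attribute values, using plain straight quotes around them.
    return '<meta name="{}" content="{}">'.format(escape(spider), escape(content))


print(robots_meta_tag(["noarchive"], spider="googlebot"))
print(robots_meta_tag(["noindex", "nofollow"]))
```

The first call produces the Google-specific noarchive tag, and the second addresses every spider that understands meta tag commands.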

The invisibility problem

Deflecting Google’s spider when it reaches your site is easy enough, as the previous sections explain. A bigger problem is when Google reaches your site, but can’t see it. The spider is well equipped to make fine distinctions about your content, HTML tags, and link network, but it is a creature of simple tastes. Creating a site using certain technologies stumps the Google arachnid and sends it scurrying away empty-handed. In particular, three factors are apt to frustrate or displease Google:

-  Frames. Frames have been generally loathed since their introduction in the HTML specification early in the Web’s history. They wreak havoc with the Back button, and they confuse the fundamental format of Web addresses (one page per address) by dividing one page address into multiple portions that operate like little, independent Web pages. However, frames do have legitimate uses. Google itself uses frames to display threads in Google Groups (see Article 4). But the Google spider turns up its nose when it encounters frames. Framed pages are not necessarily excluded from the index. But errors can ensue, hurting both the index and your visitors: either your framed pages won’t be included, or searchers are sent to the wrong page because of addressing confusion. If you do use frames, make your site Google-friendly (and human-friendly) by providing links to unframed versions of the same content, as Google does in Google Groups. These links give Google’s diligent spider another route to your valuable content, and your visitors get a choice of viewing modes. Everybody wins.

-  Splash pages. Splash pages (not to be confused with doorway pages) are content-empty entry pages to Web sites. You’ve probably seen them. Some splash pages employ cool multimedia introductions to the content within, useless and invisible to Google. Others are mere static welcome mats that force users to click again before getting into the site. Google does not like pointing its searchers to splash pages. In fact, these tedious welcome mats are bad site design by any standard, even if you don’t care about Google indexing, and I recommend getting rid of them. Give your visitors, and Google, meaningful content from the first click, and you’ll be rewarded with happier visitors and better placement in Google’s index.

-  Dynamically generated pages. A dynamic page is one that is created on the fly based on choices made by the site visitor. Sites that pull their content from databases (XML sites provide a good example) generate dynamic pages. When Google crawls such a site, it can generate huge numbers of pages, sometimes crashing the site or its server. The Google spider picks up some dynamically generated pages, but generally backs off when it encounters dynamic content. As a result, the site’s content, hidden in its database, remains invisible to Google. The spider can’t collect it, evaluate it, index it, or apply PageRank to it. (Weblog pages do not fall into this category; they are dynamically generated by you, the Webmaster, but not by your visitors.)

Inadvertent invisibility is a good segue to the next article, which deals with design issues of all sorts in the quest to optimize pages for Google’s spider.

