Hello!

So you were reading your web server log and came across lines like these:
***.***.***.*** - - [27/Oct/2005:07:23:54 +0200] "GET /foo/ HTTP/1.1" 403 1890 "-" "thumbnail.cz robot 1.1 (http://thumbnail.cz/why-no-robots-txt.html)" "-"
***.***.***.*** - - [27/Oct/2005:07:23:54 +0200] "GET /foo/ HTTP/1.1" 200 1704 "-" "Mozilla/5.0 (compatible; Konqueror/3.3; Linux) KHTML/3.3.2 (like Gecko)" "-"
And now you have some questions?

Why did we visit your site?

Because somebody wanted a screenshot of your page.

Why two visits?

On the first visit we fetch your page and check whether it really exists, whether it contains some Flash, and so on. The second visit is from Konqueror (a real browser), from which we take the screenshot.
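To give an idea (this is not our actual code), here is a rough Python sketch of such a pre-check: fetch the page once and look for embedded Flash. The URL and the .swf test are placeholders only.

import urllib.request, urllib.error

url = "http://example.com/foo/"              # placeholder, not a real target
try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
except urllib.error.URLError as exc:         # page does not exist or is unreachable
    print("skip:", exc)
else:
    has_flash = ".swf" in html.lower()       # naive check for embedded Flash
    print("ok, flash:", has_flash)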

How did you find my site?

Because it is listed in some catalogue, usually Dmoz.org.

This is my secret page! You should not visit it!

As I said, the page is listed somewhere. If you do not want it visited, you should use some access control (like .htaccess in Apache).
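For example, a minimal .htaccess sketch that puts a directory behind a password; the file path and the realm name below are placeholders, adjust them for your server:

# .htaccess sketch: require a login for this directory
AuthType Basic
AuthName "Private area"
AuthUserFile /home/you/.htpasswd
Require valid-user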

Why do you ignore robots.txt?

Because this is not a crawler. We usually fetch just one page. Our robot does not recursively visit other links from your page; it usually visits just the first page of your site, so it cannot accidentally submit any of your forms. We visit only the page which is listed in some catalogue. If we fetched robots.txt, it would just be one more hit to your web site.
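For comparison, a crawler that does honor robots.txt has to make one extra request before fetching anything. A rough Python sketch (the URLs are placeholders; the user agent string is the one from the log above):

import urllib.robotparser, urllib.request

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()                                    # this alone is one extra hit to your site
if rp.can_fetch("thumbnail.cz robot 1.1", "http://example.com/foo/"):
    page = urllib.request.urlopen("http://example.com/foo/").read()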

Here is some very good reasoning from itubert of Dmoz.org:
This is the historical reason for the existence of robots.txt (emphasis mine):
http://www.robotstxt.org/wc/norobots.html wrote:
In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

Even the difference between a "browser" and a "robot" is not as clear-cut as some people think. The difference that matters is between "just fetching a page" and "crawling recursively and automatically" (just visiting the front page is not crawling). If you are just fetching (not crawling) one or a few specific pages from a site--pages that are accessible and any browser could visit--there's no "moral" difference between using internet explorer, wget, or a perl script to visit them. Especially if they are visited as a result of a decision by a human user. If I, the human, tell my user-agent "visit this page, dammit!", it better ignore robots.txt. Same if I tell it "visit these 10 pages", or "visit these 4 million pages". The responsibility is mine, not the "robot's". Otherwise, using your logic, it would be impossible to write a useful link checker (for example, Robozilla!), because it couldn't visit pages that are "forbidden" by robots.txt.

It's also worth mentioning that the robots exclusion protocol is voluntary, and as such it can only work when there's a benefit to both parties. Crawlers (real crawlers) obviously benefit from not getting stuck in infinite virtual trees or fetching duplicated information. Link checkers and thumbshot generators that are based on a finite list of links that were built by humans don't, because they don't crawl.
