27 April, 2008

HowTo simulate Googlebot

Googlebot is Google's web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer.
Googlebot identifies itself to the sites it visits with a special value in its HTTP request headers.

It uses a special user-agent string:
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

You can simulate Googlebot from a shell script by having the wget program send the same user-agent string.

Like this:

#!/bin/bash

TEST_URL="http://digg.com/"

FIREFOX_USERAGENT_STRING="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14"
GOOGLEBOT_USERAGENT_STRING="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Fetch the page as the Firefox browser and save it to firefox.html
wget --user-agent="$FIREFOX_USERAGENT_STRING" --output-document=firefox.html "$TEST_URL"

# Fetch the same page as Googlebot and save it to googlebot.html
wget --user-agent="$GOOGLEBOT_USERAGENT_STRING" --output-document=googlebot.html "$TEST_URL"


This script can be useful for testing a site's search engine optimization, for example to check whether the site serves the same content to Googlebot as to a regular browser.
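
Once both files are saved, a quick way to use them is to check whether the two responses match. The sketch below is only an illustration: compare_pages is a hypothetical helper that is not part of the script above, and it assumes the files were saved as firefox.html and googlebot.html.

```shell
#!/bin/bash

# compare_pages: hypothetical helper (not part of the original script).
# Prints "identical" when the two files match byte for byte,
# "different" when they do not, and "missing" when a file is absent.
compare_pages() {
    local first="$1"
    local second="$2"

    if [ ! -f "$first" ] || [ ! -f "$second" ]; then
        echo "missing"
    elif cmp -s "$first" "$second"; then
        # cmp -s prints nothing and exits with 0 only for identical files
        echo "identical"
    else
        echo "different"
    fi
}

# Example call, assuming the file names used by the script above:
# compare_pages firefox.html googlebot.html
```

If the result is "different", running diff -u firefox.html googlebot.html shows which lines changed. Dynamic pages can differ between any two requests, so a difference alone does not prove that the site treats Googlebot specially.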

...May the Force be with you...