15 May 2003. A subscriber asked whether using the Google cache can, in effect, protect you from malicious Web sites and let you surf anonymously. Gary Price, Greg Notess and Michael Fagan respond.
Gary Price: When Googlebot crawls a Web page, it copies up to the first 101K of the page and stores that copy on a Google server. So when you click on the cached copy, you are not connecting to the live site. Accessing this cached page (example deleted) records a hit on the Google server. However, following any link on the Google cached page will connect you to the airport's Web site. (Gary Price runs The ResourceShelf.)
Greg Notess: Gary is right that Google grabs just the first 101K of a Web page, but that 101K covers only the HTML and underlying code. Images, for example, still come from the original server. In Gary's example of the Harrisburg airport site, most of the page consists of images, so most of it actually loads live from the flyhia.com domain.
This reliance on the original server can go beyond images to Java, JavaScript, or other scripting or programming aspects of more complex Web design. I have displayed some pages in Google's cache that try to load the entire page from the original server. A simple redirection script is enough to cause this.
For these reasons, I would not depend on Google's cache to make surfing anonymous. One small image on the original Web page could, in fact, track your visit even when you view only Google's cached version. The Internet Archive's Wayback Machine is safer, since some images are cached as well. But even that is not foolproof. (Greg Notess runs Search Engine Showdown.)
Michael Fagan: Google caches only the text of the Web page. None of the external files (JavaScript, Cascading Style Sheets, images, Flash, etc.) are saved by Google. If a Web site deletes its style sheet, for instance, and the old page is viewed in Google's cache, it will render without the style sheet. The same goes for images and other external files.
When a browser requests a file from a Web server, that request is stored in the Web site's logs. So viewing a page in Google's cache keeps you out of the site's Web log only if the page has no external files.
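As an editorial illustration of the point above, the following minimal Python sketch lists the external files a browser would request automatically when rendering a cached copy. The function name `external_resources`, the sample markup, and the `www.example.com` address are all hypothetical, and the list of tags is deliberately incomplete (it ignores Flash objects, CSS `@import` rules, script-generated requests, and so on), so treat it as a rough check under those assumptions rather than a complete privacy test.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ExternalResourceFinder(HTMLParser):
    """Collect URLs that a browser fetches automatically while rendering a page."""

    # Tags whose src attribute triggers an automatic request (not exhaustive).
    SRC_TAGS = {"img", "script", "iframe", "embed"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in self.SRC_TAGS and attributes.get("src"):
            self.resources.append(urljoin(self.base_url, attributes["src"]))
        elif tag == "link" and attributes.get("href"):
            # Only style sheets are counted here; other <link> types are skipped.
            if "stylesheet" in (attributes.get("rel") or "").lower():
                self.resources.append(urljoin(self.base_url, attributes["href"]))


def external_resources(html, base_url):
    """Return absolute URLs of the external files referenced by the HTML."""
    finder = ExternalResourceFinder(base_url)
    finder.feed(html)
    return finder.resources


# Hypothetical cached page: the text comes from the cache, the rest does not.
cached_html = """
<html><head>
<link rel="stylesheet" href="/style.css">
<script src="menu.js"></script>
</head><body>
<p>Text served from the cache.</p>
<img src="http://www.example.com/images/logo.gif">
<a href="/arrivals.html">Arrivals</a>
</body></html>
"""

for url in external_resources(cached_html, "http://www.example.com/index.html"):
    print(url)
```

Each URL printed is a request that would reach the original server and could appear in its logs. The ordinary link to `/arrivals.html` is not listed because it is requested only if you click it.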
Aside from Google, there are a number of other tools that cache Web pages. Gary Price mentions several, and notes that my URLinfo is useful for this. Note that Google News no longer provides cached pages.
Also, you really can't surf using Google's cache, because Google doesn't change the page except to put its header at the top. The links still take you to the original Web site. This differs from Google's translation feature: if you translate a Web page and then follow a link from that page, you go to the translated version of the linked page rather than the original. The Wayback Machine also changes the links of cached pages, which means you can navigate the cached pages of a site. (Michael Fagan runs Fagan Finder.)
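The difference between leaving links alone and rewriting them can be sketched in a few lines of Python. This is an editorial illustration only: the function names, the `www.example.com` page, and the `archive.example` prefix are hypothetical stand-ins, and neither Google nor the Internet Archive is claimed to work exactly this way.

```python
from urllib.parse import urljoin


def resolve_unmodified(href, page_url):
    """A cache that leaves links untouched: the link resolves to the original site."""
    return urljoin(page_url, href)


def resolve_rewritten(href, page_url, archive_prefix):
    """A service that rewrites links: the target is requested through the service."""
    return archive_prefix + urljoin(page_url, href)


page_url = "http://www.example.com/index.html"        # hypothetical original page
archive_prefix = "http://archive.example/web/20030515/"  # hypothetical archive prefix

print(resolve_unmodified("/arrivals.html", page_url))
print(resolve_rewritten("/arrivals.html", page_url, archive_prefix))
```

In the first case a click goes straight to the original server. In the second, the click goes back to the archiving service, which is what makes it possible to keep navigating among stored pages.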