A frequent Search Engine Optimization question is “how do search engines such as Google handle JavaScript and CSS?”
Historically, search engines processed web pages much like an old text-only browser such as Lynx. A search engine only “saw” what the simplest browser could display: simple HTML.
Largely for this reason, search engine optimization consultants have long advocated that site developers keep site coding simple and avoid hiding navigation systems in JavaScript menus and the like.
Today the situation is more complex. Google and the other search engines will try to extract links from anything they can, from PDF files to JavaScript embedded in a web page. This process is not foolproof, however: a site should still avoid relying solely on a JavaScript-based navigation system, especially when CSS is a better choice.
We can verify that Google knowingly downloads JavaScript and CSS code when this code is packaged in an external include file. The verification process is fairly simple if you have access to your website’s web server log files. Some hosting companies, like Italy’s otherwise well thought of Aruba, don’t provide access to server logs with their shared hosting services. You might want to exclude companies which don’t fully support web analytics from consideration when choosing your hosting.
To verify that Google is downloading your CSS and/or JavaScript files, search your log for Googlebot and the file suffix, e.g.
grep Googlebot access.log | grep "\.js"
where access.log represents your web server log file, your external JavaScript file has a .js suffix and your operating system knows what grep is (if it doesn’t, try grep for Windows or change your operating system!).
You should see output that includes lines similar to this:
66.249.66.73 - - [04/May/2007:16:09:36 -0700] "GET /j/newslink.js HTTP/1.1" 200 943 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
We only have one type of link to this file, a script declaration: <script type="text/javascript" src="/j/newslink.js"> (clearly tagged as JavaScript). The file is not listed in an XML sitemap.
So it appears that, yes, Googlebot is identifying and downloading our JavaScript files!
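If grep is not available, the same filter can be sketched in Python. This is only an illustration, not a replacement for a real log analyzer: it assumes the server writes the common "combined" log format, and the names LOG_LINE and crawler_fetches are made up for this example. Unlike the grep pipeline, which matches ".js" anywhere in the line, it checks the end of the requested path.

```python
import re

# Assumed "combined" log format:
# ip identity user [time] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_fetches(lines, agent_token="Googlebot", suffix=".js"):
    """Return (ip, path, status) for requests by agent_token whose path ends in suffix."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        path = match.group("path").split("?", 1)[0]  # ignore any query string
        if agent_token in match.group("agent") and path.endswith(suffix):
            hits.append((match.group("ip"), path, int(match.group("status"))))
    return hits
```

Passing an open log file, for example crawler_fetches(open("access.log")), returns one tuple per matching request. Passing suffix=".css" checks for stylesheets instead.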
But unfortunately, we’re not yet finished. We need to verify that Googlebot is really Googlebot and not someone pretending to be Googlebot. Why someone would want to spoof Googlebot is a subject for another post, but suffice it to say that it is easy to do using Firefox’s UserAgent Switcher or other tools.
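So how can we verify that Googlebot came from Google? The easiest way is to ensure that the IP address maps to a Googlebot crawler host name (a reverse DNS lookup) and that the host name maps back to the same IP address (a forward lookup). This is an example using Linux: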
$ host 66.249.66.73
73.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-73.googlebot.com.
$ host crawl-66-249-66-73.googlebot.com
crawl-66-249-66-73.googlebot.com has address 66.249.66.73
So we’ve verified Googlebot, from 66.249.66.73 aka crawl-66-249-66-73.googlebot.com, is actively looking for and downloading our JavaScript include files.
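The two host lookups can also be scripted when there are many IP addresses to check. The following is a minimal sketch using Python's standard socket module. The function name is_verified_crawler and its parameters are invented for this example, and the default ".googlebot.com" suffix is simply the domain seen in the output above. The lookup functions are passed in as parameters only so that the check can be exercised without network access.

```python
import socket


def is_verified_crawler(ip, allowed_suffixes=(".googlebot.com",),
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Reverse lookup, domain check, then forward lookup back to the same IP."""
    try:
        # Step 1: IP address -> host name (what `host 66.249.66.73` does).
        hostname = reverse_lookup(ip)[0].rstrip(".").lower()
        # Step 2: the host name must sit inside an expected crawler domain.
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        # Step 3: host name -> addresses; the list must include the original IP.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        # Failed lookups (socket.herror, socket.gaierror) count as unverified.
        return False
    return ip in addresses
```

Note that socket.gethostbyname_ex resolves IPv4 addresses only, so this sketch would need adapting for IPv6 log entries.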
For CSS,
grep Googlebot access.log | grep "\.css"
returns
66.249.66.73 - - [07/May/2007:09:10:07 -0700] "GET /c/screen.css HTTP/1.1" 200 11056 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Naturally, Googlebot probably doesn’t crawl your include files every day, so you’ll need to check log files for an appropriate time period. You can perform similar checks for Yahoo! Slurp (crawl.yahoo.net) and Ask Jeeves/Teoma (sample host name: egspd42146.ask.com; the five-digit number varies).
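When checking several crawlers at once, the host names returned by a reverse lookup can be matched against a small table. The sketch below uses only the three domains mentioned in this post. The names CRAWLER_DOMAINS and crawler_for_host are hypothetical, and the domain list is an assumption that should be checked against each search engine's own documentation.

```python
# Domains taken from the examples in this post; treat the list as an assumption.
CRAWLER_DOMAINS = {
    "googlebot.com": "Googlebot",
    "crawl.yahoo.net": "Yahoo! Slurp",
    "ask.com": "Ask Jeeves/Teoma",
}


def crawler_for_host(hostname):
    """Return the crawler whose domain contains hostname, or None."""
    host = hostname.rstrip(".").lower()
    for domain, crawler in CRAWLER_DOMAINS.items():
        # Match on a label boundary so "evilgooglebot.com" is not accepted.
        if host == domain or host.endswith("." + domain):
            return crawler
    return None
```

A host name that matches still needs the forward lookup back to the original IP address before the visit can be trusted.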
Currently Google’s use of these files, beyond simple link extraction, is probably limited to search engine spam analysis.
We haven’t seen any sign that text embedded in JavaScript write statements, or hidden by CSS, actually shows up in search engine queries. In other words, from a search engine ranking point of view, Google will see a web page much like the Lynx browser. Until Google supports a partial page noindex mechanism similar to Yahoo!’s robots-nocontent option, that’s probably the way it should be. Google has direct experience of what can go wrong when bots execute web code, albeit on improperly coded websites.
Do keep in mind that what is true today may not be true tomorrow. Google’s technical prowess should never be underestimated.