Enhanced Robots.pm for AWStats: Recognize Additional Robots in your Web Analytics Reports

Accurate recognition of Internet robots is essential for separating automated traffic from human page views in your Web Analytics reports. AWStats takes a modular approach to robot recognition: all robot definitions live in a Perl module, robots.pm, separate from the main AWStats program. We have been adding the robots we come across, more than 150 to date. Where possible, we also provide a link to the robot's home page, which appears in the AWStats reports.
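
To make the modular idea concrete, here is a minimal sketch in Python (not the Perl that robots.pm itself is written in). It assumes an ordered list of patterns plus a lookup table of report labels; the names `ROBOT_SEARCH_ORDER`, `ROBOT_LABELS` and `identify_robot`, and the wording of the labels, are hypothetical. The two URLs and the `\wbot[\/\-]` catch-all are taken from the change list further down this page.

```python
import re
from typing import Optional

# Patterns are tried in order, so specific robots must come before the catch-all.
ROBOT_SEARCH_ORDER = [
    r"feedfetcher-google",
    r"mj12bot",
    r"\wbot[\/\-]",  # generic catch-all, listed last
]

# Label shown in the report for each pattern; the link text and title are illustrative.
ROBOT_LABELS = {
    r"feedfetcher-google": '<a href="http://www.google.com/feedfetcher.html" title="Bot home page">Feedfetcher-Google</a>',
    r"mj12bot": '<a href="http://majestic12.co.uk/bot.php" title="Bot home page">MJ12bot</a>',
    r"\wbot[\/\-]": "Unknown robot (identified by 'bot/' or 'bot-')",
}


def identify_robot(user_agent: str) -> Optional[str]:
    """Return the label of the first pattern that matches, or None for non-robots."""
    for pattern in ROBOT_SEARCH_ORDER:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return ROBOT_LABELS[pattern]
    return None
```

Every user agent that is not a robot is tested against every pattern in the list before the function gives up, which is one way to see why a longer robot list means longer log processing.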

To update AWStats robot detection, just replace your existing robots.pm file with our version here.

Warning: Adding robots to your Web Analytics processing increases the time needed to process your logs; this is the price you pay for greater accuracy. Recognition of newly added robots is not retroactive: you must reprocess your logs if you want to see new robots in historical data.

To avoid false robot recognition, we often must match strings that include spaces, written in the regular expressions as \s. It seems we could use _ instead, but we could not get this to work. Microsoft's IIS replaces spaces in the User-Agent with the + character. We considered using the regular expression (\s|\+) to support both spaces and the + used by IIS, but we did not want to add the performance overhead. IIS users should therefore manually change \s to \+. This solution is less than ideal, so we may eventually investigate why AWStats did not recognize the underscore as an equivalent.
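
The difference between the three pattern styles can be illustrated with a short Python sketch. The User-Agent value and the helper names `to_iis_pattern` and `matches` are assumptions made for the example; robots.pm itself is Perl, and the textual replacement shown here is deliberately naive.

```python
import re

# Hypothetical User-Agent as logged with spaces, and the same value as IIS
# would log it (spaces replaced by '+').
UA_WITH_SPACES = "Mozilla/4.0 (compatible; Missigua Locator 1.9)"
UA_FROM_IIS = UA_WITH_SPACES.replace(" ", "+")

SPACE_PATTERN = r"missigua\slocator"      # matches logs that keep spaces
BOTH_PATTERN = r"missigua(\s|\+)locator"  # the alternative considered in the text


def to_iis_pattern(pattern: str) -> str:
    """Rewrite every \\s in a pattern as \\+ (plain textual replacement)."""
    return pattern.replace(r"\s", r"\+")


def matches(pattern: str, user_agent: str) -> bool:
    """Case-insensitive search of the pattern anywhere in the User-Agent."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


print(matches(SPACE_PATTERN, UA_FROM_IIS))                  # False
print(matches(to_iis_pattern(SPACE_PATTERN), UA_FROM_IIS))  # True
print(matches(BOTH_PATTERN, UA_WITH_SPACES))                # True
```

A converted pattern no longer matches logs that keep their spaces, so the change only makes sense when all of your logs come from IIS.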

Make a copy of your existing robots.pm, then download our updated robots.pm and save it in the AWStats lib directory. Consider backing up your AWStats statistics (intermediary) files as well; they are usually in the directory set by the AWStats DirData parameter. The library should be backward compatible.
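
The backup-then-replace step could be scripted along these lines. This is only a sketch: the function name `install_robots_pm`, the backup file naming, and the example paths are assumptions, so adjust them to your own installation.

```python
import shutil
from datetime import datetime
from pathlib import Path


def install_robots_pm(new_file: Path, lib_dir: Path) -> Path:
    """Back up lib_dir/robots.pm, then overwrite it with new_file.

    Returns the path of the backup copy so the change can be reverted.
    """
    target = lib_dir / "robots.pm"
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = lib_dir / f"robots.pm.{stamp}.bak"
    shutil.copy2(target, backup)    # keep the existing version first
    shutil.copy2(new_file, target)  # then install the downloaded one
    return backup
```

For example, `install_robots_pm(Path("robots.pm"), Path("/usr/local/awstats/wwwroot/cgi-bin/lib"))` would install a freshly downloaded file; the lib path shown is hypothetical. A copy of the DirData directory can be made in the same spirit with `shutil.copytree`.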

Note: See our updated AWStats Search Engine Database and Browsers Database and our other AWStats web analytics resources as well!

Last updated: 2006-10-15

Selected Robots Databases

Several sites regularly document known information about the various web robots, including suggestions on which robots may be worth blocking from your site as their intentions are not ethical.

Robot Detection

We have enhanced the current Robots database:

Added:

  1. Argus www.simpy.com
  2. BecomeBot link http://www.become.com/site_owners.html
  3. bender focused_crawler
  4. BlogPulse (ISSpider-3.0) intelliseek.com
  5. Blogshares Spiders (Synchronized V1.5.1)
  6. Blogslive info@blogslive.com intelliseek.com
  7. BlogsSay :: RSS Search Crawler (http://www.blogssay.com/)
  8. ConveraCrawler/0.9d ( http://www.authoritativeweb.com/crawl)
  9. dipsie (not tested with real data).
  10. DomainsDB.net http://domainsdb.net/
  11. EverbeeCrawler
  12. Feedfetcher-Google (http://www.google.com/feedfetcher.html)
  13. Gaisbot/3.0 (robot05@gais.cs.ccu.edu.tw; )
  14. geniebot (wgao@genieknows.com)
  15. Girafabot http://www.girafa.com/
  16. ia_archiver-web.archive.org (was inadvertently grouped with Alexa traffic)
  17. MJ12bot http://majestic12.co.uk/bot.php
  18. NG/1.x & 2.x. Seen from http://www.exabot.com/
  19. Nutch (used by looksmart (furl?))
  20. OpenTaggerBot (http://www.opentagger.com/opentaggerbot.htm)
  21. OutfoxBot/0.3 (For internet experiments; outfox.agent@gmail.com)
  22. PluckFeedCrawler http://www.pluck.com/
  23. Powermarks; seen used by referrer spam
  24. rssImagesBot
  25. RufusBot Rufus Web Miner http://64.124.122.252.webaroo.com/feedback.html
  26. Seekbot (http://www.seekbot.net/bot.html)
  27. Sqworm
  28. t\-h\-u\-n\-d\-e\-r\-s\-t\-o\-n\-e
  29. topicblogs http://www.topicblogs.com/
  30. w3c-checklink
  31. w3c css-validator
  32. yacy
  33. Yahoo-Blogs http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html
  34. Yahoo-MMCrawler/3.x (mms-mmcrawler-support@yahoo-inc.com)
  35. YahooSeeker
  36. YahooSeeker-Testing
  37. fixed Feedfetcher-Google (http://www.google.com/feedfetcher.html)
  38. documentation link to bot home pages for above and selected major bots.
    • In the case of international bots, choose .com page.
    • Included tool tip (html "title").
    • To do: parameterize to match both AWStats language and tooltips settings.
    • To do: add html links for all bots.
  39. changed '\wbot[\/\-]', to '\wbot[\/\-]' (removed comma)
  40. made minor grammar corrections to notes

The above changes are already included in the 2005-11-26 version of AWStats 6.5.

The following changes, posted here 2005-12-15, may not yet be in 6.5:

  1. FAST Enterprise Crawler
  2. findlinks
  3. IBM Almaden Research Center WebFountain™
  4. INFOMINE VLCrawler
  5. lmspider
  6. noxtrumbot
  7. SandCrawler (Microsoft)
  8. SBIder
  9. SeznamBot
  10. sohu-search
  11. the ruffle SemanticWeb crawler
  12. WebVulnCrawl libwww-perl
  13. Yahoo! Japan keyoshid
  14. Y!J
  15. link for GigaBot
  16. link for Magpie RSS
  17. link for MSIECrawler

2005-12-22: More additions.

  1. aipbot
  2. EARTHCOM.info
  3. Everest-Vulcan Inc
  4. Fast-Search-Engine
  5. g2Crawler
  6. HTTrack off-line browser
  7. Jakarta commons-httpclient (may be used as a robot or as a browser; a site may want to remove this entry)
  8. KummHttp
  9. OmniExplorer_Bot
  10. USTC-Semantic-Group

2006-01-13: More additions.

  1. Dulance
  2. MojeekBot
  3. nicebot
  4. Snappy
  5. sohu agent
  6. TencentTraveler
  7. VORTEX
  8. zspider
  9. boitho.com-dc
  10. IRLbot
  11. virus_detector
  12. Wavefire
  13. WebFilter Robot

2006-01-24

  1. Shim-Crawler
  2. Exabot
  3. LetsCrawl.com
  4. ichiro

2006-01-27: 22 robots from a list supplied by Moizes Gabor

  1. ALeadSoftbot
  2. CipinetBot
  3. Cuasarbot
  4. Dumbot
  5. Extreme_Picture_Finder
  6. Fooky.com/ScorpionBot/ScoutOut
  7. IlTrovatore-Setaccio
  8. InsurancoBot
  9. InternetArchive
  10. KazoomBot
  11. Kurzor
  12. NutchCVS
  13. NutchOSU-VLIB
  14. Orbiter
  15. PHP_version_tracker
  16. SuperBot
  17. SynooBot
  18. TestBot
  19. TutorGigBot
  20. UP.Browser
  21. WebIndexer
  22. WebMiner

2006-02-01: More additions. Most from a list provided by Moizes Gabor [ mojzi -a-t- free mail hu ]

  1. heritrix
  2. Zeus Webster Pro
  3. Candlelight_Favorites_Inspector
  4. DomainChecker
  5. EasyDL
  6. FavOrg
  7. Favorites_Sweeper
  8. Html_Link_Validator
  9. Internet_Ninja
  10. JRTwine_Software_Check_Favorites_Utility
  11. Microsoft_URL_Control
  12. miniRank
  13. Missigua_Locator
  14. NPBot
  15. Ocelli
  16. Onet.pl_SA
  17. proodleBot
  18. SearchGuild_DMOZ_Experiment
  19. Susie
  20. Website_Monitoring_Bot
  21. Xenu_Link_Sleuth

2006-05-15: Added more robots; made two changes:

  1. fixed Missigua Locator detection (Missigua_Locator -> Missigua Locator)
  2. changed echo to echo! to avoid a conflict with the bonecho (Firefox 2.0) browser, which needs to be added to the browsers detection file. This change requires you to reprocess historic logs if you want EchO! to be recognized in older reports.

Added:

  1. ASPseek
  2. AdamM Bot
  3. archive.org_bot
  4. arianna.libero.it (Italian Portal/search engine)
  5. Biz360 spider
  6. BlogBridge Service
  7. BlogSearch
  8. Crawl libcrawl
  9. edgeio-retriever
  10. FeedFlow
  11. Biblioteca Nazionale Centrale di Firenze (Italian National Archive)
  12. Java catchall - used by many spam bots
  13. lanshanbot
  14. msnbot-media
  15. msrabot
  16. MT::Telegraph::Agent
  17. Netluchs (German SE bot)
  18. oBot
  19. Onfolio (IE Toolbar plugin)
  20. ping.blo.gs
  21. sogou spider
  22. sogou test
  23. Sphere Scout
  24. sproose crawler
  25. SyndicAPI
  26. Vagabondo
  27. Vagabondo-WAP
  28. Yahoo! Mindset
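
The echo to echo! change noted for this release is a good illustration of why short patterns cause false positives. The sketch below is in Python rather than Perl, and the two User-Agent strings and the helper name `is_robot` are assumptions made for the example.

```python
import re

# Hypothetical User-Agent strings: a Firefox 2.0 pre-release browser and the EchO! robot.
BONECHO_UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20060601 BonEcho/2.0"
ECHO_ROBOT_UA = "EchO!/2.0"


def is_robot(pattern: str, user_agent: str) -> bool:
    """Case-insensitive search of the pattern anywhere in the User-Agent."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


print(is_robot("echo", BONECHO_UA))      # True: the browser is wrongly counted as a robot
print(is_robot("echo!", BONECHO_UA))     # False: the longer pattern leaves the browser alone
print(is_robot("echo!", ECHO_ROBOT_UA))  # True: the robot is still recognized
```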

2006-05-17

  1. Alpha Search Agent (from IP 62.152.125.60)
  2. Krugle
  3. Octora Beta Bot
  4. UbiCrawler
  5. Yahoo! Slurp China. Note: retroactive recognition of Yahoo! Slurp China requires reprocessing your old log files - regenerating the AWStats statistics intermediary files.

2006-05-20: 80+ robots, many from a list supplied by Moizes Gabor [ mojzi -a-t- free mail hu ]

  1. 1-More Scanner
  2. Accoona-AI-Agent
  3. ActiveBookmark
  4. BIGLOTRON
  5. Bookmark-Manager
  6. cbn00glebot
  7. Cerberian Drtrs
  8. CFNetwork
  9. CheckWeb link validator
  10. Computer and Automation Research Institute Crawler
  11. ConveraCrawler
  12. ConveraMultiMediaCrawler
  13. CSE HTML Validator Lite Online
  14. Cursor
  15. Custo
  16. DataFountains/DMOZ Downloader
  17. Deepindex
  18. DNSGroup
  19. DoCoMo
  20. dumm.de-Bot
  21. ETS v
  22. eventax
  23. FAST Enterprise Crawler
  24. FAST Enterprise Crawler * crawleradmin.t-info@telekom.de
  25. FAST Enterprise Crawler * T-Info_BI_cluster crawleradmin.t-info@telekom.de
  26. FeedValidator
  27. FilmkameraBot
  28. Findexa Crawler
  29. Global Fetch
  30. GoForIt.com
  31. GOFORITBOT
  32. GPU p2p crawler
  33. HooWWWer
  34. HPPrint
  35. HTMLParser
  36. Hundesuche.com-Bot
  37. InfoBot
  38. InfociousBot
  39. InternetSupervision
  40. isearch2006
  41. IUPUI_Research_Bot
  42. KalamBot
  43. kamano.de NewsFeedVerzeichnis
  44. Kevin
  45. KnowItAll
  46. Knowledge.com
  47. Kouaa Krawler
  48. ksibot
  49. Link Valet Online
  50. lwp-request
  51. lwp-trivial
  52. MapoftheInternet.com
  53. Matrix S.p.A. - FAST Enterprise Crawler
  54. Megite
  55. Metaspinner
  56. Mini-reptile
  57. Misterbot
  58. Miva
  59. Mizzu Labs
  60. MS SharePoint Portal Server - MS Search 4.0 Robot
  61. MSRBOT
  62. Mydoyouhike
  63. NASA Search
  64. NetSprint
  65. NimbleCrawler
  66. OpenWebSpider
  67. Oracle Ultra Search
  68. OSSProxy
  69. passwordmaker.org
  70. PEAR HTTP Request class
  71. PEERbot
  72. PHP version tracker
  73. PictureOfInternet
  74. plinki
  75. Port Huron Labs
  76. PostFavorites
  77. ProjectWF-java-test-crawler
  78. PyQuery
  79. Schizozilla
  80. Scumbot
  81. Sensis Web Crawler
  82. snap.com beta crawler
  83. Steeler
  84. STEROID Download
  85. Suchfin-Bot
  86. Sunrise
  87. Tagyu Agent
  88. Tcl http client package
  89. TeragramCrawlerSURF
  90. Test Crawler
  91. UnChaos Bot Hybrid Web Search Engine
  92. unido-bot
  93. UniversalFeedParser
  94. updated
  95. Vermut
  96. versus crawler from eda.baykan@epfl.ch
  97. Vespa Crawler
  98. VSE
  99. Web Downloader
  100. webcrawl.net
  101. Webdup
  102. Wells Search
  103. WordPress
  104. wume crawler
  105. xirq
  106. yoogliFetchAgent
  107. Z-Add Link Checker

Changed:

  1. changed (fix) Xenu Link Sleuth; added Xenu's Link Sleuth (with ')
  2. changed (fix) favorites\_sweeper -> favorites\ssweeper
  3. changed (fix) microsoft\_url\_control -> microsoft\surl\scontrol
  4. updated AskJeeves bot description -> Ask

2006-05-23: 10 robots

  1. DataparkSearch
  2. FurlBot/Furl Search
  3. Kyluka crawl
  4. MonkeyCrawl
  5. page_verifier
  6. SeznamTestBot
  7. Szukacz
  8. UMBC-memeta-Bot
  9. WebAlta Crawler
  10. Zhuaxia

2006-05-27: 4 robots

  1. AdsBot-Google
  2. HTTPFetcher
  3. MVAClient
  4. ISC Systems iRc Search

2006-06-13: 14 robots

  1. BeijingCrawler
  2. Crawler Mozilla
  3. Googlebot-Image
  4. Googlebot-Mobile
  5. gsa-crawler
  6. ISC Systems iRc Search
  7. LapozzBot
  8. NaverBot
  9. NextGenSearchBot
  10. Nusearch Spider
  11. psycheclone
  12. SnapBot
  13. Snoopy
  14. WebsiteWorth
  15. Yahoo-MMAudVid

2006-06-26: 7 robots

  1. AvantGo
  2. EmeraldShield.com Web Spider
  3. Forex Trading Network
  4. Honda-Search
  5. kykapeky
  6. schibstedsokbot
  7. WIRE

2006-08-25: 17 robots

  1. AIrobot
  2. BecomeJPBot
  3. ccubee
  4. Charlotte
  5. DepSpid
  6. Evaal
  7. focused_crawler
  8. Google-Sitemaps
  9. H.H.G. bot
  10. iaskspider
  11. KSE_Spider
  12. LocalcomBot
  13. MS SharePoint Portal Server - MS Search 5.0 Robot
  14. MyFamilyBot
  15. PediaSearch.com Crawler
  16. robots/1.0 (MSIE 6.0)
  17. SrevBot

2006-09-07: 6 robots

  1. TheSuBot
  2. TMCrawler
  3. gonzo1[P]
  4. BilgiBetaBot
  5. TurnitinBot
  6. SEO[.AG]

2006-10-15: 38 robots

  1. 8484 Boston Project
  2. AnswerBus
  3. China Local Browse
  4. csci b659
  5. ejupiter.com
  6. Extreme Picture Finder
  7. Factbot
  8. Favcollector
  9. gonzo2[P]
  10. HBZ-Digibib
  11. Html Link Validator
  12. HyperEstraier
  13. IEAutoDiscovery
  14. InterNetMedia.hu
  15. IntranetSearchEngine
  16. IUPUI Research Bot
  17. KakleBot
  18. LinkLint-checkonly
  19. LinkProver
  20. MFC Tear Sample
  21. moiNAG
  22. NG 3.x.
  23. NG-SearchBot
  24. RAMPyBot
  25. RPT-HTTPClient
  26. ShopWiki
  27. SquidClamAV Redirector
  28. Toutatis
  29. UnChaos
  30. Verzamelgids
  31. VIPr
  32. Watchfire WebXM
  33. WebarooBot
  34. WebCorp
  35. webGobbler
  36. West Wind Internet Protocols
  37. Wildsoft Surfer
  38. WorQmada
 

Improve the quality and accuracy of the information here by sending us feedback.

Call for translations

If you find this document useful and want to provide a translation in your native language, write us.

Let Antezeta help you in the selection, implementation and usage of a Web Analytics solution!

Contact us to find out more about this topic and the rest of the Web Ecosystem.
