How to perform competitor research using web statistics while avoiding lies, damned lies, and …statistics?
Comparison with competitors is a fundamental element of business; even innovators need to know how far ahead they are in their market. The Internet seems to offer fertile terrain for capturing accurate marketing statistics on website usage and position relative to other players in a given market. Indeed, most of us have often heard web statistics from Nielsen//NetRatings, Alexa or comScore cited in the press and elsewhere. Practitioners of Search Engine Optimization and web marketing know that web analytics is not just silo analysis of a company’s website: it also entails looking at how a website and its business performance metrics measure up in the overall web ecosystem.
Statistics reliability
So how valid are the web usage and site ranking statistics offered by web measurement companies? Despite the existence of organizations such as the Web Analytics Association (WAA) and the Interactive Advertising Bureau (IAB), the ugly answer is “We don’t know”. What we do have is a proliferation of the “mine’s bigger than yours” syndrome as one website cites its high ranking using one set of statistics and a competitor responds, citing a different, incompatible, source.
Is there a statistician in the House?
Most of us don’t have a background in statistics. We easily fall for the authority of pretty pictures and lines that go up and down. Darrell Huff covered the topic well in his 1954 classic, How to Lie with Statistics [UK edition]. That his book is still in print is a testament to the power of his message. More recently, Edward Tufte has tried to teach us the errors of our ways in his renowned The Visual Display of Quantitative Information [UK edition].
Methodology transparency
To evaluate the reliability of the Internet statistics published by the various measurement companies, we would need access to the complete methodology used to derive the final statistics. Basic elements include the sample selection, sample size, and any corrections made to offset sample bias and skew. Yet even when the major collectors of internet usage data do discuss methodology, they usually don’t get beyond sample size and a very general discussion of sample data sources. Indeed, the IAB has recently challenged two of the companies, Nielsen//NetRatings and comScore, to accept a standardized audit and accreditation process. The risk in this challenge is that processes which are fundamentally flawed in their conception may end up being accredited simply because the method has been documented.
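To make "corrections made to offset sample bias" concrete, here is a minimal Python sketch of one common technique, post-stratification weighting. The age groups, population shares and panel counts are invented for illustration, and nothing here is confirmed to be the method used by any of the companies discussed in this article.

```python
# Hypothetical figures, for illustration only.
POPULATION_SHARE = {"15-34": 0.40, "35-54": 0.40, "55+": 0.20}  # true population mix
PANEL_COUNTS = {"15-34": 700, "35-54": 250, "55+": 50}          # skewed panel
PANEL_VISITORS = {"15-34": 70, "35-54": 50, "55+": 5}           # panelists who visited a site


def post_stratification_weights(population_share, panel_counts):
    """Weight each group so the weighted panel matches the population mix."""
    total = sum(panel_counts.values())
    return {
        group: population_share[group] / (panel_counts[group] / total)
        for group in panel_counts
    }


def weighted_reach(weights, panel_counts, panel_visitors):
    """Site reach after each panelist is counted with their group's weight."""
    visitors = sum(weights[g] * panel_visitors[g] for g in panel_counts)
    members = sum(weights[g] * panel_counts[g] for g in panel_counts)
    return visitors / members


weights = post_stratification_weights(POPULATION_SHARE, PANEL_COUNTS)
raw_reach = sum(PANEL_VISITORS.values()) / sum(PANEL_COUNTS.values())
corrected_reach = weighted_reach(weights, PANEL_COUNTS, PANEL_VISITORS)

print(f"raw panel reach:       {raw_reach:.1%}")
print(f"post-stratified reach: {corrected_reach:.1%}")
```

Note what the correction cannot do: the 50 panelists aged 55+ each count four times over, so any quirk in that small group is amplified. Without knowing the weights a supplier applies, a reader cannot judge how fragile the published number is.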
Common measurement approaches
There are three principal ways to measure overall Internet usage. A panel of users can be measured at their computers with installed software (user-centric), marketers can monitor how visitors interact with a specific website (site-centric), or data can be collected directly from ISP networks (network-centric).
User-centric internet usage measurement
User-based measurement involves convincing users to install software which will track most, if not all, of their Internet usage. Browser toolbars are one of the most obvious approaches, but they have many limitations. They are limited to tracking standard website traffic; they won’t know about Skype and other non-browser-based Internet applications. Toolbar-based measurement is limited to a self-selected population which decides to install the toolbar, although there are some cases where a toolbar may be bundled with a new computer. Toolbar suppliers often exclude Firefox and other non-Internet Explorer browsers.
A second approach is to convince users to install an application which tracks both browser and other Internet application usage. Panel based measurement takes this approach. In some cases the users are well aware of their participation in a panel (and may alter their behavior accordingly); in other cases, users have chosen to install software to receive a specific benefit – the user might not be aware of the software’s true purpose.
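The effect of self-selection is easy to underestimate, so here is a small Python sketch of the arithmetic. Every number in it is invented: it assumes a population in which 2% of users are webmasters, that webmasters install a ranking toolbar far more often than everyone else, and that they visit an SEO-related site far more often too.

```python
# All figures below are invented for illustration.
WEBMASTER_SHARE = 0.02                              # share of Internet users who are webmasters
INSTALL_RATE = {"webmaster": 0.20, "other": 0.01}   # toolbar install rate per group
VISIT_RATE = {"webmaster": 0.50, "other": 0.01}     # share of each group visiting the site


def true_reach():
    """Share of all Internet users who visit the site."""
    return (WEBMASTER_SHARE * VISIT_RATE["webmaster"]
            + (1 - WEBMASTER_SHARE) * VISIT_RATE["other"])


def toolbar_reach():
    """Share of toolbar users who visit the site, i.e. what the toolbar would report."""
    installed = {
        "webmaster": WEBMASTER_SHARE * INSTALL_RATE["webmaster"],
        "other": (1 - WEBMASTER_SHARE) * INSTALL_RATE["other"],
    }
    visitors = sum(installed[group] * VISIT_RATE[group] for group in installed)
    return visitors / sum(installed.values())


print(f"true reach:    {true_reach():.1%}")
print(f"toolbar reach: {toolbar_reach():.1%}")
```

With these made-up inputs the toolbar sample overstates the site's reach more than sevenfold, simply because the people who chose to install the toolbar are not like everyone else.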
Website-centric usage measurement
Measurement at the website level involves cooperating with website owners to install a web analytics system, usually based on web server log files or on tracking code inserted in all of a site’s web pages. In Italy, Audiweb publishes website-centric data of major media companies (requires registration). Due to the need for site owners to cooperate, use of this approach is limited. Some web analytics hosting companies make limited website-centric data available as an activity secondary to their primary business.
Network-centric usage measurement
Intercepting data between users and websites at the network level potentially offers sample sizes which are much larger than traditional panels (but smaller than the data sample available through collection directly at a website).
The first consideration in evaluating the data’s potential reliability is to ask about the sampling methodology. Which ISPs have been selected? Is their typical user demographic different from those not included? Users of Telco ISPs, such as Telecom Italia’s Alice, are usually following the path of least resistance. Those who have chosen economical ISPs such as Tele2 will have a different marketing profile. At the opposite end of the spectrum are the ISPs which focus on premium high speed connections, such as Italy’s Fastweb. Do the selected ISPs include corporate traffic, or just small business and residential users?
The measurement companies
Alexa
Amazon.com’s Alexa is one of the better known suppliers of web statistics, and one of the few offering broad international data. Data is collected through the Alexa Toolbar, installed by those wishing to view site ranking data for competitor sites, rank the sites they visit and, more recently, perhaps to use its incorporated anti-phishing tools. It is not difficult to imagine that this self-selected community is composed largely of webmasters in the know looking to increase their own sites’ rankings.
Alexa’s methodology is fairly well documented, including significant disclaimers.
Although Alexa should be lauded for documenting their methodology, the disclaimers inaccurately minimize the IE/Windows-centric nature of Alexa data. While Alexa states they collect data from toolbars on Macintosh and Linux platforms, Alexa fails to note that they offer just one toolbar: for Internet Explorer on MS Windows. While there are some third party tools for Firefox, it is doubtful that they have the same uptake percentage as the official toolbar for IE. One, the “About this site” extension, would only ping Alexa’s servers upon a manual rank verification request, certainly not the same as the automated data collection offered by the IE toolbar. One semi-official solution, the A9 toolbar, has been abandoned by Amazon. As Alexa’s site and toolbar are only available in English, English speaking regions may be overrepresented.
Update: Alexa has released “Sparky”, a toolbar for Firefox, 16 July 2007.
Notable has been Alexa’s commitment to web developer APIs, allowing alternative services to provide alternative visual representations of Alexa’s base data. One example is Statsaholic, although it has been accused of scraping Alexa’s data.
Alexa does offer some pretty icons to illustrate a site’s ranking in the Alexa community – particularly useful for those who have manipulated Alexa’s rankings.
corriere.it vs repubblica.it: page views of the past 6 months.
Antezeta considerations: Alexa’s data, based on a self-selected sample, lacks scientific rigor. We wouldn’t make financial investments based on Alexa rankings. Admire the pretty graphs and move on.
Compete
Founded in 2000 by Bill Gross of Overture fame, Compete employs a browser toolbar (Internet Explorer and Firefox supported), panels and ISP data to capture two million plus community members in the US. Sites whose audience is primarily outside the US are not officially supported; limited non-US tracking appears to have begun in July 2006 as seen in a comparison of visits to Italy’s two leading newspapers:
Compete’s methodology is vaguely discussed but not publicly documented.
Antezeta considerations: Judicious leveraging of data from three different sources could potentially lead to more reliable data than that provided solely by small panels or self-selected toolbar audiences. With a US focus, Compete data is not reliable for sites whose main visitor population is outside the US.
comScore
comScore uses a panel of over 2 million participants who have been recruited by software offering other benefits such as an E-mail antivirus and free prizes. comScore’s methodology has proved very controversial, with some labeling the now discontinued MarketScore antivirus software spyware. In theory this somewhat self-selected sample (it is not clear how many participants signing up for free software really understand the true motive behind the free software) is normalized to represent overall Internet demographics, but, as in the case of the other services, we’ll have to take comScore at their word on this.
Through a Canadian subsidiary SurveySite, comScore conducts quantitative & qualitative research online.
Antezeta considerations: From a marketing demographic perspective, is an educated, savvy Internet user going to install software of dubious provenance? Are companies with IT processes in place going to allow their employees to install this software?
Quantcast
Quantcast was launched in September 2006 by “a team of engineers and scientists from NASA, Stanford and AltaVista”. It primarily collects data from ISPs and advertisers with a US focus. Site owners can tag their pages with Quantcast code for more accurate site level reporting. As with most web statistics suppliers, Quantcast produces some impressive reports, but the exact methodology used is not publicly disclosed. Quantcast does describe its overall approach as a mix of panel data sampling and pixel tracking. Unfortunately, the devil is, as always, in the details: how is the panel recruited? How representative is it of the public as a whole? Are panel members aware that their navigation will affect Quantcast’s data, and if so, does that influence their navigation? How is the data processed, and with what assumptions?
View Quantcast profiles for la repubblica and corriere della sera.
Antezeta considerations: Without full disclosure of Quantcast’s methodology, it is unclear if their results could withstand peer scrutiny. Benchmarking of sites using Quantcast tracking code should be fairly reliable assuming (big assumption here) site pages have been properly tagged.
Nielsen//NetRatings
Nielsen//NetRatings, long established as Nielsen in the audience measurement field, is perhaps one of the most cited sources of web statistics in the general press.
Nielsen captures high level internet data using sample panels selected by the random digit dialing (RDD) method. In Italy, the sample size was just increased from 5,000 to 15,000 with a stated goal of 20,000 by the end of 2007. A press release also notes rather vaguely that the RDD selection is being augmented by on-line recruiting; it is not clear as to how nor why. It is this panel data which is usually referred to in press statements on overall Internet usage trends. Update 2009-03-28: according to a December 2008 Audiweb press release, the Nielsen panel size in Italy reached about 20,000 in October 2008. Not sure what they mean by about.
Nielsen also captures web analytics data at the website level, i.e. much more accurate, for clients who install Nielsen//NetRatings’ “SiteCensus” tracking code. SiteCensus is based on the former Red Sheriff product.
Nielsen//NetRatings offer a great sourcing Guidelines document for their clients and other consumers of their data but this document seems to be focused primarily on protecting Nielsen//NetRatings’ image and revenue streams rather than clarifying the underlying methodology and statistical error. Try to find a comprehensive discussion of Nielsen//NetRatings’ methodology used to justify their statements. Does Nielsen metering software run on Macintosh and Linux computers? Or does it track just Windows users? Does RDD call both cell and landline phone numbers? How many people refuse to participate? Are they more affluent, time pressed professionals who can’t be bothered? The best you’ll find is a marketing document touting “Nielsen//NetRatings’ precise information”.
So what does precise mean? Consider the Italian panel of 15,000 members. Italy’s population is about 59,000,000 (Source: Istat). Website X is the leader in the home mortgage business, a very important sector both for home buyers and lending institutions. Website Y is a valid contender. Nielsen’s sample size is 0.025% of the entire Italian population, 0.038% of the 15-64 year olds. Yes, you understood correctly: Nielsen//NetRatings’ calculations are based on a whopping 0% of the population, after rounding. In our home mortgage example, site X has ~120,000 unique visitors every month. Site Y, the closest competitor, attracts ~75,000 unique visitors a month. So how many of these real visitors can Nielsen//NetRatings track? About 46 and 29 respectively, assuming that both the sites’ visitors and the panel members are all between 15 and 64 years old.
| Age group | Italy Population | Nielsen//NetRatings Sample Size | Sample as % of Population | Unique Monthly Visitors Site X | Site X visitors captured by Nielsen//NetRatings | Unique Monthly Visitors Site Y | Site Y visitors captured by Nielsen//NetRatings |
|---|---|---|---|---|---|---|---|
| Adult (15-64) | 39,058,000 | 15,000 | 0.038% | 120,000 | 46 | 75,000 | 29 |
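The arithmetic behind this table can be reproduced in a few lines of Python. The sketch uses only the figures quoted above; the relative error estimate at the end is our own addition and assumes the panel behaves like a simple random sample of the 15-64 population, which is an optimistic assumption.

```python
import math

ADULT_POPULATION = 39_058_000   # Italy, ages 15-64, as quoted above
PANEL_SIZE = 15_000             # Nielsen//NetRatings Italian panel, as quoted above


def expected_panel_visitors(unique_visitors, population=ADULT_POPULATION, panel=PANEL_SIZE):
    """Expected number of panel members among a site's monthly unique visitors."""
    return unique_visitors * panel / population


def relative_standard_error(expected_count):
    """Rough relative error of a small count, using a Poisson approximation (our assumption)."""
    return 1 / math.sqrt(expected_count)


for name, visitors in {"Site X": 120_000, "Site Y": 75_000}.items():
    captured = expected_panel_visitors(visitors)
    error = relative_standard_error(captured)
    print(f"{name}: ~{captured:.0f} panel members, relative standard error ~{error:.0%}")
```

Under that assumption, estimates built on 46 and 29 panel members carry a relative standard error of roughly 15% and 19% respectively, before any self-selection or weighting problems are taken into account.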
Antezeta considerations: Nielsen//NetRatings does not publicly release data for specific sites other than what appears in their press releases. Consider whether a small panel is an appropriate measurement technique for the Internet.
Hitwise
Australian based Hitwise uses ISP data to report on Internet use in its primary markets: the United States, United Kingdom, Australia, New Zealand, Hong Kong and Singapore. While the main focus appears to be on data collected from ISPs, there is also a brief mention of “opt-in” data.
See the previous discussion of network-centric data collection to understand some of the advantages and limitations inherent in using ISP data.
A number of sample reports, such as top search engines, are available in Hitwise’s data center. Hitwise does not currently have partnerships with Italian ISPs, thus Hitwise does not report on the Italian Market.
Antezeta considerations: Hitwise does not publicly release data for specific sites other than in press releases and blog entries. Evaluate Hitwise’s sampling techniques for your markets before utilizing their data.
Netcraft
Netcraft has tracked web server statistics across the internet since 1995. In December 2004, Netcraft began to track website popularity through a user installed toolbar, promoted as an anti-phishing tool. According to Netcraft, the site ranking is based on the weekly hit rate. Netcraft estimates that the toolbar usage is in the hundreds of thousands and reflects the general population of the internet.
Antezeta considerations: Data suffers all the limitations of a toolbar based self-selected sample.
Ranking.com
Ranking.com collects data via a browser toolbar which offers a “trust gauge”, site ranking information and a “browser accelerator” (search engine search box) among its features. The toolbar is only available in English for Microsoft’s Internet Explorer on Windows. The tech savvy Firefox crowd is not part of this demographic nor are Macintosh aficionados. Data methodology discussion says updates are monthly.
Antezeta considerations: Data limited by self-selection methodology and Internet Explorer only support.
Website ranking services at a glance
| Company | Since | Primary metrics (publicly available) | Sample Size (world-wide) | Sample Selection Methodology | Click data sources | Modifiable Site Profile? | Data API | Geographic Scope |
|---|---|---|---|---|---|---|---|---|
| Alexa | 1996 | Rank and Reach | “an installed base of millions of toolbars”; 180,000 (third party estimate) | Self selection | Browser Toolbar (IE / Windows only) | Yes | Yes. Fee based. | World-wide |
| Compete | 2000 | Visitors, Engagement | 2,000,000 | Self selection | Browser Toolbar | No | Yes. | US |
| comScore | 1999 | Visitors, Rank | 2,000,000 | Self selection | User installed software / spyware from opinion square, Permission Research and potentially others. | No | No | US, Canada, UK, France, Germany and unspecified others. |
| Hitwise | 1997 | Rank, market share | 25 million Internet users for the markets they measure. | Hitwise ISP agreements | ISPs | No | No | US, UK, Australia, New Zealand, Singapore |
| Netcraft | 1995 | Rank | not disclosed | Self selection | Browser Toolbar (IE, Firefox) | No | No | World-wide |
| Nielsen//NetRatings | 1997 | | not disclosed (15,000 in Italy) | Mostly RDD? | Opt-in Panels | No | No? | World-wide |
| Quantcast | 2006 | Rank | “1.5 million U.S. Internet users, working to grow that by an additional one million in the near term” ¹ | Mixed | Panels, ISPs, web site publishers | Yes: demographic profile; measurement method. | Site name and ranking for top million sites can be downloaded | US focus. International data provided for publisher submitted sites. |
| Ranking.com | 1998 | Rank, Trust Gauge, Links | 215,000 | Self selection | Browser Toolbar (IE / Windows only) | Yes | No | English centric due to English only toolbar. |

¹ E-mail from Krista Thomas, Vice President, Marketing Communications, Quantcast
This data was compiled from information on the services’ respective websites, augmented in a few cases by queries to a company. Please do contact us with updates and clarifications. Last update: April 2007.
Additional Web Statistics Data Sources
The following are additional sources of web analytics data. Bear in mind that each of these sources is probably subject to limitations based on many of the issues discussed earlier in this article.
Blog measurement
BlogBabel
BlogBabel, covering the Italian and Spanish markets, aggregates third party blog metrics to arrive at a BlogBabel ranking.
FeedBurner
FeedBurner, now a Google property, manages the RSS feeds for many blogs. RSS Feed usage statistics are available to site owners and if enabled by the site owner, via a FeedBurner API.
Technorati
Technorati’s authority ranking of a blog is classified as the number of incoming links to a blog in the last 6 months. The 100 most popular blogs are listed in order.
Click Fraud
Click fraud, also called “invalid clicks”, is a term used in pay per click marketing to refer to clicks on links intended to defraud the advertiser paying for the promotional link. One company, Click Forensics, publishes a Click Fraud Index™. As you might imagine, not everyone thinks these numbers are accurate.
Web Analytics Suppliers
By virtue of tracking most, if not all, clicks on a client’s website, suppliers of hosted web analytics systems have an excellent view of how a segment of internet sites are performing. Data sampling is limited to companies which have selected to use one of these hosting services.
Fireclick
Fireclick provides web analytics services to clients in a variety of market segments. Selected Business and Marketing Conversion metrics derived from Fireclick customer data are published on a weekly basis. Business measures include Shopping Trolley / Cart Abandonment, global conversion rates and visit information. Marketing conversion data includes e-mail, keyword and affiliate program tracking.
Shiny Stats
A Web Analytics provider based in Italy, Shiny Stats publishes a daily ranking of top sites based on visits to sites using Shiny Stats tracking. Data can be viewed by category or by a free text search on a site’s description in the Shiny Stats system.
Audiweb
Aggregate data from major portals and media properties in Italy are published by Audiweb. (In Italian. Registration required.)
Google Analytics
Google Analytics began to support benchmarking of a site’s web statistics against various industry categories in March 2008. Benchmarking data is only available for sites which have opted in to this program. (added 2008-03-06)
Showdown: How do Web Ranking Services Rank each other?
For each of the public ranking services listed in the left column, see how they rank across different services. Sort by a column to rank the rankers by a specific ranker!
Survey 2007-04-16. Compete: value for March 2007
Technical Website Performance Benchmarks
At first glance, technical benchmarks may seem to be more of an IT purview. Yet marketing professionals should be concerned as well. A slow loading website makes for a frustrating user experience. Google advises that page loading time will be considered when assigning an AdWords Quality Score to a landing page, certainly a point of acute interest to search marketing practitioners. It isn’t too hard to imagine page loading time impacting organic search results as well. Many companies provide website technical performance measurement and monitoring; the Apdex (Application Performance Index) membership is a good list. Few offer publicly available benchmark data. Section added 2008-03-23.
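For readers unfamiliar with Apdex, the index condenses a set of response times into a single score between 0 and 1. Below is a minimal Python sketch of the standard Apdex formula; the 4 second target and the sample page load times are invented for illustration and do not come from any of the suppliers listed here.

```python
def apdex(response_times, threshold):
    """Apdex score: (satisfied + tolerating / 2) / total samples.

    A sample is "satisfied" at or below the target threshold T,
    "tolerating" above T and up to 4T, and "frustrated" beyond 4T.
    """
    if not response_times:
        raise ValueError("at least one response time is required")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical page load times in seconds, scored against a 4 second target.
sample_load_times = [0.8, 1.2, 2.5, 3.9, 4.1, 9.0, 17.0, 1.0]
score = apdex(sample_load_times, threshold=4.0)
print(f"Apdex: {score:.2f}")
```

The choice of threshold drives the score, so two published Apdex values are only comparable when they use the same target time.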
AlertSite Market Index
AlertSite offers limited benchmarking data for the Computer, Financial Services, Information Services, Manufacturing, Retail and Telecommunications markets in the US.
Gomez Website Performance Benchmarks
Gomez calculates technical performance benchmarks for response time, availability, and consistency for selected websites in some of the most popular business sectors on the web. Data is available for Canada, China, Germany, the UK and the US. Gomez simulates typical business transactions, such as looking for a hotel room, from diverse geographic regions. To understand the exact steps Gomez measures for a given sector, you’ll probably need to contact Gomez.
Keynote Industry Benchmarks
Keynote provides a similar service to Gomez in the US and UK. Data limited to “top firms” (mostly national) is also available for the Benelux, French, German, Portuguese and UK markets. According to Keynote, sites selected “should be part of the index based on the following criteria: online brand awareness, third-party published traffic to the site, ability to be measured by Keynote’s measurement computers and percentage of revenues driven via the online channel.”
Conclusion
As the above overview demonstrates, there are many sources of Internet statistics. Unfortunately, most of these numbers fail to offer any degree of confidence as to their reliability – rendering them practically useless. Few of the documented methods employ scientific rigor (e.g. they exclude all Internet users who don’t use Internet Explorer, etc.); worse still, many suppliers don’t document their methods to any degree that lends itself to outside validation.
In the absence of anything better, there’s a strong temptation to think these statistics might be better than nothing. Yet, numbers derived without scientific rigor are worse than nothing as they provide a false sense of confidence in business decision making, a confidence which lacks a solid foundation based in reality.
It is our hope that industry associations like the IAB and the Web Analytics Association (WAA) will spotlight the need for modern methods and accountability in internet statistics gathering and reporting.
Feedback and Comments
We welcome your feedback on this article. If you have information to help us clarify or improve any part of it, please contact us directly. Should you require web analytics or search engine optimization consulting, please let us know.
3 responses so far ↓
1 Martin // May 18, 2009 at 15:46:12
A bit late comment, but:
Especially when the companies themself generate alot of traffic, see this post http://www.google.com/support/forum/p/Google+Analytics/thread?tid=19f9a99ce2106290&hl=en&fid=19f9a99ce210629000046a2fa9c5d87d
and this image showing the latest months of traffic http://img199.imageshack.us/img199/8459/nielsen2.png the last week it was about 6000 visits to a site with about 50000 visitors that week. That is insane..
2 Lala // Oct 13, 2009 at 11:57:29
I have found another website traffic and value estimator site. I’m talking about http://www.estimix.com. The estimation provided by estimix is the result of a complex analysis based on factors like: the age of the website, the demographic structure of the traffic, the countries where the website is popular and sources of the traffic.
3 Firewalle // Mar 2, 2010 at 10:37:24
Hi! You should also consider http://www.surcentro.com They provide a nice summary of the website performance. I trust that you’ll find this very useful cause it seems to use the Alexa traffic information quite well and provides much better traffic information.