

Creative Commons License
This work is licensed under a Creative Commons Attribution 2.5 License.

Google: which strategy? (April 2007 - January 2008)


Hosting Wikipedia may benefit Google

Context

Many webmasters (and content editors, but let's simplify) let Google programs run on their websites, triggered by every visit and obtaining a great deal of information about the visitor because they interact with his browser. Those webmasters do it because those programs give them various useful functions: Google AdSense "displays" advertising banners (the webmaster earns a fraction of the fees paid by the advertiser), YouTube provides embedded videos, Google Maps offers embedded maps, Google Analytics gives superb tools for analyzing a website's activity, numerous gadgets enhance their sites...

Those Google programs can "mark" each visiting browser with a unique cookie stored in it. Upon each browsing of such a Google-instrumented web "page", the triggered Google program can read and write cookies, then record in one of Google's databases that "the browser storing this cookie browsed this URL, at this date". This enables Google to track the visitor and to learn his browsing habits: which 'pages' did he read? When? For how long? On which links did he click? Where is the visitor (various commercial services map many IP addresses to countries/cities)? In many cases Google even has the visitor's identity, because he registered on some Google offering (thus enriching the information in Google's databases associated with his Google 'cookie'). Even without obtaining the visitor's identity, Google knows that "the browser used to view this site on this date was also used to view this other one".
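The mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not Google's actual code: a third-party "beacon" endpoint mints a unique cookie per browser and logs every page view that triggers it (all names here are invented for the example).

```python
# Hypothetical sketch of cookie-based cross-site tracking: a beacon
# endpoint marks each browser with a unique id and logs every hit.
import uuid
from datetime import datetime, timezone

VISIT_LOG = []  # stands in for "one of Google's databases"

def handle_beacon(request_cookies, page_url, visitor_ip):
    """Process one beacon hit; return (cookie_to_set, log_entry)."""
    visitor_id = request_cookies.get("uid")
    if visitor_id is None:
        visitor_id = uuid.uuid4().hex  # first visit: "mark" the browser
    entry = {
        "uid": visitor_id,   # "the browser storing this cookie..."
        "url": page_url,     # "...browsed this URL..."
        "at": datetime.now(timezone.utc).isoformat(),  # "...at this date"
        "ip": visitor_ip,    # can be mapped to a country/city
    }
    VISIT_LOG.append(entry)
    return {"uid": visitor_id}, entry

# Two hits from the same browser are linked by the same uid,
# even when the two websites are unrelated:
cookies, _ = handle_beacon({}, "http://site-a.example/page", "203.0.113.5")
handle_beacon(cookies, "http://site-b.example/other", "203.0.113.5")
```

The key point is that the linking happens server-side: neither website needs to cooperate with the other, only to embed the same beacon.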

This tracking is efficient because the typical Web user uses "his own browser" (his own set of browser configuration and data files, actually). In other words: very few people "share" a single browser setup or regularly purge the cookies stored in their browsers (it is somewhat inconvenient because, for example, it prevents access to some websites).

Many visitor-tracking programs, on many websites, do track visitors, but none seems to be as pervasive as Google's.

Google mainly earns money through its AdWords offering (selling ad banners), which uses AdSense. For now AdSense populates an ad banner with links to sites that use terms also found in the web "page" where the banner is to be displayed. The reasoning is: "if the visitor landed here, the very topic of this document is of interest to him, therefore any related advertising is pertinent".

Speculative thinking begins here!

How can Google enhance its toolchest?

Such advertising does, or at least "may", gain efficiency by selecting ads not only from the current page (or search) content but also with respect to the visitor's history: by analyzing how much time he spent on each page he visited, and in which ways any ad he clicked on differs from the other ones that were displayed... This approach (using its own scope to enhance itself) is not very efficient because it induces latencies: AdSense only discovers an interest after some time, and during this delay the visitor concludes that the banners are not very interesting and begins to ignore them.

In order to choose banners accurately, Google's advertising software may gain from quickly knowing any visitor's somewhat changing preferred topics, along with their relative importance.

Google search service
is probably their most efficient tool now, because many people use it every time they search for something, thus letting Google know what they are looking for. But it has competitors, and doing anything immediately effective against them is not easy. Google probably began to exploit it in 2007 (the Pulpit wrote: "Google Personalized Search now uses the terms from previous searches to help fine-tune the next search").
Google Mail (gmail)
by enabling Google to analyse the contents (words used in mails) and the traffic (who sends mail to whom? They probably have somewhat similar tastes...), it enhances discovery of preferred topics along with identification (Google often obtains, when the visitor registers on GMail/AdSense/AdWords/Analytics, his main email address; it then tracks pseudonyms used on the same browser thanks to cookies). But GMail is not the mail service used by most users, albeit it tries to reach that status, probably because of privacy concerns, which are more and more addressed in ways that forbid content analysis.
Thingies enhancing the Web experience, for example the Browser Toolbar
some of the functions offered, for example Web History, are so clearly able to track people that many will not use them.
Other Google services (YouTube, Orkut...)
many let the user select favorite materials or describe himself, storing data that are potential sources of intelligence... but many Web users don't use those services.

How can Google gain more intelligence, faster, on a large portion of all Web users of all types?

Wikipedia

Google may decide to analyze visitor activity on a site considered by a fair and growing fraction of Web users as a repository of useful generic information, a site where many of them immediately search for information on their topic du jour, a site where Google, for now, cannot track what they do.

The most prominent one is Wikipedia which, albeit far from perfect (beware: the linked document is in French), attracts experts (wanting to check any published information), vendors (searching for hints and promotion) and Web users (looking for information) altogether.

Google cannot hope that Wikipedia's web server administrators will let it track the visitors, but it knows that this can be done without any intrusive action by hosting the site (meaning: letting the website run on Google's resources, by providing machines, network connection and some human work), enabling Google to listen to all network traffic. Google would then be able to collect various information (visitors' IP and mail addresses, browser characteristics, Wikipedia cookies...), very useful for tracking the visitors and both "leveraged by" and "leveraging" similar information collected by other means, enabling Google, for example, to enhance visitor identification. This is of interest even if Google hosts only a small subset of Wikipedia's web servers and proxy cache servers (because of the round-robin setup), and doing it adequately can lead it to manage all of them, thus obtaining all visitor information.
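The round-robin point deserves a number: with requests dispatched cyclically over n equivalent servers, a party controlling k of them observes roughly k/n of all traffic, so even partial hosting yields a large, unbiased sample of visitors. A toy simulation (illustrative only, not Wikipedia's actual setup):

```python
# Toy round-robin dispatch: a host controlling k of n servers
# sees about k/n of all requests.
import itertools

def observed_fraction(servers, hosted, requests):
    """Simulate cyclic dispatch; return share of requests seen by `hosted`."""
    rr = itertools.cycle(servers)
    seen = sum(1 for _ in range(requests) if next(rr) in hosted)
    return seen / requests

servers = [f"srv{i}" for i in range(10)]       # 10 equivalent servers
hosted = {"srv0", "srv1", "srv2"}              # 3 of them "hosted"
frac = observed_fraction(servers, hosted, 10_000)
# frac is exactly 0.3 here, since dispatch is strictly cyclic
```

Real load balancing is rarely this clean (DNS caching, sticky sessions), but the order of magnitude holds: hosting 30% of the servers means sampling roughly 30% of the audience.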

The Google indexer (search service) already drives a fair part (probably at least 30%) of Wikipedia's incoming traffic (this leadership was clear in 2002 and there is no reason to think that this trend ended), but it simply cannot gather intelligence on what a visitor does as soon as he interacts with Wikipedia's interface. Worse: Google cannot even know on which article a visitor lands:

Google cannot hope to track this through AdSense because most contributors don't want advertising. Analytics could do the job, but Wikipedia admins don't seem to be interested.

By managing to host Wikipedia, Google will gain a good insight (first-hand and in real time) into many visitors' topics of interest, because Google will know who reads or writes articles published by this encyclopedia. Therefore the company will gain a competitive advantage by more accurately selecting AdSense and AdWords advertisements. By learning that you browse Wikipedia articles about "solar cells", for example, Google will immediately infer that you may be in the process of collecting information before buying some.

For example: some categories used in Wikipedia articles reflect families of commercial products. Therefore the subset of categories shared by the articles successively read by a given visitor in a short period will help to target ads towards products related to his current need.
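The targeting idea above reduces to counting category overlaps across a session. A minimal sketch, with made-up article/category data (the function and data names are illustrative, not any real API):

```python
# Rank the categories shared by the articles a visitor read in one
# session; the dominant ones hint at his current (commercial) interest.
from collections import Counter

def dominant_categories(session_articles, article_categories, top=2):
    """Return the `top` categories most frequent across the session."""
    counts = Counter()
    for article in session_articles:
        counts.update(article_categories.get(article, ()))
    return [cat for cat, _ in counts.most_common(top)]

# Invented example data: three articles read in a short period.
categories = {
    "Solar cell": {"Solar energy", "Energy conversion"},
    "Photovoltaics": {"Solar energy", "Electricity generation"},
    "Solar panel": {"Solar energy", "Energy conversion"},
}
session = ["Solar cell", "Photovoltaics", "Solar panel"]
# "Solar energy" appears in all three articles, so it ranks first,
# and ads about solar products would be selected.
top_topics = dominant_categories(session, categories)
```

Crude as it is, this already captures the advantage the essay describes: the signal is available the moment the session happens, with none of the latency of click-feedback learning.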

In a not-so-distant future the very act of qualifying what is of interest to a given visitor may be a major asset, especially if he is also identified. Efficient ways to circumvent such maneuvers, for example using Tor, will then gain momentum. By hosting Wikipedia Google will almost immediately gain a way to do such magic, for a huge and growing proportion of Web users. It will moreover be considered by many as a community-friendly move, ensuring a comfortable marketing flash.

On the technical side of things, hosting will also enable Google to index Wikipedia better and faster, to easily consolidate various sources of tracking information (as far as I know, many user accounts on Wikipedia are created by giving a GMail address), to attract more incoming traffic to its network (think peering and projects directed towards the client machine...), to peek into the effect (on visitors' behavior) of some semantic Web, and to spare the resources it currently uses to cache Wikipedia...

Google may offer to host Wikipedia, giving a real gift and simultaneously earning from it.

Wikipedia needs hosting

Surprise: Wikipedia needs hosting and software development (or money to pay for them). The service is sometimes slow (a message such as "Wikipedia has a problem ((...)) All servers busy" replaces the content) for various reasons, therefore new machines may be useful (approximately 300K US dollars were spent in Q1 2007, 450K USD in 2006, 310K in 2005). Many functional improvements to the software are postponed (or not even considered) because the developers do not have enough time: the roadmap shows, through the milestones, fewer and fewer functional improvements and more and more performance-related ones. This is consistent with their declarations (October 2005) to SourceForge: "We've sometimes had to abandon features that couldn't scale" / "Our main concentration tends to be on performance"...

Google offered to host the project (without any modification or advertising insertion!) just after a major hiccup, in 2004: its founders met Jimbo Wales, Wikipedia's co-founder. The news at the time read, slightly edited (spelling):


Jimmy Wales --Wikipedia founder-- met with Sergey Brin and Larry Page, who were extremely enthusiastic about the whole project. Google plans to donate a certain number of 'Dual Xeon' servers at one or more of their data centers, with unlimited bandwidth. In a few days the Wikimedia Foundation Board is to discuss --via IRC-- the agreement with Google, who aren't to ask anything in return (such as ads on Wikipedia pages).

Wikipedia needs resources and Google wants to offer them... but the deal languished until 2005 and is still not closed (May 2007), more than 3 years after the first official meeting. Yahoo and Kennisnet offered partial hosting (in Korea and the Netherlands respectively), which is in operation.

Jimbo Wales probably knows that a Google deal may critically hurt some of his projects: millions of dollars to establish a rival to Google by applying the Open Source and transparency ideals of Wikipedia to a search project, using Grub (which in fact seems to be some Nutch hack(?)), a Google clone in its fundamentals which harnesses some resources of volunteers' own machines. If this project succeeds it will create a competitor for the search service proposed by Google.

Wales' plan may be to let Google think of Wikia as a target, as a community and technology potentially dangerous and in any case worth acquiring. He may work to discourage the Wikimedia Foundation from accepting Google's hosting, with persuasive arguments: linking the future of Wikipedia to a commercial entity is dangerous because it merges their fates and the shareholders can go berserk; moreover, no one can hope to escape from such a hosting solution many years after having delegated it...

Given the state of Wikipedia's budget (expenses soon to reach 3.3 million USD per year against revenues of 1.6 million USD in 2006), however, one may think that such a deal will be hard to dodge.

Google's answer to Wales is surprising: Google search results now often begin with a link to a Wikipedia article. This is a way to make Wikipedia users prefer, for their usual quick searches, Google search over Wikipedia's integrated search, because most will find that a Google search offers a richer set of results while allowing quick access to Wikipedia. This lets Google somewhat track Wikipedia users' behavior while raising its relative importance as a referrer to Wikipedia.

Google announced (December 2007) a Wikipedia competitor dubbed Knol. If Google cannot host Wikipedia, let's bet that the links to Wikipedia provided in most Google-search results will be replaced by links to Knol.

Potential side-effect

Many websites publishing content pertinent to Wikipedia will see it copied there (especially free content). Their audience will then decline because Google, preferring Wikipedia, will lead fewer and fewer visitors to them.

Summary

Google's position is secure.

Conclusion

Google must try harder to host Wikipedia. Yahoo may make a move, but I bet that Google will win the case.

Notes

Google uses data collected during searches.

Google is in the process of consolidating (Single Sign On) the mess of user accounts historically created by each of its services, thus simplifying the tracking process.

'Google Analytics' even enables Google to know what a given tracked visitor publishes on the Web.

Cory Doctorow wrote a piece about "what if Google were evil?": Scroogled.

Ack

Thanks to Muriel, technocrat.net and especially to S. Blondeel (Sbi).
