Thread ID: 7279 | Posts: 1 | Started: 2003-06-11
2003-06-11 21:32
[url=http://www.google-watch.org/broken.html]http://www.google-watch.org/broken.html[/url]
Is Google broken?
It's behaving as if it has a Y2K+3 problem
June 9, 2003
For the last 30 days, Google has been doing strange things. No webmaster who follows Google closely will deny this. There is no explanation from Google apart from some vague hints from "GoogleGuy," an anonymous poster at webmasterworld.com, whom the forum owner says is from Google. [Google News Forum: [url=http://www.webmasterworld.com/forum3/]http://www.webmasterworld.com/forum3/[/url]] These hints claim that new algorithms are being put into place, and that this will take a couple of months.
Another possible explanation, entirely speculative, is quite intriguing and fits the facts better than GoogleGuy's vague assertions of an algorithm change. This explanation was obliquely denied by GoogleGuy, at which point the pro-Google forum owner, Brett Tabke, moved the thread off the active list. The next day the thread was locked against further postings. Since GoogleGuy has been known to make strong denials on other matters, this amounts to a nondenial.
We'll call this intriguing explanation the "Y2K+3 problem." It's all the more intriguing because Mr. Tabke, an able programmer, made a misleading posting to the effect that adding another byte to the webpage ID number to fix an integer overflow problem was "child's play," and called the thread "bogus." The behavior of GoogleGuy and his crony Mr. Tabke looks like a cover-up. Hopefully this Google Watch summary page will encourage more discussion of what's going on at Google, even if webmasterworld.com won't.
First, an explanation of what's different at Google over the last 30 days. This sort of behavior is entirely unprecedented in the 2.5 years we've been following Google:
There were some strange results observed when the previous update kicked in on April 11, but webmasters became universally concerned in May 2003. It took about two weeks for the May update to propagate to the nine data centers; normally it takes around five days.
The May data showed half the number of backlinks that webmasters were accustomed to seeing on their sites.
The May data was from a previous update in February or March, and not from the usual deep crawl that occurred in mid-April. It was as if the April deep crawl had been thrown out. GoogleGuy essentially confirmed that this was the case. Pages that had been created since February or March were, as often as not, missing.
Slowly, the freshbot began to add more recent pages once the May data settled. But as is typical with freshbot data, these pages persisted in the index for two or three days, and then dropped out of the index. Then they'd pop up again a few days later. This yo-yo effect got the attention of a lot of webmasters who weren't used to this sort of instability. Freshbot normally handles fresh pages this way, but all of a sudden, the definition of "fresh" covered about a three-month span. Usually the freshbot has a dramatic effect only on pages that are new since the last deep crawl.
The PageRank of these "fresh" pages is indeterminate. It takes a deep crawl plus several days of calculation to compute the PageRank for the entire web (see the sketch below). Without the April deep crawl data, the PageRanks that appeared were from February or March. Any "fresh" pages since then showed no PageRank at all. Normally the toolbar would approximate the PageRank for fresh pages, based on the PageRank of the home page, but it stopped doing this as well. All it showed was an all-white or gray bar. This too attracted a lot of attention from webmasters. The more crawling the freshbot did, particularly on home pages for dynamic sites such as blogs, the more webmasters noticed that their PageRank was no longer showing on the toolbar. It is still unclear whether the real PageRank (as opposed to what the toolbar shows) is also indeterminate. The only way to judge this is by the rankings for selected keywords. Some webmasters reported massive fluctuations in rankings, suggesting that the PageRank component of the ranking algorithm was indeed broken. This was in addition to the fact that the pages would drop in and out of the index.
The deep bot has not appeared since the end of April. It is unprecedented for Google's deep bot to be AWOL for six weeks running.
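Why does PageRank go stale without a deep crawl? Because it is computed over the entire link graph at once. Here is a minimal C sketch of the power-iteration method described in Brin and Page's paper, run on an invented four-page graph. The damping factor of 0.85 comes from the paper; the graph, names, and iteration count are illustrative assumptions, not Google's code.
[code]
/* Minimal sketch of PageRank power iteration on an invented four-page
   link graph.  The damping factor 0.85 comes from the Brin/Page paper;
   the graph and iteration count are assumptions for illustration. */
#include <stdio.h>

#define N          4
#define DAMPING    0.85
#define ITERATIONS 50

int main(void)
{
    /* links[i][j] = 1 if page i links to page j */
    int links[N][N] = {
        {0, 1, 1, 0},   /* page 0 links to pages 1 and 2 */
        {0, 0, 1, 0},   /* page 1 links to page 2        */
        {1, 0, 0, 1},   /* page 2 links to pages 0 and 3 */
        {1, 0, 0, 0},   /* page 3 links to page 0        */
    };
    double rank[N], next[N];
    int i, j, iter;

    for (i = 0; i < N; i++)
        rank[i] = 1.0 / N;                    /* start uniform */

    for (iter = 0; iter < ITERATIONS; iter++) {
        for (j = 0; j < N; j++)
            next[j] = (1.0 - DAMPING) / N;    /* the "random jump" share */
        for (i = 0; i < N; i++) {
            int out = 0;
            for (j = 0; j < N; j++)
                out += links[i][j];           /* outdegree of page i */
            for (j = 0; j < N; j++)
                if (out && links[i][j])
                    next[j] += DAMPING * rank[i] / out;
        }
        for (j = 0; j < N; j++)
            rank[j] = next[j];
    }

    for (i = 0; i < N; i++)
        printf("page %d: PageRank %.4f\n", i, rank[i]);
    return 0;
}
[/code]
Each pass reads the current rank of every page that links in, so the calculation only makes sense over a complete crawl. There is no cheap way to score just the fresh pages, which is consistent with the gray toolbars described above.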
The Y2K+3 theory
Let's speculate. Most of Google's core software was written in 1998-2000. It was written in C and C++ to run under Linux. As of July 2000, Google was claiming one billion web pages indexed. By November 2002, they were claiming 3 billion. That works out to roughly 70 million new pages a month, which would put them at about 3.5 billion today, even though the count hasn't changed on their home page since November. If you search for the word "the" you get a count of 3.76 billion. It's unclear what role other languages would have, if any, in producing this count. Perhaps each language has its own lexicon and its own web page IDs. But any way you cut it, we're approaching 4 billion very soon, at least for English. With some numbers presumably set aside for the freshbot, it would appear that they are running out of available web page IDs.
If you use an ID number to identify each new page on the web, there is a problem once you get to 4.2 billion. Numbers higher than that require more processing power and different coding. Our speculation makes three major assumptions: a) Google uses standard functions for the C language in their core programming; b) when Google's programs were first developed four or more years ago, a unique ID was required for every web page; and c) it seemed reasonable and efficient at that time to use an unsigned long integer in ANSI C. In Linux, this variable is four bytes long, and has a maximum of 4.2 billion before it rolls over to zero. The next step up in numeric variables under Linux requires different standard functions in ANSI C, and more CPU cycles for processing. When the core programs were developed for Google several years ago, it's reasonable to assume that the 4.2 billion upper limit was not seen as a potential problem.
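The rollover itself is easy to demonstrate. This is a property of the four-byte unsigned type under discussion, not a claim about Google's actual code:
[code]
/* Demonstrates the wraparound of a four-byte unsigned integer, the
   behavior the Y2K+3 theory attributes to Google's docID counter. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t docid = 4294967295u;   /* 2^32 - 1: the last four-byte ID */
    printf("last docID: %lu\n", (unsigned long)docid);
    docid++;                        /* unsigned overflow is defined: it wraps */
    printf("next docID: %lu\n", (unsigned long)docid);  /* prints 0 */
    return 0;
}
[/code]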
It's also possible that the freshbot, which began work in August 2001, was assigned its own block of unused ID numbers from the pool of 4.2 billion. This would allow for easier integration of the freshbot data with the deep crawl data, but would also suggest a certain amount of inefficiency in the use of available numbers. Indeed, freshbot behaves as if it can only hold a certain number of total pages, at which point it drops the oldest page in its round-robin queue and overwrites it with a newer page from a different site. This peculiar habit could be another symptom of the limited pool of ID numbers available.
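The round-robin overwrite described above is the outward behavior of a fixed-capacity ring buffer. A minimal sketch, with the capacity and names invented for illustration (this is not freshbot's code):
[code]
/* Sketch of a fixed-capacity round-robin page store: once full, the
   oldest entry is overwritten.  Capacity and names are invented. */
#include <stdio.h>

#define CAPACITY 4   /* tiny on purpose; the theory says the pool is fixed */

struct page {
    unsigned id;      /* docID drawn from the reserved block */
    const char *url;
};

static struct page ring[CAPACITY];
static unsigned next_slot = 0;    /* advances forever; wraps via modulo */

void store_page(unsigned id, const char *url)
{
    ring[next_slot % CAPACITY].id  = id;   /* overwrites the oldest entry */
    ring[next_slot % CAPACITY].url = url;
    next_slot++;
}

int main(void)
{
    unsigned i;
    /* the fifth page evicts the first: the yo-yo effect webmasters saw */
    store_page(1, "site-a/new");
    store_page(2, "site-b/new");
    store_page(3, "site-c/new");
    store_page(4, "site-d/new");
    store_page(5, "site-e/new");

    for (i = 0; i < CAPACITY; i++)
        printf("slot %u: docID %u (%s)\n", i, ring[i].id, ring[i].url);
    return 0;
}
[/code]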
Any way you cut it, if you assume that every unique web page needs a number from the 4.2 billion pool before it can be included in Google's index, then either Google has problems right now, or it is anticipating problems in the near future and is beginning to overhaul parts of their system. This could explain the recent behavior of Google.
Despite Brett Tabke's opinion, it is not child's play to add a byte and make it a five-byte integer instead of a four-byte integer. Using an extra byte basically means that you use an unsigned character as a multiplier for 4.2 billion, and add that onto the integer to get your final result. If the integer is A, and the multiplier is B, the new integer is A + ( B * 4.2 billion ). If B is zero, then you get A. If B is one, you get values from 4.2 billion to 8.4 billion. The point is that you need an extra multiplier variable, and extra lines of code for conversion. (You could use an eight-byte number in Linux, but this might be less efficient computationally and obviously uses more bytes than you need.) There are no standard functions in programming languages that handle five-byte integers.
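To make that concrete, here is one way the extra-multiplier scheme could look in C: a four-byte integer A plus a one-byte multiplier B, expanded to A + (B * 2^32) for arithmetic. The struct and function names are invented; this is a sketch of the scheme described above, not Google's implementation:
[code]
/* Sketch of a five-byte docID: a four-byte base (A) plus a one-byte
   multiplier (B), giving A + B * 2^32 as described in the text.
   Struct and function names are invented for illustration. */
#include <stdio.h>
#include <stdint.h>

struct docid5 {
    uint32_t low;    /* A: the original four-byte ID */
    uint8_t  high;   /* B: the extra byte, a multiplier of 2^32 */
};                   /* 5 bytes of payload (the compiler may pad) */

/* expand to a full 64-bit value for arithmetic and comparison */
static uint64_t docid_value(struct docid5 d)
{
    return (uint64_t)d.high * 4294967296ull + d.low;  /* B * 2^32 + A */
}

/* split a 64-bit value back into the five-byte form */
static struct docid5 docid_pack(uint64_t v)
{
    struct docid5 d;
    d.low  = (uint32_t)(v & 0xFFFFFFFFull);
    d.high = (uint8_t)(v >> 32);
    return d;
}

int main(void)
{
    /* an ID just past the old four-byte ceiling */
    struct docid5 d = docid_pack(4294967296ull + 12345);
    printf("B=%u A=%lu value=%llu\n",
           (unsigned)d.high, (unsigned long)d.low,
           (unsigned long long)docid_value(d));
    return 0;
}
[/code]
Every comparison, sort, and table lookup that touches a docID now has to go through the conversion, which is why this is an overhaul rather than a one-line change.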
For something as basic as a web page ID number, this implies a massive overhaul of Google's software. Meanwhile, you have 15,000 Linux boxes to upgrade. Every Linux box at Google reportedly runs a standard set of software, so that it can be interchanged with other boxes once the data is duplicated. It is safe to assume that the web page ID number is ubiquitous on all of these boxes. The upgrade has to be done in phases, otherwise all of Google goes off-line for a time. Performance degradation is tolerable, given the circumstances, but going off-line would mean an instant loss of market share. Adjustments in data capacity for the various indexes have to be made also, due to the extra byte per web page ID.
As we stated at the beginning of this page, all of this is speculative. Since Google won't tell us what's going on, we're forced to encourage wild speculation in the hope that Google will see it as in their interests to be more forthcoming about what they're up to.
Google believes that it's none of our business. But they're wrong. It is our business. We didn't ask Google to become the world's information broker; it just happened that way.
A search engine primer: What is an inverse index?
There is an amazing amount of denial from webmasters who aren't programmers, as well as some disinformation from those who know better. This is understandable. In the first instance, to say that Google needs to phase in an extra byte, and that this will be disruptive and inconvenient, is like saying that God didn't know about entropy and failed to plan for it. In the second instance, it's predictable that Google, Inc. wants to squelch rumors of a software overhaul. In the current competitive search engine environment, with Yahoo, Overture, and AskJeeves / Teoma positioned to move forward, the news that Google is making repairs is enough to forestall any plans for an immediate IPO. The first reaction is based on programming ignorance and blind pro-Google cultism. The second is driven by big money.
We cannot fight these forces, but we can explain what an inverse index is and how it works. All full-text search engines work the same way, and all computers use ones and zeros. Anyone who claims that Google's web page IDs are already plenty long has no idea how an inverse index works. One poster looked at Google's URL for its cache copies and concluded that the string of 12 alphanumeric characters, upper plus lower case, gives Google 62 to the 12th power possible web page IDs, which would leave plenty of room for expansion.
This 12-byte string may contain a web page ID within it, but it may also contain locator information for the cache copy, as well as some flags. The web page ID variable we're talking about is called the docID in "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Larry Page. Every new URL gets its own docID.
Google indexes in many languages, but let's pretend we have just one language, and in all of the 99 documents we want to index, there are only five different words we want to index. We set up an inverse index by arranging the words, and after each word, we point to all of the documents that contain that word:
A-word: 05 02 87 33
B-word: 41 33 64
C-word: 56
D-word: 21 54 87 96 82
E-word: 32 55
The numbers above point to specific documents within our set of 99. Every word on every web page is indexed, and each of these words includes a list of pointers to every document that contains that word. For example, Google shows 6 million page hits for the word "britney." If they use four bytes for the docID, that means this single word requires 24 million bytes of memory or hard disk space to hold the pointers to those 6 million pages. That's for one word. Now multiply that by all the words Google indexes in each of several dozen languages.
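Here is the five-word example as a toy C program, with each postings list holding four-byte docIDs, and the "britney" arithmetic at the end. Everything beyond the numbers quoted above is invented for illustration:
[code]
/* Toy inverse index over the five-word example above: each word owns a
   postings list of four-byte docIDs.  With 6 million postings for one
   word, the list alone costs 6,000,000 * 4 = 24 million bytes. */
#include <stdio.h>
#include <stdint.h>

struct postings {
    const char *word;
    const uint32_t *docids;   /* IDs of documents containing the word */
    size_t count;
};

static const uint32_t a_docs[] = { 5, 2, 87, 33 };
static const uint32_t b_docs[] = { 41, 33, 64 };
static const uint32_t c_docs[] = { 56 };
static const uint32_t d_docs[] = { 21, 54, 87, 96, 82 };
static const uint32_t e_docs[] = { 32, 55 };

static const struct postings inv_index[] = {
    { "A-word", a_docs, 4 },
    { "B-word", b_docs, 3 },
    { "C-word", c_docs, 1 },
    { "D-word", d_docs, 5 },
    { "E-word", e_docs, 2 },
};

int main(void)
{
    size_t i, j;
    for (i = 0; i < sizeof inv_index / sizeof inv_index[0]; i++) {
        printf("%s (%lu bytes of docIDs):", inv_index[i].word,
               (unsigned long)(inv_index[i].count * sizeof(uint32_t)));
        for (j = 0; j < inv_index[i].count; j++)
            printf(" %lu", (unsigned long)inv_index[i].docids[j]);
        printf("\n");
    }
    /* one popular word, at scale */
    printf("\"britney\": %lu postings x 4 bytes = %lu bytes\n",
           6000000ul, 6000000ul * 4ul);
    return 0;
}
[/code]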
You can see why the docID has to be as tiny as possible. Using four bytes, which means 32 bits consisting of either a one or a zero, you get a maximum of 4.2 billion possible numbers. It is so important to use as few bytes as possible for the docID, and the length of four bytes is so obvious for Linux in terms of efficiency, that we're certain that at some point, the length of the docID was set at four bytes. The only questions that remain are when did Google start implementing plans to expand to five bytes, and how disruptive will this conversion be if, as we presume, it hasn't already happened?