Web's Vastness Foils Even Best Search Engines
According to a paper published in the April 3rd issue of Science, even the most thorough search engine finds only about a third of the pages on the Web. The major search engine companies have often claimed that they can keep up with the size of the Web, that is, that they can continue to index close to the entire Web as it grows. But the Web is a distributed, dynamic, and rapidly growing information resource, which presents difficulties for information retrieval technologies. The nature of the Web raises important questions: whether the centralized architecture of the search engines can keep up with the expanding number of documents, and whether the engines can update their databases regularly enough to detect modified, deleted, and relocated information.
Currently available estimates of the size of the Web vary significantly. The lowest estimate the paper's authors cite is 75 million pages. The figure the authors use is 320 million indexable pages, as of December 1997, calculated from the overlap between the two largest search engines, HotBot and AltaVista. [Editor's note: America Online makes software and storage available to its 11 million account holders that enable them to construct up to five Web pages per account. Other Internet service providers offer similar features.]
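The article does not spell out how the overlap between two engines yields a size estimate. One common way to make such an estimate is the capture-recapture (Lincoln-Petersen) formula, sketched below. This is an illustration under that assumption, not the authors' actual procedure; the function name and the sample numbers are hypothetical and were chosen only to keep the arithmetic easy to follow.

```python
def estimate_total_pages(indexed_a: int, indexed_b: int, in_both: int) -> float:
    """Lincoln-Petersen estimate of a population from two overlapping samples.

    If engine A indexes `indexed_a` pages, engine B indexes `indexed_b` pages,
    and `in_both` pages appear in both indexes, then (assuming the two indexes
    are independent samples of the Web) the total number of pages is roughly
    indexed_a * indexed_b / in_both.
    """
    if in_both <= 0:
        raise ValueError("the two indexes must share at least one page")
    return indexed_a * indexed_b / in_both


# Hypothetical numbers, NOT taken from the study.
print(estimate_total_pages(100, 80, 40))  # prints 200.0
```

The formula depends on the independence assumption. If both engines tend to index the same popular pages, the overlap is inflated and the estimate comes out too low.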
The paper's authors produced statistics about the coverage of major Web search engines, the estimated size of the Web, and the recency of the search engine databases. The authors conducted their Web queries between December 15th and 17th, 1997. The search engines included HotBot, AltaVista, Northern Light, Excite, Infoseek, and Lycos.
The authors analyzed 575 queries that satisfied their search constraints. Using the estimate that the indexable Web contains 320 million pages, the share of the indexable Web covered by each engine was: HotBot, 34%; AltaVista, 28%; Northern Light, 20%; Excite, 14%; Infoseek, 10%; and Lycos, 3%. The percentages of invalid links ranged from 1.6% to 5.3%. Using two or more search engines significantly increased coverage. Combining the six engines in the study retrieved about 3.5 times as much from the Web as a single engine. The study did not investigate the timeliness and validity of the information and data contained in Web pages.
The study's authors were Steve Lawrence and C. Lee Giles, of the NEC Research Institute in Princeton, New Jersey. The study appears on page 98. A related article by Thomas E. Weber appears in the April 3rd issue of The Wall Street Journal.
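To put the reported coverage percentages in absolute terms, the short sketch below multiplies each one by the 320 million page estimate. The percentages and the total come from the article; the variable and function names are hypothetical, and the resulting page counts are simple arithmetic, not figures reported by the study.

```python
# Estimate of the indexable Web quoted in the article (December 1997).
INDEXABLE_WEB_PAGES = 320_000_000

# Coverage percentages as reported in the article.
COVERAGE_PERCENT = {
    "HotBot": 34,
    "AltaVista": 28,
    "Northern Light": 20,
    "Excite": 14,
    "Infoseek": 10,
    "Lycos": 3,
}


def implied_index_size(total_pages: int, percent: int) -> int:
    """Return the number of pages implied by a whole-number coverage percentage."""
    return total_pages * percent // 100


for engine, percent in COVERAGE_PERCENT.items():
    pages = implied_index_size(INDEXABLE_WEB_PAGES, percent)
    print(f"{engine}: about {pages / 1_000_000:.1f} million pages")
```

By this arithmetic, the widest coverage in the study, HotBot's 34%, corresponds to roughly 108.8 million pages, and Lycos's 3% to roughly 9.6 million.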