How big is the Web? How much of the Web do the search engines index? How up to date are the search engines?

Steve Lawrence and C. Lee Giles, NEC Research Institute

New 1999 study on the accessibility and distribution of information on the Web

These are some of the questions addressed by our article "Searching the World Wide Web", Science, 280, p. 98, April 3, 1998.
The answers to these questions affect the best methodology to use when searching the Web, and the future of Web search technology.
The results on this page are as of December 1997.

@Article{ lawrence98searching,
       author = "Steve Lawrence and C. Lee Giles",
       title = "Searching the World Wide Web",
       journal = "Science",
       volume = "280",
       pages = "98--100",
       year = "1998"}

Conclusions

The Web search engines are very important and useful resources, and are playing a major role in the information age. However, the search engines are currently lacking in comprehensiveness and timeliness. The current state of search engines can be compared to a phone book that is updated irregularly and has most of its pages ripped out.

320 million pages

An estimated lower bound on the size of the indexable Web is 320 million pages.

Coverage varies dramatically

The coverage of the major Web search engines investigated varies by an order of magnitude (variation will differ for different queries, e.g. more popular queries).

Only a fraction of the Web indexed

The major Web search engines index only a fraction of the total number of documents on the Web. No engine indexes more than about one third of the "publicly indexable Web".

Combining multiple engines increases coverage

Combining the results of multiple engines can significantly increase coverage. Combining the six engines in this study covers approximately 3.5 times as much of the Web as one engine on average (about twice the coverage of the largest engine).
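
As a rough illustration of the arithmetic behind this comparison, the sketch below takes the union of the pages returned by several engines and compares it with the average single-engine result. The engine names and page IDs are invented for illustration and are not data from the study.

```python
# Hypothetical result sets: each engine maps to the set of page IDs it returned.
# Engine names and IDs are invented; they are not taken from the study.
results = {
    "engine_a": {1, 2, 3, 4, 5, 6},
    "engine_b": {4, 5, 6, 7},
    "engine_c": {1, 8},
}

def combined_gain(results):
    """Return (union size, mean single-engine size, ratio of the two)."""
    union = set().union(*results.values())
    mean_single = sum(len(pages) for pages in results.values()) / len(results)
    return len(union), mean_single, len(union) / mean_single

union_size, mean_single, gain = combined_gain(results)
print(f"combined: {union_size} pages, average engine: {mean_single:.1f}, gain: {gain:.2f}x")
```

With these toy sets the combined result is twice the average engine. The gain shrinks as the overlap between engines grows, which is why the benefit of combining depends on how differently the engines index.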

Indexing patterns vary

The indexing patterns of the engines vary significantly over time, and the engine with the most recent pages may not be the most comprehensive engine.

Engines limited?

The engines may be limited by network bandwidth, disk storage, computational power, or a combination of these items (despite claims to the contrary).

Coverage

We performed a rigorous statistical study by analyzing the search engine responses for 575 queries made by employees of the NEC Research Institute. We did not simply compare the number of documents returned by each engine, because such a comparison would provide inaccurate estimates of the coverage of the engines. Instead, we downloaded and analyzed every page listed by each engine (about 150,000 pages) in order to enforce a consistent relevance measure across all engines. Otherwise, some engines return documents with related terms, or documents that no longer exist, which would make the results inaccurate.

By analyzing the overlap between engines we estimated a lower bound on the size of the "publicly indexable Web" at 320 million pages (see below for more details). The "publicly indexable Web" excludes pages typically not indexed by the major search engines, e.g. pages behind search forms or authorization requirements. The following figure shows the estimated coverage of six major Web search engines compared to the estimated size of the Web.

[Figure: estimated coverage of six major Web search engines compared to the estimated size of the Web]


For an indication of the stability of the results versus the number of queries used, see the coverage estimates versus the number of queries (a random subset of the queries was chosen for each point on the graph).
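
A minimal sketch of this kind of stability check is shown below, assuming per-query counts are available for one engine. The data, the function names, and the simplification of measuring coverage relative to the pages found by all engines combined (rather than relative to the estimated size of the Web) are assumptions for illustration, not the procedure used in the paper.

```python
import random

# Hypothetical per-query counts for one engine:
# (pages the engine returned, pages returned by all engines combined).
per_query = [(2, 10), (5, 10), (3, 10), (4, 10), (1, 10), (6, 10)]

def coverage(sample):
    """Pooled coverage over a list of (engine_pages, all_pages) pairs."""
    engine_pages = sum(e for e, _ in sample)
    all_pages = sum(t for _, t in sample)
    return engine_pages / all_pages

def coverage_vs_queries(per_query, sizes, seed=0):
    """Estimate coverage from a fresh random subset of queries for each size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return {k: coverage(rng.sample(per_query, k)) for k in sizes}

curve = coverage_vs_queries(per_query, sizes=[2, 4, 6])
for k, c in curve.items():
    print(k, round(c, 3))
```

If the estimates for larger subsets settle close to the value obtained from all queries, the number of queries is probably sufficient for a stable result.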

Coverage for what kind of queries?

The queries used in the study were from the employees of the NEC Research Institute. Most of the employees are scientists, and scientists tend to search for less "popular", or harder-to-find, information. This is beneficial when estimating the size of the Web with the technique in our paper (see below). However, the search engines are typically biased towards indexing more "popular" information, so their coverage is typically better for more popular information than the figures reported here.

Recency; freshness; invalid links; median document age

The following figure shows the percentage of invalid links for six major Web search engines. Compared with the results of similar experiments performed in August 1997, the ranking of the engines by percentage of invalid links changed significantly. Analysis of the median age of documents returned by the engines shows similar changes from the August 1997 experiments. These results suggest that the indexing patterns of the engines vary significantly over time, and that the engine with the most recent pages may not be the most comprehensive engine (one factor involved here may be a tradeoff between database size and update frequency).

[Figure: percentage of invalid links for six major Web search engines]


Estimating the size of the Web

How is the size of the Web estimated? Discrete multivariate analysis is used to generate an estimate from the overlap between the individual search engines. The estimate is biased because the engines are not independent in the pages they choose to index (e.g. people often register pages at multiple engines, and the engines are typically biased towards indexing more "popular" pages). Because this dependence is positive, the overlap between engines is larger than it would be for independent engines, which makes our estimate a lower bound. We consider this issue carefully, e.g. in terms of which engines are used for producing the estimate. The fact that the queries used were mostly from scientists, and for harder-to-find information, improves the accuracy of the estimate (if popular queries were used, the estimate would be biased lower). Studies which do not consider these issues will typically produce a lower estimate of the size of the Web.
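
The overlap idea can be sketched in its simplest two-engine form, similar to a capture-recapture estimate. This is a simplified illustration and not the discrete multivariate analysis used in the article. The function name, the toy page sets, and the index size are hypothetical, and the sketch assumes the two engines index independently.

```python
def overlap_estimate(pages_a, pages_b, index_size_a):
    """Estimate the total number of pages from the overlap of two engines.

    pages_a, pages_b: sets of page IDs returned by engines A and B
                      for the same queries (hypothetical inputs).
    index_size_a:     total number of pages engine A is known to index.
    """
    overlap = pages_a & pages_b
    if not overlap:
        raise ValueError("no overlap between the engines: estimate is undefined")
    # len(overlap) / len(pages_b) approximates the fraction of the Web that A
    # covers, assuming independence; total = index_size_a / that fraction.
    return index_size_a * len(pages_b) / len(overlap)

# Toy data: A returned pages 0..74, B returned pages 50..99, so 25 pages overlap.
pages_a = set(range(0, 75))
pages_b = set(range(50, 100))
estimate = overlap_estimate(pages_a, pages_b, index_size_a=100)
print(estimate)
```

In this toy case A holds half of the pages B returned, so A's index of 100 pages suggests a total of about 200. If the engines tend to index the same pages, the overlap is inflated and the estimate falls below the true size, which is consistent with treating the result as a lower bound.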




Other issues and future studies

There are many other issues not discussed here, for example other ways to compare search engines (e.g. relevance, query interface), and suggestions for searching the Web depending on the kind of information desired. More information can be found in the related interviews and press articles, and our tips for searching the Web. For full details of the study, request a reprint of the article.

Future studies will update and extend the results shown here. Check back for updates.



Search Engine Watch: News, tips and more about search engines, by Danny Sullivan
Copyright © 1998 NEC Research Institute