How big is the Web? How much of the Web do the search engines index? How up to date are the search engines?
These are some of the questions addressed by our article in Science: Searching the World Wide Web, 280, p. 98, April 3, 1998.
The answers to these questions affect both the best methodology for searching the Web and the future of Web search technology.
The results on this page are as of December 1997.
@Article{ lawrence98searching,
author = "Steve Lawrence and C. Lee Giles",
title = "Searching the World Wide Web",
journal = "Science",
volume = "280",
pages = "98--100",
year = "1998"}
Conclusions
The Web search engines are very important and useful resources, and
are playing a major role in the information age. However, the search
engines currently lack comprehensiveness and timeliness. The
current state of search engines can be compared to a phone book which
is updated irregularly, and has most of the pages ripped out.
320 million pages
An estimated lower bound on the size of the indexable Web is 320
million pages.
Coverage varies dramatically
The coverage of the major Web search engines investigated varies by an
order of magnitude (the variation will differ for different queries,
e.g. more popular queries).
Only a fraction of the Web indexed
The major Web search engines index only a fraction of the total number
of documents on the Web. No engine indexes more than about one third of
the "publicly indexable Web".
Combining multiple engines increases coverage
Combining the results of multiple engines can significantly increase
coverage. Combining the six engines in this study covers approximately
3.5 times as much of the Web as one engine on average (about twice the
coverage of the largest engine).
Indexing patterns vary
The indexing patterns of the engines vary significantly over time, and
the engine with the most recent pages may not be the most
comprehensive engine.
Engines limited?
The engines may be limited by network bandwidth, disk storage,
computational power, or a combination of these factors (despite claims
to the contrary).
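The conclusion about combining engines can be illustrated with a small sketch. The engine names, page IDs, and universe size below are hypothetical toy data, not figures from the study; the point is only to show how combined coverage is the coverage of the union of the engines' indexes, and how it compares with the average and the largest single engine.

```python
def coverage(indexed, universe):
    """Fraction of the pages in `universe` that appear in `indexed`."""
    return len(indexed & universe) / len(universe)

# Hypothetical toy data: each engine's set of indexed page IDs.
engines = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {4, 5, 6, 7},
    "C": {1, 8},
}
universe = set(range(1, 13))  # 12 pages in this toy "Web"

per_engine = {name: coverage(pages, universe) for name, pages in engines.items()}
combined = coverage(set().union(*engines.values()), universe)
average = sum(per_engine.values()) / len(per_engine)
largest = max(per_engine.values())

print(f"combined / average engine: {combined / average:.2f}")
print(f"combined / largest engine: {combined / largest:.2f}")
```

Because the engines overlap, the combined coverage is smaller than the sum of the individual coverages, but it can still be a multiple of what a typical single engine provides.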
Coverage
We performed a rigorous statistical study by analyzing the search
engine responses to 575 queries made by employees of the NEC Research
Institute. We did not simply compare the number of documents returned
by each engine, because results from such a comparison would give
inaccurate estimates of the coverage of the engines. Instead, we
downloaded and analyzed every page that each engine listed (about
150,000 pages), in order to enforce a consistent relevance measure
across all engines. Otherwise, some engines return documents that
contain only related terms or that no longer exist, which would make
the results inaccurate.
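As a rough illustration of what a consistent relevance measure might look like, the sketch below keeps a result only if the page can still be retrieved and its text contains every query term. The exact criterion used in the study is not given on this page, so the all-terms rule, the function names, and the `fetch` callback are assumptions made for illustration.

```python
import re

def contains_all_terms(text, query):
    """True if every whitespace-separated query term occurs as a word in text."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(term in words for term in query.lower().split())

def verified_results(query, urls, fetch):
    """Keep only URLs whose page still exists and contains every query term.

    `fetch(url)` is a hypothetical callback returning the page text,
    or None when the page can no longer be retrieved.
    """
    kept = set()
    for url in urls:
        text = fetch(url)
        if text is not None and contains_all_terms(text, query):
            kept.add(url)
    return kept

# Toy stand-in for downloading pages (no network access needed).
pages = {
    "u1": "Neural network training tips",
    "u2": "Networks of neurons",   # related terms only
    "u3": None,                    # page no longer exists
}
kept = verified_results("neural network", ["u1", "u2", "u3", "u4"], pages.get)
print(kept)
```

Applying the same filter to every engine's results means that an engine gains nothing from listing dead links or loosely related documents.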
By analyzing the overlap between engines we estimated a lower bound on
the size of the "publicly indexable Web" at 320 million pages (see
below for more details). The "publicly indexable Web" excludes pages
typically not indexed by the major search engines, e.g. pages behind
search forms or authorization requirements. The following figure shows
the estimated coverage of six major Web search engines compared to the
estimated size of the Web.
For an
indication of the stability of the results versus the number of
queries used, see the coverage
estimates versus the number of queries (a random subset of the
queries was chosen for each point on the graph).
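The stability check described above can be sketched as follows. This is a minimal sketch under stated assumptions: coverage is measured relative to the pooled verified pages from all engines, the data is toy data, and the function names are hypothetical. A random subset of the queries is drawn for each subset size, and the coverage estimate is recomputed on that subset.

```python
import random

def coverage_estimate(engine_hits, all_hits, queries):
    """Fraction of the pooled verified pages (over `queries`) found by one engine."""
    found = sum(len(engine_hits[q] & all_hits[q]) for q in queries)
    total = sum(len(all_hits[q]) for q in queries)
    return found / total if total else 0.0

def coverage_vs_queries(engine_hits, all_hits, sizes, seed=0):
    """Coverage estimate for a random subset of queries of each given size."""
    rng = random.Random(seed)
    queries = sorted(all_hits)
    return {
        n: coverage_estimate(engine_hits, all_hits, rng.sample(queries, n))
        for n in sizes
    }

# Toy data: 20 queries with 10 verified pages each; the engine finds 1 to 5 of them.
all_hits = {f"q{i}": set(range(10)) for i in range(20)}
engine_hits = {f"q{i}": set(range(i % 5 + 1)) for i in range(20)}

curve = coverage_vs_queries(engine_hits, all_hits, sizes=[2, 5, 10, 20])
for n, c in curve.items():
    print(n, round(c, 3))
```

If the estimates settle down as the number of queries grows, that suggests the full query set is large enough for the coverage figures to be stable.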
Coverage for what kind of queries?
It is important to note that the queries used in the study came from
employees of the NEC Research Institute. Most of the employees are
scientists, and scientists tend to search for less "popular", or
harder-to-find, information. This is beneficial when estimating the
size of the Web with the technique in our paper (see below). However,
the search engines are typically biased towards indexing more
"popular" information. Therefore, the coverage of the search engines is
typically better for more popular information.
Recency; freshness; invalid links; median document age
The following figure shows the percentage of invalid links for six
major Web search engines. When comparing these results with the
results of similar experiments performed in August 1997, the ranking
of the engines in terms of the percentage of invalid links changed
significantly. Analysis of the median age of documents returned by the
engines shows similar changes from the experiments performed in August
1997. These results suggest that the indexing patterns of the engines
vary significantly over time, and that the engine with the most
recent pages may not be the most comprehensive engine (one factor
involved here may be a tradeoff between the database size and update
frequency).
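The two freshness measures discussed above, the percentage of invalid links and the median document age, can be computed as in the following sketch. The engine names and per-result records are hypothetical toy data, and measuring age in days is an assumption.

```python
from statistics import median

def percent_invalid(results):
    """Percentage of results whose link could not be retrieved."""
    invalid = sum(1 for valid, _age in results if not valid)
    return 100.0 * invalid / len(results)

def median_age_days(results):
    """Median age in days over the results that are still valid."""
    return median(age for valid, age in results if valid)

# Hypothetical toy data: (link_is_valid, document_age_in_days or None).
engine_results = {
    "A": [(True, 10), (True, 40), (False, None), (True, 100)],
    "B": [(True, 5), (False, None), (False, None), (True, 15)],
}

summary = {
    name: (percent_invalid(res), median_age_days(res))
    for name, res in engine_results.items()
}
for name, (bad, age) in summary.items():
    print(f"{name}: {bad:.0f}% invalid links, median age {age} days")
```

In this toy data engine B returns the more recent documents but also the higher share of invalid links, which shows that the two measures need not rank engines the same way.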
Estimating the size of the Web
How is the size of the Web estimated? Discrete multivariate analysis
is used to generate an estimate from the overlap between the
individual search engines. The estimate is biased because the engines
are not independent in the pages they choose to index (e.g. people
often register pages at multiple engines, and the engines are
typically biased towards indexing more "popular" pages). This
dependence inflates the overlap between engines relative to what
independent indexing would produce, which makes our estimate a lower
bound. We consider this issue carefully, e.g. in terms of which
engines are used for producing the estimate. The fact that the queries
used were mostly from scientists, and for harder-to-find information,
improves the accuracy of the estimate (if popular queries were used,
the estimate would be biased lower). Studies that do not consider
these issues will typically produce a lower estimate of the size of
the Web.
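To see why dependence between engines pushes an overlap-based estimate downwards, consider the simplest two-engine overlap (capture-recapture) estimator, N ≈ |A| · |B| / |A ∩ B|. This is a simplified stand-in chosen for illustration, not necessarily the exact procedure used in the paper, and the toy indexes below are hypothetical.

```python
def overlap_estimate(a, b):
    """Two-engine overlap estimate of the total number of pages.

    Assumes the engines index pages independently; positive dependence
    enlarges the overlap and therefore lowers the estimate.
    """
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap between engines; estimate is undefined")
    return len(a) * len(b) / overlap

TRUE_SIZE = 1000  # size of the toy "Web" (page IDs 0..999)

# Independent-looking indexes: membership in one says nothing about the other.
a_indep = {i for i in range(TRUE_SIZE) if i % 2 == 0}   # 500 pages
b_indep = {i for i in range(TRUE_SIZE) if i % 5 == 0}   # 200 pages

# Dependent indexes: both engines favour the same "popular" pages (low IDs).
a_dep = set(range(500))
b_dep = set(range(200))

independent = overlap_estimate(a_indep, b_indep)
dependent = overlap_estimate(a_dep, b_dep)
print(independent)  # matches TRUE_SIZE in this constructed case
print(dependent)    # underestimates TRUE_SIZE
```

Both pairs of engines index 500 and 200 pages respectively, yet the dependent pair yields an estimate of only half the true size, which is the sense in which the overlap estimate acts as a lower bound.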
Other issues and future studies
There are many other issues not discussed here, for example other ways
to compare search engines (e.g. relevance, query interface), and
suggestions for searching the Web depending on the kind of information
desired. More information can be found in the related interviews and press articles, and our tips for searching the Web. For full details of the study, request a
reprint of the article.
Future studies will update and extend the results shown here. Check
back for updates.
Search Engine Watch: News, tips and more about search engines, by Danny Sullivan
Copyright © 1998 NEC Research Institute