Search engines have been around since the nascent beginnings of the web in 1994. Many people think of Yahoo! as the first search engine, but Yahoo! was in fact a directory first and only later began searching the web. The difference between a search engine and a directory is the method by which information is stored and retrieved. In a directory like Yahoo!, DMOZ, or SOCEngine, web sites are submitted and added to specific categories, with review by a human being to help categorize them and weed out the sites that don't belong. In a search engine, web pages are crawled automatically by a search engine spider - a computerized bot that indexes and catalogues pages on its own. The search engine keeps a cached copy of each page in its index and retrieves that information when deciding whether to display a link to the page for a searching user.
Let's examine the primary functions of a search engine:
Spider pages on the web by visiting a page, then following all of the links on that page to reach the next set of pages (a process repeated continuously).
Cache/index each spidered page in an enormous database of web pages that is easily and quickly accessible for search.
Create an algorithm that ranks web search results, ordering them from most relevant to least relevant.
Return results to users based on the search queries they enter.
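The spider-and-index loop described in the steps above can be sketched in a few lines of code. This is a toy illustration only, not how any real engine is built: it crawls a hypothetical in-memory "web" (a dict mapping URLs to the links found on each page) rather than making real HTTP requests, and the "index" is just a dict standing in for the enormous database a search engine would use.

```python
from collections import deque

def spider(web, start_url):
    """Breadth-first crawl: visit a page, index it, then follow its links."""
    index = {}                    # url -> outbound links (the cached copy)
    seen = {start_url}            # pages already queued, to avoid loops
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        links = web.get(url, [])  # "fetch" the page from the toy web
        index[url] = links        # cache/index the page
        for link in links:        # follow every link on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# Example: three pages linking to one another (hypothetical URLs)
web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["a.html"],
    "c.html": [],
}
index = spider(web, "a.html")
```

The `seen` set is what keeps the "repeated continuously" process from looping forever on pages that link back to each other.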
The modern search engine is like a giant catalogue of every web page its spiders have crawled. Google, Yahoo!, MSN, Teoma, and others store billions of web pages in their server banks, ready to call upon any of them to appear in the search results should they match a user's query. At the present time, search engine index sizes are probably close to these numbers:
Google ~8 billion pages
Yahoo! ~6 billion pages
MSN Beta ~4 billion pages
Teoma (Ask Jeeves) ~4 billion pages
Gigablast ~1 billion pages
Search engine index size is actually an excellent measure of quality and thoroughness. This is not only because search engines with more pages indexed can return more results, but because these higher-powered engines spider pages more frequently and thus have the freshest data available, as well as the best understanding of the web's link structure (which pages link to which other pages), providing a better measure of popularity and quality.
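The link-structure signal mentioned above can be illustrated with a toy example: given a crawled index (a dict mapping each URL to its outbound links), count how many pages link *to* each page. This inbound-link count is only a crude stand-in for the popularity measures real engines use (such as Google's PageRank, which also weights who is doing the linking); the URLs here are hypothetical.

```python
from collections import Counter

def inbound_link_counts(index):
    """Count inbound links per URL in a crawled index as a crude popularity signal."""
    counts = Counter({url: 0 for url in index})  # start every known page at zero
    for links in index.values():
        for link in links:
            counts[link] += 1                    # credit each page that is linked to
    return counts

# Toy crawled index: a links to b and c, b links to c
index = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": [],
}
popularity = inbound_link_counts(index)
# c.html has two inbound links, b.html one, a.html none
```

A page many others link to scores higher than a page nothing links to, which is the intuition behind using link structure as a quality measure.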