The good news about the Internet and its most visible component, the World
Wide Web, is that there are hundreds of millions of pages available, waiting
to present information on an amazing variety of topics. The bad news about
the Internet is that there are hundreds of millions of pages available, most
of them titled according to the whim of their author, almost all of them
sitting on
servers with cryptic names. When you need to know about a particular
subject, how do you know which pages to read? If you're like most people,
you visit an Internet search engine.
Internet search engines are special sites on the Web that are designed to
help people find information stored on other sites. There are differences in
the ways various search engines work, but they all perform three basic
tasks:
- They search the Internet -- or select pieces of the Internet -- based
on important words.
- They keep an index of the words they find, and where they find them.
- They allow users to look for words or combinations of words found in
that index.
Early search engines held an index of a few hundred thousand pages and
documents, and received maybe one or two thousand inquiries each day. Today,
a top search engine will index hundreds of millions of pages, and respond to
tens of millions of queries per day. In this edition of
HowStuffWorks,
we'll tell you how these major tasks are performed, and how Internet search
engines put the pieces together in order to let you find the information you
need on the Web.
Looking at the Web
When most people talk about Internet search engines, they really mean World
Wide Web search engines. Before the Web became the most visible part of the
Internet, there were already search engines in place to help people find
information on the Net. Programs with names like "gopher" and "Archie" kept
indexes of files stored on servers connected to the Internet, and
dramatically reduced the amount of time required to find programs and
documents. In the early 1990s, getting serious value from the Internet meant
knowing how to use gopher, Archie, Veronica and the rest. Today, most
Internet users limit their searches to the Web, so we'll limit this article
to search engines that focus on the contents of
Web pages.
An Itsy-Bitsy Beginning
Before a search engine can tell you where a file or document is, the file must
first be found. To find information on the hundreds of millions of Web pages that
exist, a search engine employs special software robots, called spiders,
to build lists of the words found on Web sites. When a spider is building
its lists, the process is called Web crawling. (There are some
disadvantages to calling part of the Internet the World Wide Web -- a large
set of arachnid-centric names for tools is one of them.) In order to build
and maintain a useful list of words, a search engine's spiders have to look
at a lot of pages.
How does any spider start its travels over the Web? The usual starting
points are lists of heavily used
servers and
very popular pages. The spider will begin with a popular site, indexing the
words on its pages and following every link found within the site. In this
way, the spidering system quickly begins to travel, spreading out across the
most widely used portions of the Web.
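As a rough sketch of that crawl loop (written in Python, and not any real engine's code), the fragment below starts from a few seed URLs, records the words on each page and queues up every link it finds; the seed list and the crude HTML handling are assumptions made purely for illustration.

```python
# A minimal, illustrative crawl loop -- not any real engine's code.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

seed_urls = ["https://www.example.com/"]        # stand-ins for "popular site" seeds

def crawl(seeds, max_pages=50):
    queue, seen = deque(seeds), set(seeds)
    word_lists = {}                             # URL -> list of words found there
    while queue and len(word_lists) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                            # skip pages that fail to load
        text = re.sub(r"<[^>]+>", " ", html)    # crude tag stripping
        word_lists[url] = re.findall(r"[A-Za-z']+", text.lower())
        for link in re.findall(r'href="([^"#]+)"', html):   # follow every link found
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return word_lists

word_lists = crawl(seed_urls)
```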
"Spiders" take a Web page's content and create key search words that
enable online users to find pages they're looking for.
Google.com began as an academic search engine. In the paper that
describes how the system was built, Sergey Brin and Lawrence Page give an
example of how quickly their spiders can work. They built their initial
system to use multiple spiders, usually three at one time. Each spider could
keep about 300 connections to Web pages open at a time. At its peak
performance, using four spiders, their system could crawl over 100 pages per
second, generating around 600 kilobytes of data each second.
Keeping everything running quickly meant building a system to feed
necessary information to the spiders. The early Google system had a server
dedicated to providing URLs to the spiders. Rather than depending on an
Internet service
provider for the
domain name server (DNS) that translates a server's name into an
address, Google had
its own DNS, in order to keep delays to a minimum.
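The benefit of keeping name lookups local can be sketched with a simple resolver cache; the dictionary-based `resolve` helper below is an illustration of the idea, not Google's actual design.

```python
# Illustrative resolver cache: look a hostname up once, then reuse the answer.
import socket

_dns_cache = {}

def resolve(hostname):
    """Return an IP address for hostname, consulting the local cache first."""
    if hostname not in _dns_cache:
        _dns_cache[hostname] = socket.gethostbyname(hostname)  # one slow network lookup
    return _dns_cache[hostname]       # later requests for the same host return instantly
```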
When the Google spider looked at an
HTML page, it
took note of two things:
- The words within the page
- Where the words were found
Words occurring in the title, subtitles,
meta tags
and other positions of relative importance were noted for special
consideration during a subsequent user search. The Google spider was built
to index every significant word on a page, leaving out the articles "a,"
"an" and "the." Other spiders take different approaches.
These different approaches usually attempt to make the spider operate
faster, allow users to search more efficiently, or both. For example, some
spiders will keep track of the words in the title, sub-headings and links,
along with the 100 most frequently used words on the page and each word in
the first 20 lines of text.
Lycos is said to use this approach to spidering the Web.
Other systems, such as
AltaVista, go in the other direction, indexing every single word on a
page, including "a," "an," "the" and other "insignificant" words. The push
to completeness in this approach is matched by other systems in the
attention given to the unseen portion of the Web page, the meta tags.
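As a rough sketch of these differing policies, the Python fragment below records each word along with its region and position, and a single flag decides whether the articles are skipped or every word is kept; the region names and other details are assumptions for illustration, not any engine's actual rules.

```python
# Illustrative word extraction: record each word, its region and its position.
import re

ARTICLES = {"a", "an", "the"}            # the "insignificant" words some engines skip

def extract_words(regions, skip_articles=True):
    """regions maps a region name (such as "title" or "body") to its text; returns
    (word, region, position) tuples ready to hand to the indexer."""
    entries = []
    for region, text in regions.items():
        for position, word in enumerate(re.findall(r"[A-Za-z']+", text.lower())):
            if skip_articles and word in ARTICLES:
                continue                 # drop the articles
            entries.append((word, region, position))
    return entries

page = {"title": "A Guide to the Hash Table",
        "body": "The hash table spreads entries evenly."}
extract_words(page)                      # skips "a" and "the"
extract_words(page, skip_articles=False) # keeps every single word
```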
Meta Tags
Meta tags allow the owner of a page to specify key words and concepts
under which the page will be indexed. This can be helpful, especially in
cases in which the words on the page might have double or triple meanings --
the meta tags
can guide the search engine in choosing which of the several possible
meanings for these words is correct. There is, however, a danger in
over-reliance on meta tags, because a careless or unscrupulous page owner
might add meta tags that fit very popular topics but have nothing to do with
the actual contents of the page. To protect against this, spiders will
correlate meta tags with page content, rejecting the meta tags that don't
match the words on the page.
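A rough sketch of that cross-check might look like the following; the `keywords_match_page` helper and its simple rule that a keyword must appear somewhere in the body are assumptions for illustration.

```python
# Illustrative check: keep meta keywords only if they also appear on the page.
import re

def keywords_match_page(meta_keywords, page_text):
    """Return only the meta keywords that actually occur in the page's text."""
    words_on_page = set(re.findall(r"[A-Za-z']+", page_text.lower()))
    return [kw for kw in meta_keywords if kw.lower() in words_on_page]

# A keyword that never appears in the body is simply rejected:
keywords_match_page(["fishing", "celebrity"], "a complete guide to fly fishing")
# -> ["fishing"]
```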
All of this assumes that the owner of a page actually wants it to be
included in the results of a search engine's activities. Many times, the
page's owner doesn't want it showing up on a major search engine, or doesn't
want the activity of a spider accessing the page. Consider, for example, a
game that builds new, active pages each time sections of the page are
displayed or new links are followed. If a Web spider accesses one of these
pages, and begins following all of the links for new pages, the game could
mistake the activity for a high-speed human player and spin out of control.
To avoid situations like this, the robot exclusion protocol was
developed. This protocol, implemented in the meta-tag section at the
beginning of a Web page, tells a spider to leave the page alone -- to
neither index the words on the page nor try to follow its links.
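In HTML, the directive appears as a "robots" meta tag near the top of the page; a spider might honor it with a check along these lines (a sketch, not any particular crawler's implementation).

```python
# Illustrative check of the robots meta tag before indexing or following links.
import re

def robots_directives(html):
    """Return a set such as {"noindex", "nofollow"} from the robots meta tag."""
    match = re.search(r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']+)["\']',
                      html, re.IGNORECASE)
    return {d.strip().lower() for d in match.group(1).split(",")} if match else set()

directives = robots_directives('<meta name="robots" content="noindex, nofollow">')
should_index = "noindex" not in directives     # False here: don't index the words
should_follow = "nofollow" not in directives   # False here: don't follow the links
```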
Building the Index
Once the spiders have completed the task of finding information on Web pages
(and we should note that this is a task that is never actually completed --
the constantly changing nature of the Web means that the spiders are always
crawling), the search engine must store the information in a way that makes
it useful. There are two key components involved in making the gathered data
accessible to users:
- The information stored with the data
- The method by which the information is indexed
In the simplest case, a search engine could just store the word and the
URL where it was found. In reality, this would make for an engine of limited
use, since there would be no way of telling whether the word was used in an
important or a trivial way on the page, whether the word was used once or
many times or whether the page contained links to other pages containing the
word. In other words, there would be no way of building the ranking
list that tries to present the most useful pages at the top of the list of
search results.
To make for more useful results, most search engines store more than just
the word and URL. An engine might store the number of times that the word
appears on a page. The engine might assign a weight to each entry,
with increasing values assigned to words as they appear near the top of the
document, in sub-headings, in links, in the meta tags or in the title of the
page. Each commercial search engine has a different formula for assigning
weight to the words in its index. This is one of the reasons that a search
for the same word on different search engines will produce different lists,
with the pages presented in different orders.
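A toy version of such an index might store, for each word, the pages it appears on together with a weighted score; the region weights below are invented numbers, since every engine keeps its own formula to itself.

```python
# Toy weighted inverted index: word -> {url: score}.
from collections import defaultdict

REGION_WEIGHTS = {"title": 5.0, "heading": 3.0, "link": 2.0, "body": 1.0}  # made-up values

index = defaultdict(lambda: defaultdict(float))

def add_entry(word, url, region, count=1):
    """Record `count` occurrences of `word` found in `region` of `url`."""
    index[word][url] += count * REGION_WEIGHTS.get(region, 1.0)

def lookup(word):
    """Return the pages containing `word`, highest-scoring first."""
    return sorted(index[word].items(), key=lambda item: item[1], reverse=True)

add_entry("spider", "http://example.com/a", "title")
add_entry("spider", "http://example.com/b", "body", count=4)
lookup("spider")   # the single title match outranks the four body matches
```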
Regardless of the precise combination of additional pieces of information
stored by a search engine, the data will be encoded to save storage
space. For example, the original Google paper describes using 2
bytes, of 8
bits each, to
store information on weighting -- whether the word was capitalized, its font
size, position, and other information to help in ranking the hit. Each
factor might take up 2 or 3 bits within the 2-byte grouping (8 bits = 1
byte). As a result, a great deal of information can be stored in a very
compact form. After the information is compacted, it's ready for indexing.
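The kind of packing described there can be sketched with Python's bit operators; the field layout below (1 bit for capitalization, 3 bits for font size, 12 bits for position) is a simplified example in the spirit of the paper's description, not an exact reproduction of its format.

```python
# Illustrative bit packing: squeeze several ranking hints into 2 bytes (16 bits).
def pack_hit(capitalized, font_size, position):
    """capitalized: 0-1 (1 bit), font_size: 0-7 (3 bits), position: 0-4095 (12 bits)."""
    return (capitalized & 0x1) << 15 | (font_size & 0x7) << 12 | (position & 0xFFF)

def unpack_hit(packed):
    return {"capitalized": (packed >> 15) & 0x1,
            "font_size":   (packed >> 12) & 0x7,
            "position":    packed & 0xFFF}

packed = pack_hit(capitalized=1, font_size=3, position=42)
assert packed < 2**16                          # the whole hit fits in two bytes
assert unpack_hit(packed)["position"] == 42
```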
An index has a single purpose: It allows information to be found as
quickly as possible. There are quite a few ways for an index to be built,
but one of the most effective ways is to build a hash table. In
hashing,
a formula is applied to attach a numerical value to each word. The formula
is designed to evenly distribute the entries across a predetermined number
of divisions. This numerical distribution is different from the distribution
of words across the alphabet, and that is the key to a hash table's
effectiveness.
In English, there are some letters that begin many words, while others
begin fewer. You'll find, for example, that the "M" section of the
dictionary is much thicker than the "X" section. This inequity means that
finding a word beginning with a very "popular" letter could take much longer
than finding a word that begins with a less popular one. Hashing evens out
the difference, and reduces the average time it takes to find an entry. It
also separates the index from the actual entry. The hash table contains the
hashed number along with a pointer to the actual data, which can be sorted
in whichever way allows it to be stored most efficiently. The combination of
efficient indexing and effective storage makes it possible to get results
quickly, even when the user creates a complicated search.
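The balancing act is easy to see in miniature: hash each word into one of a fixed number of buckets and store a pointer to its data there, so a lookup only ever scans one small bucket. Python's built-in `hash` stands in for the engine's formula in this sketch.

```python
# Illustrative hash table: words spread evenly across a fixed number of buckets.
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_for(word):
    return hash(word) % NUM_BUCKETS          # the formula that spreads entries out

def insert(word, pointer_to_data):
    buckets[bucket_for(word)].append((word, pointer_to_data))

def find(word):
    """Scan only the one bucket the word hashes to, however lopsided the alphabet is."""
    for stored_word, pointer in buckets[bucket_for(word)]:
        if stored_word == word:
            return pointer
    return None

insert("mammoth", "entry for mammoth")       # "M" words and "X" words end up in
insert("xylophone", "entry for xylophone")   # buckets of roughly equal size
find("mammoth")                              # -> "entry for mammoth"
```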
Building a Search
Searching through an index involves a user building a query and
submitting it through the search engine. The query can be quite simple, a
single word at minimum. Building a more complex query requires the use of
Boolean operators
that allow you to refine and extend the terms of the search.
The Boolean operators most often seen are listed here, with a short sketch of how an engine might evaluate the simplest of them after the list:
- AND - All the terms joined by "AND" must appear in the pages or
documents. Some search engines substitute the operator "+" for the word
AND.
- OR - At least one of the terms joined by "OR" must appear in
the pages or documents.
- NOT - The term or terms following "NOT" must not appear in the
pages or documents. Some search engines substitute the operator "-" for
the word NOT.
- FOLLOWED BY - One of the terms must be directly followed by the
other.
- NEAR - One of the terms must be within a specified number of
words of the other.
- Quotation Marks - The words between the quotation marks are
treated as a phrase, and that phrase must be found within the document or
file.
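To make the simplest operators concrete, here is a sketch of how AND, OR and NOT might be evaluated against an index that maps each word to the set of pages containing it; the tiny sample index is made up for illustration, and FOLLOWED BY and NEAR would additionally need the word positions discussed earlier.

```python
# Illustrative Boolean evaluation over an index of word -> set of page URLs.
sample_index = {                                   # made-up data for illustration
    "bed":     {"pageA", "pageB", "pageC"},
    "truck":   {"pageB"},
    "flowers": {"pageC", "pageD"},
}

def pages_with(word):
    return sample_index.get(word, set())

pages_with("bed") & pages_with("truck")     # AND -> {'pageB'}
pages_with("bed") | pages_with("flowers")   # OR  -> pages A, B, C and D
pages_with("bed") - pages_with("flowers")   # NOT -> {'pageA', 'pageB'}
```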
Searching for Sport
Search engines have become such an integral part of
our lives that at least one organized game has evolved around this tool.
In Googlewhacking, you type two words into the
Google search engine in the hopes of receiving exactly one
result -- a single Web page on which both of those words appear. This is
a pure whack.
It's quite a difficult task -- the two words have to be almost unrelated, or
you'll get far more than one result, yet many truly unrelated pairs turn up
no results at all.
If you achieve a pure whack, you can submit it to
www.googlewhack.com, where it is posted in The Whack Stack
(along with your name, or whatever you want to call yourself) for all to
see. One pure whack currently in The Whack Stack is "ambidextrous
scallywags."
Future Search
The searches defined by
Boolean operators
are literal searches -- the engine looks for the words or phrases
exactly as they are entered. This can be a problem when the entered words
have multiple meanings. "Bed," for example, can be a place to sleep, a place
where flowers are planted, the storage space of a truck or a place where
fish lay their eggs. If you're interested in only one of these meanings, you
might not want to see pages featuring all of the others. You can build a
literal search that tries to eliminate unwanted meanings, but it's nice if
the search engine itself can help out.
One of the areas of search engine research is concept-based
searching. Some of this research involves using statistical analysis on
pages containing the words or phrases you search for, in order to find other
pages you might be interested in. Obviously, the information stored about
each page is greater for a concept-based search engine, and far more
processing is required for each search. Still, many groups are working to
improve both results and performance of this type of search engine. Others
have moved on to another area of research, called natural-language
queries.
The idea behind natural-language queries is that you can type a question
in the same way you would ask it to a human sitting beside you -- no need to
keep track of Boolean operators or complex query structures. The most
popular natural language query site today is
AskJeeves.com, which parses the query for keywords that it then applies
to the index of sites it has built. It only works with simple queries, but
competition is heavy to develop a natural-language query engine that can
accept a query of great complexity.
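A very rough sketch of that first step (boiling a plain-English question down to the keywords that get looked up in the index) appears below; the stop-word list is an assumption, and real natural-language systems do a great deal more than this.

```python
# Illustrative keyword extraction from a natural-language question.
import re

QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "is", "are",
                  "do", "does", "the", "a", "an", "of", "in", "to"}   # assumed list

def keywords(question):
    """Reduce a plain-English question to the terms worth looking up."""
    words = re.findall(r"[A-Za-z']+", question.lower())
    return [w for w in words if w not in QUESTION_WORDS]

keywords("How do spiders crawl the World Wide Web?")
# -> ['spiders', 'crawl', 'world', 'wide', 'web']
```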