Page:Untangling the Web.pdf/28



Search Engine Basics

A search engine comprises three basic parts:


 * 1) The spider/robot/crawler is software that "visits" sites on the Internet (each search engine does this differently). The spider reads what is there, follows links at the site, and ultimately brings all that data back to:
 * 2) The search engine index, catalog, or database, where everything the spider found is stored;
 * 3) The search engine software, which actually sifts through everything in the index to find matches and then ranks or sorts them into a list of results or hits.

Important points to consider about search engines:
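The three parts described above can be sketched in miniature. This is only an illustrative toy, not how any real search engine is built: the `PAGES` corpus, URLs, and function names are all assumptions standing in for the live web and a production crawler.

```python
import re
from collections import defaultdict

# Toy "web" standing in for live sites (an assumption for illustration;
# a real spider would fetch these pages over HTTP).
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">search engines</a> index the web',
    "http://example.com/a": "spiders crawl and follow links",
}

def crawl(start_url, pages):
    """Part 1, the spider: visit pages, read them, follow their links,
    and bring the data back to Part 2, the index."""
    index = defaultdict(set)            # word -> set of URLs (the catalog)
    to_visit, seen = [start_url], set()
    while to_visit:
        url = to_visit.pop()
        if url in seen or url not in pages:
            continue
        seen.add(url)
        html = pages[url]
        # Store every word found on the page (crude tokenization).
        for word in re.findall(r"[a-z]+", html.lower()):
            index[word].add(url)
        # Follow links on the page back into the crawl queue.
        to_visit.extend(re.findall(r'href="([^"]+)"', html))
    return index

def search(index, term):
    """Part 3, the search software: match against the index, not the web."""
    return sorted(index.get(term.lower(), []))
```

Note that `search` never touches `PAGES` at all: once the spider has run, queries are answered entirely from the stored index, which is the point made below about search engines not operating in "real time."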

 * Spiders are programmed to return to websites on a regular basis, but the time interval varies widely from engine to engine. Monthly or better is considered "fresh."
 * When you use a search engine, you are searching the index or database, not the web pages themselves. This is important to remember because no search engine operates in "real time."
 * Spiders do not index all the web pages they find. They skip, for example, pages that employ the "Robots Exclusion Protocol" or the "Robots META tag." The first of these mechanisms is a special file website administrators use to indicate which parts of the site should not be visited by the robot or spider. The second is a special HTML metatag that may be inserted by a web page author to indicate whether the page may be indexed or analyzed for links. Not every robot/spider
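The Robots Exclusion Protocol mentioned above is a plain-text file, conventionally served as `robots.txt`, that a well-behaved spider checks before visiting a site. Python's standard library can parse one; the rules and URLs below are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: all crawlers are asked to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite spider consults the parser before fetching each URL.
blocked = rp.can_fetch("MyBot", "http://example.com/private/page.html")
allowed = rp.can_fetch("MyBot", "http://example.com/index.html")
```

The second mechanism, the Robots META tag, lives inside an individual page rather than at the site level, e.g. `<meta name="robots" content="noindex, nofollow">` tells a compliant spider neither to index that page nor to follow its links. As the text notes, compliance with either mechanism is voluntary on the spider's part.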