May 17, 2010

What a Search Engine Spider Does

The first thing that you need to understand is what a search engine "spider" is, and how it works. A "spider" (also known as a "robot" or "crawler") is a software program that search engines use to find what’s out there on the ever-changing web.
There are many types of spiders in use, but for now we're only interested in the one that actually "crawls" the web looking for pages. This is a somewhat oversimplified picture, but basically the program starts at a website, loads its pages, and follows the hyperlinks on each page. In this way, the theory goes, everything on the web will eventually be found as the spider crawls from one website to another. Search engines may run thousands of instances of their web-crawling spider programs simultaneously, on multiple servers.
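To make that concrete, here is a minimal sketch of the fetch-and-follow-links idea in Python, using only the standard library. The LinkExtractor class and fetch_page function are illustrative names of my own, and a real crawler adds much more (robots.txt handling, politeness delays, retries), but the basic shape is the same:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_page(url):
    """Download one page and return its HTML plus the absolute URLs it links to."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    links = [urljoin(url, href) for href in parser.links]  # resolve relative links
    return html, links
```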
When a "crawler" visits one of your web pages, it fetches the page's contents. The text of the page is then loaded into the search engine's index, which is a massive database of words and of where those words occur on different web pages.
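A toy version of that index, continuing the sketch above, is simply a mapping from each word to the set of pages it appears on. The index_page and search names are made up for illustration, and a real search engine's index is a huge distributed data store rather than an in-memory dictionary:

```python
import re
from collections import defaultdict

# word -> set of URLs where that word appears (a tiny inverted index)
index = defaultdict(set)


def index_page(url, html):
    """Break a page's text into words and record which page each word came from."""
    text = re.sub(r"<[^>]+>", " ", html)            # crudely strip HTML tags
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        index[word].add(url)


def search(word):
    """Return every indexed URL that contains the given word."""
    return index.get(word.lower(), set())
```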
So there are really three steps: crawling (fetching pages), indexing (breaking each page down into the words that go into the index), and a final step in which the links (web page addresses, or URLs) found on those pages are fed back into the crawling program to be retrieved in turn.
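Tying the hypothetical pieces above together, that three-step loop might look roughly like this, with newly discovered URLs fed back into a queue of pages waiting to be crawled:

```python
from collections import deque


def crawl(start_url, max_pages=50):
    """Crawl outward from one starting page: fetch, index, then queue the new links."""
    queue = deque([start_url])
    seen = {start_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html, links = fetch_page(url)       # step 1: crawling
        except OSError:
            continue                            # skip pages that fail to load
        index_page(url, html)                   # step 2: indexing
        fetched += 1
        for link in links:                      # step 3: feed new URLs back in
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)


# Example (hypothetical): crawl("https://example.com/") and then search("spider")
```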
When the spider can't find a page, that page will eventually be deleted from the index (most spiders will check again later to verify that the page really is offline before dropping it). This is one reason why it's important to use a reliable web hosting provider.
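One way to picture that clean-up, again continuing the toy sketch: re-check a page a few times before removing it from the index, since a single failed fetch might just be a temporary outage. The failure threshold here is an arbitrary choice for illustration:

```python
failures = {}          # url -> number of consecutive failed fetch attempts
FAILURE_LIMIT = 3      # arbitrary threshold before a page is considered gone


def recheck_page(url):
    """Re-fetch a known page; after repeated failures, drop it from the index."""
    try:
        html, _ = fetch_page(url)
    except OSError:
        failures[url] = failures.get(url, 0) + 1
        if failures[url] >= FAILURE_LIMIT:
            for urls in index.values():   # remove the dead URL from every word's entry
                urls.discard(url)
            failures.pop(url, None)
        return
    failures.pop(url, None)               # the page is back, so reset its failure count
    index_page(url, html)
```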
