May 13, 2010

Measuring Relevance and Popularity

Modern commercial search engines rely on the science of information retrieval (IR). That science has existed since the middle of the 20th century, when retrieval systems ran on computers in libraries, research facilities, and government labs. Early in the development of search systems, IR scientists realized that two critical components made up the majority of search functionality:

Relevance - the degree to which the content of the documents returned in a search matches the user's query intent and terms. The relevance of a document increases if the terms or phrase queried by the user occur multiple times and appear in the title of the work or in important headlines or subheaders.

Popularity - the relative importance of a given document that matches the user's query, measured via citation (the act of one work referencing another, as often occurs in academic and business documents). The popularity of a given document increases with every other document that references it.

These two concepts were carried over to web search 40 years later, where they take the form of document analysis and link analysis.

In document analysis, search engines look at whether the search terms are found in important areas of the document - the title, the metadata, the heading tags, and the body text. They also attempt to automatically measure the quality of the document (through complex systems beyond the scope of this guide).
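To make the idea concrete, here is a minimal sketch of field-weighted term matching. It is only an illustration: the function name `document_score`, the field names, and the weights in `FIELD_WEIGHTS` are assumptions invented for this example, and real engines use far more elaborate (and unpublished) scoring.

```python
# Hypothetical field weights: illustrative assumptions only,
# not values used by any real search engine.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "meta": 1.5, "body": 1.0}


def document_score(query, document):
    """Sum weighted counts of each query term across a document's fields."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        # Naive whitespace tokenization; a missing field counts as empty.
        words = document.get(field, "").lower().split()
        for term in terms:
            score += weight * words.count(term)
    return score


# A made-up page used only to exercise the function.
page = {
    "title": "Guitar lessons for beginners",
    "headings": "Learn guitar chords",
    "meta": "Online guitar lessons",
    "body": "Our guitar lessons cover chords, scales and rhythm.",
}

print(document_score("guitar lessons", page))
```

A match in the title counts for more than the same match in the body, which mirrors the "important areas" idea described above.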

In link analysis, search engines measure not only who is linking to a site or page, but also what they are saying about that page or site. They also have a good grasp of who is affiliated with whom (through historical link data, the site's registration records, and other sources), who is worthy of being trusted (links from .edu and .gov pages are generally more valuable for this reason), and contextual data about the site the page is hosted on (who links to that site, what they say about the site, etc.).
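A rough sketch of the link side might look like the following. It counts inbound links per page, gives extra weight to links from .edu and .gov hosts, and collects anchor text as a stand-in for "what they are saying". The function name `link_scores`, the trust multipliers, and the sample URLs are all assumptions made up for illustration; actual engines use far richer signals.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical trust settings: real engines do not publish such values.
TRUSTED_SUFFIXES = (".edu", ".gov")
TRUSTED_WEIGHT = 2.0
DEFAULT_WEIGHT = 1.0


def link_scores(links):
    """Score target URLs from (source_url, target_url, anchor_text) tuples."""
    scores = defaultdict(float)
    anchors = defaultdict(list)
    for source, target, anchor_text in links:
        host = urlparse(source).hostname or ""
        # Links from trusted domains count for more than ordinary links.
        weight = TRUSTED_WEIGHT if host.endswith(TRUSTED_SUFFIXES) else DEFAULT_WEIGHT
        scores[target] += weight
        # Anchor text records what the linking page says about the target.
        anchors[target].append(anchor_text)
    return dict(scores), dict(anchors)


# Made-up links used only to exercise the function.
links = [
    ("https://music.example.edu/resources", "https://example.com/lessons", "guitar lessons"),
    ("https://blog.example.org/post", "https://example.com/lessons", "great tutorial"),
    ("https://blog.example.org/post", "https://example.com/about", "about page"),
]

scores, anchors = link_scores(links)
print(scores)
print(anchors)
```

Each additional inbound link raises a page's score, and the collected anchor text could be fed back into term matching for the target page.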

Link and document analysis combine and overlap hundreds of factors that can be individually measured and filtered through the search engine algorithms (the set of instructions that tells the engines what importance to assign to each factor). The algorithm then determines scoring for the documents and (ideally) lists results in decreasing order of importance (rankings).
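As a toy illustration of that last step, the sketch below combines two per-document factors into one score and sorts the results in decreasing order. The names `WEIGHTS` and `rank`, the two-factor setup, and the sample numbers are assumptions for this example; a real algorithm weighs hundreds of factors in ways that are not public.

```python
# Hypothetical importance assigned to each factor (assumed values).
WEIGHTS = {"document": 0.6, "link": 0.4}


def rank(candidates):
    """Return candidates sorted by combined weighted score, highest first."""
    def combined(candidate):
        return sum(WEIGHTS[factor] * candidate[factor] for factor in WEIGHTS)

    return sorted(candidates, key=combined, reverse=True)


# Made-up factor scores for three pages.
candidates = [
    {"url": "a.example", "document": 13.0, "link": 1.0},
    {"url": "b.example", "document": 5.0, "link": 9.0},
    {"url": "c.example", "document": 2.0, "link": 20.0},
]

for result in rank(candidates):
    print(result["url"])
```

Changing the values in `WEIGHTS` changes the ordering, which is the sense in which the algorithm decides what importance to assign to each factor.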
