Focused Crawling Using Context Graphs

M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. In Proceedings of the 26th VLDB Conference, pages 527–534, Cairo, Egypt, 2000. [pdf]

——————-

This article presents an approach to web crawling that is focused, as opposed to standard crawling. Standard crawling uses a technique called ‘forward crawling’, that is, following the links found on a page to the next pages. Its limitation is that, starting from the crawler's point of entry, it does not exploit the lower levels of a web site's tree.

The authors also list other limitations of the standard crawling technique: a) the limited ability to sacrifice short-term document retrieval gains in the interest of better overall crawl performance; b) the lack of learning strategies for cases where topically relevant documents are found only by following off-topic pages.

The authors’ contribution is the Context Focused Crawler, an engine that queries Google or AltaVista for the pages linking to a given page, a step called ‘backward crawling’. The results are used to build a context graph of the page: a network of documents obtained by moving backwards from the target document along the hyperlinks that point to it.
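The layered construction can be illustrated with a minimal sketch. It assumes a hypothetical `get_backlinks` function standing in for the search-engine backlink query, and a toy in-memory index (`TOY_BACKLINKS`) in place of the real web; neither name comes from the paper, and the paper's own procedure may differ in detail.

```python
from typing import Callable, Dict, Iterable, List, Set


def build_context_graph(
    target: str,
    get_backlinks: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> List[Set[str]]:
    """Return layers where layers[i] holds the pages first reached
    after i backward steps from `target` (layer 0 is the target itself)."""
    layers: List[Set[str]] = [{target}]
    seen: Set[str] = {target}
    for _ in range(max_depth):
        next_layer: Set[str] = set()
        for page in layers[-1]:
            # Hypothetical backlink lookup: "which pages link to `page`?"
            for parent in get_backlinks(page):
                if parent not in seen:
                    seen.add(parent)
                    next_layer.add(parent)
        if not next_layer:
            break  # no further in-links were found
        layers.append(next_layer)
    return layers


# Toy backlink index (assumption): maps a page to the pages linking to it.
TOY_BACKLINKS: Dict[str, List[str]] = {
    "target": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["target"],
}

layers = build_context_graph(
    "target", lambda page: TOY_BACKLINKS.get(page, []), max_depth=2
)
```

Each page is placed in the first layer where it appears, so a page that links back to the target through a cycle (like `c` above) is not revisited.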

These documents are then used to train classifiers on a reduced version of the TF/IDF representation. Subsequently, by computing a naive Bayes likelihood function for the winning layer and comparing it to a threshold, weakly matching pages can be discarded.
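The winning-layer test can be sketched as follows. This is only an illustration under stated assumptions: it uses raw word counts with Laplace smoothing rather than the reduced TF/IDF representation, scores a page by its mean per-word log-likelihood, and picks an arbitrary threshold of -2.0. The function names, the toy documents and the threshold value are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def train_layer_models(
    layer_docs: Dict[int, List[str]], alpha: float = 1.0
) -> Dict[int, Dict[str, float]]:
    """Fit one smoothed multinomial model (word log-probabilities) per layer."""
    vocab = {w for docs in layer_docs.values() for d in docs for w in d.split()}
    models: Dict[int, Dict[str, float]] = {}
    for layer, docs in layer_docs.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        models[layer] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return models


def classify(
    text: str, models: Dict[int, Dict[str, float]], threshold: float
) -> Optional[int]:
    """Return the winning layer, or None when the match is too weak."""
    best_layer: Optional[int] = None
    best_score = -math.inf
    for layer, logp in models.items():
        known = [w for w in text.split() if w in logp]
        if not known:
            return None  # no known vocabulary at all: discard the page
        # Mean per-word log-likelihood, so long pages are not penalised.
        score = sum(logp[w] for w in known) / len(known)
        if score > best_score:
            best_layer, best_score = layer, score
    return best_layer if best_score >= threshold else None


# Toy training data (assumption): layer 0 is the target, layer 1 its parents.
models = train_layer_models({
    0: ["conference papers neural networks", "neural networks tutorial papers"],
    1: ["department faculty research pages", "faculty research groups"],
})
THRESHOLD = -2.0  # hypothetical cut-off, chosen only for this toy example

on_topic = classify("neural networks papers", models, THRESHOLD)
one_step_away = classify("faculty research", models, THRESHOLD)
discarded = classify("conference department", models, THRESHOLD)
```

A page whose best layer still scores below the threshold is returned as `None`, which corresponds to discarding it as a weak match.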

As a side remark, the authors use the name Context Graph, but this should not be confused with the Contextual Network Graph of Ceglowski [2003]. The two differ because the graph here is not assembled with a term/document procedure, and because the retrieval function is not based on the spreading activation technique.
