M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. In Proceedings of the 26th VLDB Conference, pages 527–534, Cairo, Egypt, 2000. [pdf]
——————-
This article presents an approach to web crawling defined as focused, as opposed to standard crawling. Standard crawling uses a technique called ‘forward crawling’, that is, following the links found on a page to the next ones. Its limitation is that it does not exploit the lower levels of a web site’s tree structure from the crawler’s point of entry.
The authors also list other limits of the standard crawling technique: a) the limited ability to sacrifice short-term document retrieval gains in the interest of better overall crawl performance; b) the lack of learning strategies for cases where topically relevant documents can only be found by following off-topic pages.
The authors’ contribution is the Context Focused Crawler, an engine that uses backlink queries against Google or AltaVista to perform ‘backward crawling’ from a given page. This is used to build a context graph of the page: a network of documents that expands backwards from the target document by following hyperlinks in reverse.
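As a rough illustration, here is a minimal sketch of the backward-crawling step, assuming a hypothetical get_backlinks helper stands in for the backlink queries the paper issued against Google or AltaVista (layer 0 holds the target document, layer i the documents that reach it in i link steps):

```python
from collections import defaultdict

def get_backlinks(url):
    """Hypothetical helper: returns the URLs of pages linking to `url`.
    In the paper this role was played by search-engine backlink queries."""
    raise NotImplementedError

def build_context_graph(seed_url, depth):
    # Layer 0 contains only the seed (target) document; layer i holds
    # documents from which the seed is reachable in i forward link steps.
    layers = defaultdict(set)
    layers[0].add(seed_url)
    edges = []
    frontier = {seed_url}
    for i in range(1, depth + 1):
        next_frontier = set()
        for url in frontier:
            for parent in get_backlinks(url):
                edges.append((parent, url))
                next_frontier.add(parent)
        layers[i] = next_frontier
        frontier = next_frontier
    return layers, edges
```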
Afterwards these documents are used to train a set of classifiers, one per layer of the context graph, on a reduced vocabulary selected with the TF/IDF weighting scheme. Subsequently, by computing a Naive Bayes likelihood for the winning layer and comparing it to a threshold, weakly matching pages can be discarded.
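A minimal sketch of this classification stage with scikit-learn, which approximates rather than reproduces the paper’s setup: the max_features cap stands in for the reduced TF/IDF vocabulary, and the threshold value is purely illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def train_layer_classifier(layer_docs):
    # layer_docs: {layer_index: [document text, ...]} from the context graph.
    texts, labels = [], []
    for layer, docs in layer_docs.items():
        texts.extend(docs)
        labels.extend([layer] * len(docs))
    # A capped TF-IDF vocabulary plays the role of the paper's reduced
    # feature set (500 is an assumed, illustrative size).
    vectorizer = TfidfVectorizer(max_features=500)
    X = vectorizer.fit_transform(texts)
    clf = MultinomialNB().fit(X, labels)
    return vectorizer, clf

def assign_layer(page_text, vectorizer, clf, threshold=-50.0):
    # Pick the winning layer; discard the page when even the best
    # log-probability falls below the (illustrative) threshold.
    x = vectorizer.transform([page_text])
    log_probs = clf.predict_log_proba(x)[0]
    best = int(np.argmax(log_probs))
    if log_probs[best] < threshold:
        return None  # weakly matching page, discarded
    return clf.classes_[best]
```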
A side remark: the authors use the name Context Graph, but this is not to be confused with the Contextual Network Graph of Ceglowski [2003]. The graph is not assembled in the same way via a Term/Document procedure, and the retrieval function is not based on the Spreading Activation technique.
Tags: google, information retrieval