Journal article
A General Evaluation Framework for Topical Crawlers
Information retrieval (Boston), Vol.8(3), pp.417-447
01/2005
DOI: 10.1007/s10791-005-6993-5
Abstract
Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of different nature and difficulty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions including generalized notions of precision, recall, and efficiency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing different relevance criteria. Finally we introduce a set of topic characterizations to analyze the variability in crawling effectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is effective at evaluating, comparing, differentiating and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
Details
- Title: Subtitle
- A General Evaluation Framework for Topical Crawlers
- Creators
- P Srinivasan - School of Library & Information Science and Department of Management Sciences, The University of Iowa, Iowa City, IA 52242, USA
- F Menczer - School of Informatics and Department of Computer Science, Indiana University, Bloomington, IN 47408, USA
- G Pant - School of Accounting and Information Systems, University of Utah, Salt Lake City, UT 84112, USA
- Resource Type
- Journal article
- Publication Details
- Information retrieval (Boston), Vol.8(3), pp.417-447
- Publisher
- Kluwer Academic Publishers; Boston
- DOI
- 10.1007/s10791-005-6993-5
- ISSN
- 1386-4564
- eISSN
- 1573-7659
- Language
- English
- Date published
- 01/2005
- Academic Unit
- Nursing; Computer Science; Business Analytics
- Record Identifier
- 9984003179102771