Conference proceeding
Apophanies or Epiphanies? How Crawlers Impact Our Understanding of the Web
Proceedings of The Web Conference 2020, pp.271-280
WWW '20
04/20/2020
DOI: 10.1145/3366423.3380113
Abstract
Data generated by web crawlers has formed the basis for much of our current understanding of the Internet. However, not all crawlers are created equal and crawlers generally find themselves trading off between computational overhead, developer effort, data accuracy, and completeness. Therefore, the choice of crawler has a critical impact on the data generated and knowledge inferred from it. In this paper, we conduct a systematic study of the trade-offs presented by different crawlers and the impact that these can have on various types of measurement studies. We make the following contributions: First, we conduct a survey of all research published since 2015 in the premier security and Internet measurement venues to identify and verify the repeatability of crawling methodologies deployed for different problem domains and publication venues. Next, we conduct a qualitative evaluation of a subset of all crawling tools identified in our survey. This evaluation allows us to draw conclusions about the suitability of each tool for specific types of data gathering. Finally, we present a methodology and a measurement framework to empirically highlight the differences between crawlers and how the choice of crawler can impact our understanding of the web.
Details
- Title: Subtitle
- Apophanies or Epiphanies? How Crawlers Impact Our Understanding of the Web
- Creators
- Syed Suleman Ahmad - University of Wisconsin–Madison
- Muhammad Daniyal Dar - University of Iowa
- Muhammad Fareed Zaffar - Lahore University of Management Sciences
- Narseo Vallina-Rodriguez - IMDEA Networks
- Rishab Nithyanand - University of Iowa
- Resource Type
- Conference proceeding
- Publication Details
- Proceedings of The Web Conference 2020, pp.271-280
- Series
- WWW '20
- DOI
- 10.1145/3366423.3380113
- Publisher
- ACM
- Language
- English
- Date published
- 04/20/2020
- Academic Unit
- Center for Social Science Innovation; Computer Science; Public Policy Center (Archive)
- Record Identifier
- 9984259463402771