Along with the exponential growth of text data on the Web, particularly user-generated content, comes an increasing need for hierarchically organizing documents, retrieving documents, and discovering evolutionary trends of various popular topics from the data. However, all of these tasks are challenging due to the diversity, heterogeneity, noisiness, and time-sensitivity of Web 2.0 data. Motivated by this, we tackle the challenges at a fundamental level by proposing a novel topic modeling method with ontological guidance. It may be used to discover topic language models formalizing various terms relevant to given topics using Web data. The topic model takes into account both the ontological relationships among the topics defined in a topic taxonomy and word co-occurrence patterns in the data to automatically identify the portions of the data relevant to the topics. It then estimates language models for these topics from these relevant portions. At an application level, we use the topic model to propose novel approaches for three different tasks, namely hierarchical text classification without labeled data, information retrieval with pseudo-relevance feedback, and discovering topic evolutionary trends. Our classification experiment on the IPTC (International Press and Telecommunications Council) taxonomy, containing more than 1,100 topics, shows that our approach achieves a performance of 67% in terms of the hierarchical version of the F-1 measure, without using any labeled data. Our retrieval experiments on five benchmark datasets show that, compared to baseline retrieval (without pseudo-relevance feedback), our approach improves by 39% on average in terms of mean average precision. Finally, for the last task, using blog data, our approach discovers meaningful insights into how the crowd responds to various news topics, such as the language used to discuss each topic, how this language drifts over time, and when the crowd's focus on a topic increases, reaches a peak, and declines.
Dissertation
Topic modeling and applications in Web 2.0
University of Iowa
Doctor of Philosophy (PhD), University of Iowa
Spring 2011
DOI: 10.17077/etd.lj87xny2
Free to read and download, Open Access
Details
- Title: Subtitle
- Topic modeling and applications in Web 2.0
- Creators
- Ha Thuc Viet - University of Iowa
- Contributors
- Padmini Srinivasan (Advisor); Alberto M. Segre (Committee Member); James Cremer (Committee Member); Kasturi Varadarajan (Committee Member); Nick Street (Committee Member)
- Resource Type
- Dissertation
- Degree Awarded
- Doctor of Philosophy (PhD), University of Iowa
- Degree in
- Computer Science
- Date degree season
- Spring 2011
- Publisher
- University of Iowa
- DOI
- 10.17077/etd.lj87xny2
- Number of pages
- ix, 94 pages
- Copyright
- Copyright 2011 Viet Thuc Ha
- Language
- English
- Description bibliographic
- Includes bibliographical references (pages 87-94).
- Academic Unit
- Computer Science
- Record Identifier
- 9983777267502771