Conference proceeding
OCFS: optimal orthogonal centroid feature selection for text categorization
Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval, pp.122-129
SIGIR '05
08/15/2005
DOI: 10.1145/1076034.1076058
Abstract
Text categorization is an important research area in many Information Retrieval (IR) applications. To save storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desired. Traditional techniques for this purpose can generally be classified into feature extraction and feature selection. Because of its efficiency, the latter is more suitable for text data such as web documents. However, many popular feature selection techniques, such as Information Gain (IG) and the χ²-test (CHI), are greedy in nature and thus may not be optimal according to some criterion. Moreover, the performance of these greedy methods may deteriorate when the reserved data dimension is extremely low. In this paper, we propose an efficient optimal feature selection algorithm, called Orthogonal Centroid Feature Selection (OCFS), which optimizes the objective function of the Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space. Experiments on 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1) and Open Directory Project (ODP) data show that OCFS is consistently better than IG and CHI with smaller computation time, especially when the reduced dimension is extremely small.
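The abstract does not spell out the scoring rule, so the following is only a minimal sketch of one way a centroid-based feature selector of this kind could work. It assumes that each feature is scored by the class-size-weighted squared distance between each class centroid and the global centroid along that feature, and that the k highest-scoring features are kept. The function names `ocfs_scores` and `ocfs_select` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def ocfs_scores(rows, labels):
    """Score each feature (assumed rule, not confirmed by the abstract):
    sum over classes of (class size / n) * (class centroid - global centroid)^2.
    """
    n = len(rows)
    d = len(rows[0])
    # Global centroid: mean of every feature over all documents.
    global_centroid = [sum(r[i] for r in rows) / n for i in range(d)]

    # Group documents by class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    scores = [0.0] * d
    for members in by_class.values():
        weight = len(members) / n
        for i in range(d):
            centroid_i = sum(r[i] for r in members) / len(members)
            scores[i] += weight * (centroid_i - global_centroid[i]) ** 2
    return scores


def ocfs_select(rows, labels, k):
    """Return the indices of the k highest-scoring features, in ascending order."""
    scores = ocfs_scores(rows, labels)
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:k])


# Toy data: feature 0 separates the two classes, features 1 and 2 do not.
rows = [[2, 0, 5], [2, 1, 5], [0, 0, 5], [0, 1, 5]]
labels = [0, 0, 1, 1]
selected = ocfs_select(rows, labels, 1)
```

Under this assumed rule, each feature is scored independently in a single pass over the data, with no iterative search.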
Details
- Title: Subtitle
- OCFS: optimal orthogonal centroid feature selection for text categorization
- Creators
- Jun Yan - Peking University; Ning Liu - Tsinghua University; Benyu Zhang - Microsoft Research Asia; Shuicheng Yan - Chinese University of Hong Kong; Zheng Chen - Microsoft Research Asia; Qiansheng Cheng - Peking University; Weiguo Fan - Virginia Tech; Wei-Ying Ma - Microsoft Research Asia
- Resource Type
- Conference proceeding
- Publication Details
- Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval, pp.122-129
- Publisher
- ACM
- Series
- SIGIR '05
- DOI
- 10.1145/1076034.1076058
- Language
- English
- Date published
- 08/15/2005
- Academic Unit
- Business Analytics
- Record Identifier
- 9984380494402771