A quality-threshold data summarization algorithm

Viet Ha-Thuc; Duc-Cuong Nguyen; Padmini Srinivasan

doi:10.1109/RIVF.2008.4586362

Conference proceeding

2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies, pp.240-246

07/2008

Abstract

As database sizes increase, semantic data summarization techniques have been developed, so that data mining algorithms can be run on the summarized set for the sake of efficiency. Clustering algorithms such as K-Means have popularly been used as semantic summarization methods where cluster centers become the summarized set. The goal of semantic summarization is to provide a summarized view of the original dataset such that the summarization ratio is maximized while the error (i.e., information loss) is minimized. This paper presents a new clustering-based data summarization algorithm, in which the quality of the summarized set can be controlled. The algorithm partitions a dataset into a number of clusters until the distortion of each cluster is less than a given threshold, thus guaranteeing the summarized set has less than a fixed amount of information loss. Based on the threshold, the number of clusters is automatically determined. The proposed algorithm, unlike traditional K-Means, adjusts initial centers based on the information about the data space discovered so far, thus significantly alleviating the local optimum effect. Our experiments show that our algorithm generates higher quality clusters than K-Means does and it also guarantees an error bound, an essential criterion for data summarization.
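The threshold-driven idea in the abstract (keep partitioning until every cluster's distortion falls below a given bound, so the number of clusters follows from the threshold) can be illustrated with a minimal sketch. This is not the authors' algorithm: it substitutes a simple bisecting 2-means split for the paper's center-adjustment procedure, and it assumes distortion means the sum of squared Euclidean distances to the cluster centroid. The names `summarize`, `split_in_two`, `distortion`, and `centroid` are hypothetical.

```python
from math import dist  # Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def distortion(points):
    """Assumed definition: sum of squared distances to the centroid."""
    c = centroid(points)
    return sum(dist(p, c) ** 2 for p in points)


def split_in_two(points, iterations=10):
    """Split one cluster with a small 2-means run (bisecting step).

    Seeds: the point farthest from the centroid, and the point
    farthest from that seed. Returns None if no valid split exists.
    """
    c = centroid(points)
    a = max(points, key=lambda p: dist(p, c))
    b = max(points, key=lambda p: dist(p, a))
    centers = (a, b)
    groups = None
    for _ in range(iterations):
        g0, g1 = [], []
        for p in points:
            nearer_first = dist(p, centers[0]) <= dist(p, centers[1])
            (g0 if nearer_first else g1).append(p)
        if not g0 or not g1:
            break  # keep the last valid split, if any
        groups = (g0, g1)
        centers = (centroid(g0), centroid(g1))
    return groups


def summarize(points, threshold):
    """Return (centroid, size) pairs; each cluster's distortion < threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    points = [tuple(p) for p in points]
    if not points:
        return []
    pending, done = [points], []
    while pending:
        cluster = pending.pop()
        if distortion(cluster) < threshold:
            done.append(cluster)
            continue
        halves = split_in_two(cluster)
        if halves is None:
            done.append(cluster)  # cannot be split further
        else:
            pending.extend(halves)
    return [(centroid(c), len(c)) for c in done]


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
print(sorted(summarize(points, 3.0)))
# [((0.5, 0.5), 4), ((10.5, 10.5), 4)]
```

Lowering the threshold yields more, tighter clusters (a smaller error at a lower summarization ratio); raising it yields fewer. Because each split produces two non-empty parts, the loop terminates at the latest when every cluster holds a single point.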

Data Summarization (or Compression)

K-Means Clustering

Details

Title: A quality-threshold data summarization algorithm
Creators: Viet Ha-Thuc (Computer Science Department, University of Iowa, Iowa City, IA); Duc-Cuong Nguyen; Padmini Srinivasan (University of Iowa, Computer Science)
Resource Type: Conference proceeding
Publication Details: 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies, pp.240-246
DOI: 10.1109/RIVF.2008.4586362
Publisher: IEEE
Language: English
Date published: 07/2008
Academic Unit: Nursing; Computer Science; Business Analytics
Record Identifier: 9984003796002771