Journal article
Significant DBSCAN+: Statistically Robust Density-based Clustering
ACM transactions on intelligent systems and technology, Vol.12(5), 62
12/01/2021
DOI: 10.1145/3474842
Abstract
Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single "hotspot" of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed up the proposed approach.
Experiment results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10-20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.
Details
- Title: Subtitle
- Significant DBSCAN+: Statistically Robust Density-based Clustering
- Creators
- Yiqun Xie - Univ Maryland, 1124 Lefrak Hall, 7251 Preinkert Dr, College Pk, MD 20742 USA
- Xiaowei Jia - University of Pittsburgh
- Shashi Shekhar - University of Minnesota
- Han Bao - Univ Iowa, 108 John Pappajohn Business Bldg, Iowa City, IA 52242 USA
- Xun Zhou - Univ Iowa, 108 John Pappajohn Business Bldg, Iowa City, IA 52242 USA
- Resource Type
- Journal article
- Publication Details
- ACM transactions on intelligent systems and technology, Vol.12(5), 62
- Publisher
- Assoc Computing Machinery
- DOI
- 10.1145/3474842
- ISSN
- 2157-6904
- eISSN
- 2157-6912
- Number of pages
- 26
- Grant note
- 2105133; 2126474; 1901099; 1737633; IIS1320580; IIS-0940818; IIS-1218168; 1916518 / NSF; National Science Foundation (NSF)
- HM0476-20-1-0009 / USDOD; United States Department of Defense
- Google's AI for Social Good Impact Scholars program
- 2017-51181-27222 / USDA; United States Department of Agriculture (USDA)
- Minnesota Supercomputing Institute
- DE-AR0000795 / USDOE (ARPA-E); United States Department of Energy (DOE)
- UL1 TR002494; KL2TR002492; TL1 TR002493 / NIH; United States Department of Health & Human Services; National Institutes of Health (NIH) - USA
- Dean's Research Initiative Award at the University of Maryland
- Pitt Momentum Fund Award
- 69A3551747131 / Safety Research using Simulation University Transportation Center (SAFER-SIM) - US-DOT's University Transportation Centers Program
- G21AC10207 / USGS; United States Geological Survey
- Language
- English
- Date published
- 12/01/2021
- Academic Unit
- Business Analytics
- Record Identifier
- 9984380651302771