Dissertation
Agglomerative and divisive hierarchical Bayesian clustering with methods for longitudinal and time-to-event data
University of Iowa
Doctor of Philosophy (PhD), University of Iowa
Spring 2023
DOI: 10.25820/etd.007108
Abstract
Cluster analysis is a common unsupervised learning method used to discover subgroups in datasets by uncovering hidden patterns in data. Clustering methods often require the number of clusters to be specified, are restricted to certain data types through explicit or implicit assumptions, and are incompatible with temporal data. This thesis presents model-based Bayesian hierarchical clustering methods developed to address these issues. For cross-sectional data, agglomerative and divisive algorithms are presented that return nested clustering configurations and provide principled guidance on the plausible number of clusters. These algorithms performed better than other commonly used cluster analysis methods on clustering benchmark data, with the divisive method having the best performance of all methods tested. In order to cluster progressions of a heterogeneous disease with a complex disease process, the divisive methods were expanded to cluster using multiple longitudinal and/or time-to-event variables. By clustering based on models for multivariate longitudinal trajectories and semi-parametric models for survival data, these methods address problems unique to clustering temporal data, including temporal dependencies, correlations between clustering variables, missing data, and censoring. Using these methods, subgroups in idiopathic Parkinson’s disease were identified by differing progression patterns.
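To make the idea of a merge-favorability metric concrete, the following is a minimal sketch and not the thesis's algorithm. It assumes a deliberately simple model that does not come from the dissertation: one-dimensional data, a Gaussian likelihood with known variance `s2`, and a conjugate Gaussian prior on each cluster mean (`m0`, `t2`). All function names and default values are hypothetical. Each candidate merge is scored by a log Bayes factor comparing "one cluster" against "two clusters", and greedy merging yields a nested sequence of partitions; the first non-positive score gives rough guidance on where to stop.

```python
import math
from itertools import combinations


def log_marginal(xs, m0=0.0, s2=1.0, t2=100.0):
    """Log marginal likelihood of xs under x_i ~ N(mu, s2), mu ~ N(m0, t2)."""
    n = len(xs)
    d = [x - m0 for x in xs]
    ss = sum(v * v for v in d)   # sum of squared deviations from m0
    s = sum(d)                   # sum of deviations from m0
    # Determinant and quadratic form of the covariance s2*I + t2*11^T.
    log_det = (n - 1) * math.log(s2) + math.log(s2 + n * t2)
    quad = (ss - t2 * s * s / (s2 + n * t2)) / s2
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)


def merge_score(a, b, **prior):
    """Log Bayes factor: merged cluster versus the two clusters kept apart."""
    return (log_marginal(a + b, **prior)
            - log_marginal(a, **prior)
            - log_marginal(b, **prior))


def agglomerate(data, **prior):
    """Greedy agglomeration; returns (score, partition) after each merge."""
    clusters = [[x] for x in data]
    steps = []
    while len(clusters) > 1:
        # Pick the pair of clusters whose merge is most favorable.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merge_score(clusters[ij[0]], clusters[ij[1]], **prior),
        )
        score = merge_score(clusters[i], clusters[j], **prior)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        steps.append((score, [sorted(c) for c in clusters]))
    return steps


def suggested_partition(data, **prior):
    """Last partition reached before the first merge the data do not favor."""
    partition = [[x] for x in data]
    for score, merged in agglomerate(data, **prior):
        if score <= 0:
            break
        partition = merged
    return partition


if __name__ == "__main__":
    toy = [0.0, 0.2, -0.1, 10.0, 10.3, 9.8]  # two well-separated toy groups
    for score, partition in agglomerate(toy):
        print(f"{len(partition)} clusters, merge log BF = {score:.2f}")
    print(suggested_partition(toy))
```

On the toy data, the merges within each group have positive scores and the final merge across groups has a negative score, so the sketch stops at two clusters. A divisive variant would run the same comparison top-down, scoring candidate splits instead of merges.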
Details
- Title: Subtitle
- Agglomerative and divisive hierarchical Bayesian clustering with methods for longitudinal and time-to-event data
- Creators
- Elliot Burghardt
- Contributors
- Daniel Sewell (Advisor); Joseph Cavanaugh (Advisor); Patrick Breheny (Committee Member); Jeffrey Long (Committee Member); Brian Smith (Committee Member)
- Resource Type
- Dissertation
- Degree Awarded
- Doctor of Philosophy (PhD), University of Iowa
- Degree in
- Biostatistics
- Date degree season
- Spring 2023
- DOI
- 10.25820/etd.007108
- Publisher
- University of Iowa
- Number of pages
- ix, 96 pages
- Copyright
- Copyright 2023 Elliot Burghardt
- Language
- English
- Date submitted
- 05/01/2022
- Date approved
- 06/30/2023
- Description illustrations
- illustrations, tables, graphs
- Description bibliographic
- Includes bibliographical references (pages 58-67).
- Public Abstract (ETD)
- Cluster analysis methods expose hidden structure in datasets, revealing groups of subjects by their similarities. Sorting subjects into groups is straightforward when groups are labeled, but performing cluster analysis, that is, discovering group structure, is more challenging: the group labels, the group characteristics, and the number of groups are all unknown before clustering. By recognizing groups, data can be organized for better understanding. In medicine, there are many applications of cluster analysis. Examples include differentiating between various healthy and pathological tissue types in medical imaging; identifying cell types by their markers, size, and/or morphology in flow cytometry; and explaining the heterogeneity within a disease population by identifying subpopulations, a step toward better treatments and personalized medicine. Hierarchical clustering generates reasonable cluster configurations for each number of clusters, ranging from one to the number of observations; however, the number of clusters must still be provided in order to select the final clustering configuration. Model-based clustering can facilitate interpretability and yield measures of uncertainty, but it does not eliminate the need to specify the number of clusters. The Bayesian model-based hierarchical clustering approaches presented in this thesis use underlying probabilistic assumptions to provide an informative metric on the favorability of each merge or split and to guide the choice of the number of clusters. These methods have been extended to fit longitudinal and time-to-event data. Clustering performance is evaluated on benchmark clustering data, and the methods are applied to identify subpopulations in Parkinson’s disease.
- Academic Unit
- Biostatistics
- Record Identifier
- 9984428943602771