Agglomerative and divisive hierarchical Bayesian clustering with methods for longitudinal and time-to-event data

Elliot Burghardt

doi:10.25820/etd.007108

Dissertation

Open access

University of Iowa

Doctor of Philosophy (PhD), University of Iowa

Spring 2023

Files and links (1)

EBurghardt Thesis (PDF, 775.44 kB). Free to read and download, Open Access

Abstract

Cluster analysis is a common unsupervised learning method used to discover subgroups in datasets by uncovering hidden patterns in data. Clustering methods often require the number of clusters to be specified, are restricted to certain data types through explicit or implicit assumptions, and are incompatible with temporal data. This thesis presents model-based Bayesian hierarchical clustering methods developed to address these issues. For cross-sectional data, agglomerative and divisive algorithms are presented that return nested clustering configurations and provide principled guidance on the plausible number of clusters. These algorithms performed better than other commonly used cluster analysis methods on clustering benchmark data, with the divisive method having the best performance of all methods tested. In order to cluster progressions of a heterogeneous disease with a complex disease process, the divisive methods were expanded to cluster using multiple longitudinal and/or time-to-event variables. By clustering based on models for multivariate longitudinal trajectories and semi-parametric models for survival data, these methods address problems unique to clustering temporal data, including temporal dependencies, correlations between clustering variables, missing data, and censoring. Using these methods, subgroups in idiopathic Parkinson’s disease were identified by differing progression patterns.

cluster analysis

Dirichlet concentration parameter

Dirichlet distribution

finite mixture model

hierarchical clustering algorithms

longitudinal data analysis

Details

Title: Agglomerative and divisive hierarchical Bayesian clustering with methods for longitudinal and time-to-event data
Creators: Elliot Burghardt
Contributors: Daniel Sewell (Advisor); Joseph Cavanaugh (Advisor); Patrick Breheny (Committee Member); Jeffrey Long (Committee Member); Brian Smith (Committee Member)
Resource Type: Dissertation
Degree Awarded: Doctor of Philosophy (PhD), University of Iowa
Degree in: Biostatistics
Date degree season: Spring 2023
DOI: 10.25820/etd.007108
Publisher: University of Iowa
Number of pages: ix, 96 pages
Language: English
Date submitted: 05/01/2022
Date approved: 06/30/2023
Description illustrations: illustrations, tables, graphs
Description bibliographic: Includes bibliographical references (pages 58-67).
Public Abstract (ETD): Cluster analysis methods expose hidden structure in datasets, revealing groups of subjects by their similarities. Sorting subjects into groups is straightforward when groups are labeled, but cluster analysis, the task of discovering group structure, is more challenging because the group labels, the group characteristics, and the number of groups are all unknown beforehand. Recognizing groups allows data to be organized for better understanding. In medicine, there are many applications of cluster analysis. Examples include differentiating between various healthy and pathological tissue types in medical imaging; identifying cell types by their markers, size, and/or morphology in flow cytometry; and explaining the heterogeneity within a disease population by identifying subpopulations, a step toward better treatments and personalized medicine. Hierarchical clustering generates a reasonable cluster configuration for each number of clusters, ranging from one to the number of observations, but the number of clusters must still be provided in order to select the final configuration. Model-based clustering can facilitate interpretability and yield measures of uncertainty; however, it does not eliminate the need to specify the number of clusters. The Bayesian model-based hierarchical clustering approaches presented in this thesis use underlying probabilistic assumptions to provide an informative metric on the favorability of each merge or split and to guide the choice of the number of clusters. These methods have been extended to fit longitudinal and time-to-event data. Clustering performance is evaluated on benchmark clustering data, and the methods are applied to identify subpopulations in Parkinson’s disease.
Academic Unit: Biostatistics
Record Identifier: 9984428943602771
