Dissertation
Machine learning strategies for discovering genetic patterns in complex and rare conditions
University of Iowa
Doctor of Philosophy (PhD), University of Iowa
Autumn 2024
DOI: 10.25820/etd.007571
Abstract
Clinically important genetic biomarkers associated with rare disorders are challenging to uncover because of small sample sizes, high-dimensional variant data, and complex genetic interactions. Data-driven approaches such as machine learning (ML) methods combined with feature selection strategies are well suited to these challenges and can offer key biological insights. We developed GenoMLizer, a genome-wide ML tool that prioritizes variants as genetic modifiers of the symptoms and comorbidities associated with rare disorders. To date, GenoMLizer results have provided novel biological insight and additional support for known genetic associations with symptoms and comorbidities of two diseases, COVID-19 and Turner syndrome.
Whole genome sequencing was used to investigate the variants and genes associated with the loss of smell (anosmia) and the loss of taste (ageusia) with COVID-19 in two independent datasets: the University of Iowa (UI) cohort of 187 individuals and the All of Us (AoU) Research Program cohort of 947 individuals. Rare and common variants were analyzed with a novel variant and gene prioritization pipeline that uses machine learning methods to classify individuals. Models were assessed using a permutation-based variable importance (PVI) strategy for the final prioritization of candidate variants and genes along with feature selection. The highest held-out test set area under the receiver operating characteristic curve (AUROC) across models and datasets was 0.735 and 0.798 for the variant and gene analyses, respectively, in the UI cohort, and 0.687 for the variant analysis in the AoU cohort. This analysis prioritized several novel and known candidate genetic factors involved in immune response, neuronal signaling, and calcium signaling, supporting previously proposed hypotheses for anosmia/ageusia with COVID-19. The computational workflow used here led to the development of GenoMLizer, which is available for the analysis of genetic modifiers in similar disease datasets.
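As an illustration of the general idea behind permutation-based variable importance scored by held-out AUROC, the following is a minimal sketch and not the dissertation's pipeline. The toy data, the fixed scoring function `toy_model`, the number of repeats, and all function names are assumptions made for this example. Importance is taken as the mean drop in held-out AUROC when one feature column is shuffled.

```python
import random

def auroc(labels, scores):
    """AUROC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in AUROC when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = auroc(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the labels
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - auroc(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances

# Hypothetical held-out test set: feature 0 tracks the label, features 1 and 2 are noise.
rng = random.Random(1)
y_test = [i % 2 for i in range(40)]
X_test = [[y + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)] for y in y_test]

def toy_model(row):
    """Stand-in for an already fitted classifier's score; it ignores feature 2."""
    return 1.0 * row[0] + 0.1 * row[1]

baseline, importances = permutation_importance(toy_model, X_test, y_test)
# Features ordered from most to least important.
ranking = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
print("held-out AUROC:", round(baseline, 3))
print("feature ranking:", ranking)
```

Computing importance on the held-out set rather than the training set ties the ranking to generalization performance, which matches the abstract's emphasis on held-out evaluation; a feature the model never uses gets an importance of exactly zero.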
GenoMLizer includes proven feature selection strategies and also offers optimization for new datasets. It uses model feature importance to prioritize genomic features and model held-out test performance as a confidence metric for candidate selection. The GenoMLizer workflow produced statistically significant results for genetic association with the loss of smell or taste with COVID-19 (p-value ≤ 0.03), prioritized previously implicated collagen genes as genetic modifiers for bicuspid aortic valve with Turner syndrome (N=208), and outperformed other computational methods in similar analyses.
Lastly, ML methods can be applied to numerous biological questions outside of variant association, including prioritization of diagnostic tests for rare cancers such as neuroendocrine tumors (NETs). Differentiating NET primary sites is pivotal for patient care as different subtypes have distinct treatment approaches. ML models were trained on fluorescence in situ hybridization (FISH) assay metrics from 144 samples for primary site prediction. Decision tree (DT) models fit to ten dataset splits achieved a mean accuracy of 81.4% on held-out test sets (majority class accuracy = 59.0%). ERBB2 and MET variables ranked as top-performing features in 9 of 10 DT models and the full dataset model. These findings offer probabilistic guidance for FISH testing, emphasizing the prioritization of the ERBB2, SMAD4, and CDKN2A FISH probes in diagnosing NET primary sites.
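The evaluation pattern described above (fit a model on each of ten dataset splits, score it on the held-out portion, and compare against the majority class accuracy) can be sketched as follows. This is a hedged illustration only: a depth-1 decision stump stands in for the full decision tree models, and the toy data, the labels `site_A`/`site_B`, the 25% test fraction, and all function names are assumptions, not values from the study.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y):
    """Depth-1 tree: choose the (feature, threshold) split with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    if best is None:  # no usable split: fall back to the majority label
        fallback = majority(y)
        return (lambda row: fallback), None
    _, j, t, l_lab, r_lab = best
    return (lambda row: l_lab if row[j] <= t else r_lab), j

def evaluate_splits(X, y, n_splits=10, test_frac=0.25, seed=0):
    """Mean held-out accuracy of the stump and of the majority-class baseline."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_test = int(len(idx) * test_frac)
    model_acc, base_acc, feature_counts = [], [], Counter()
    for _ in range(n_splits):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        predict, feature = fit_stump(X_tr, y_tr)
        base = majority(y_tr)  # baseline always predicts the training majority class
        model_acc.append(sum(predict(X[i]) == y[i] for i in test) / n_test)
        base_acc.append(sum(base == y[i] for i in test) / n_test)
        feature_counts[feature] += 1  # which feature the split used in this run
    return sum(model_acc) / n_splits, sum(base_acc) / n_splits, feature_counts

# Hypothetical data: 60% of samples are site_A; feature 0 is informative, feature 1 is noise.
rng = random.Random(2)
y = ["site_A" if i % 5 < 3 else "site_B" for i in range(80)]
X = [[(0.0 if lab == "site_A" else 1.0) + rng.gauss(0, 0.3), rng.gauss(0, 1)] for lab in y]

mean_model, mean_base, feature_counts = evaluate_splits(X, y)
print("mean held-out accuracy:", round(mean_model, 3))
print("mean majority-class accuracy:", round(mean_base, 3))
print("splits using each feature:", dict(feature_counts))
```

Reporting the majority-class accuracy next to the model accuracy shows how much of the performance comes from the features rather than class imbalance, and counting which feature each split selects gives a simple stability check across the ten fits.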
Ultimately, these studies add to the existing body of literature and provide additional support for previously implicated genes and proposed hypotheses. This work increases our understanding of the genetics involved in the loss of smell and taste with COVID-19 and in bicuspid aortic valve with Turner syndrome, and it helps improve patient care, potentially allowing the best treatment approach to be chosen for each NET subtype. Finally, GenoMLizer provides a standardized, best-practice whole-genome ML analysis for investigating genetic modifiers in rare disorders.
Details
- Title: Machine learning strategies for discovering genetic patterns in complex and rare conditions
- Creators: Lucas Pietan
- Contributors: Thomas Casavant (Advisor); Benjamin Darbro (Advisor); Terry Braun (Committee Member); Brian Smith (Committee Member); Michael Schnieders (Committee Member)
- Resource Type: Dissertation
- Degree Awarded: Doctor of Philosophy (PhD), University of Iowa
- Degree in: Genetics
- Date degree season: Autumn 2024
- DOI: 10.25820/etd.007571
- Publisher: University of Iowa
- Number of pages: xviii, 299 pages
- Copyright: Copyright 2024 Lucas Pietan
- Language: English
- Date submitted: 12/02/2024
- Description illustrations: illustrations, tables, graphs
- Description bibliographic: Includes bibliographical references (pages 126-132).
- Public Abstract (ETD): Complex genetic traits and disorders arise from numerous genetic factors. Studying these conditions requires examining the whole genome for associations among over 20,000 genes and often tens of millions of variants, which necessitates large sample sizes. Investigating complex genetic conditions that accompany rare disorders, which have inherently small sample sizes, adds to the challenge. Machine learning (ML) techniques can handle the vast number of genes and variants and can detect complex patterns within the data, even with limited sample sizes, that standard methods cannot. In this thesis, a novel ML tool, GenoMLizer, was developed, tested, and validated on real disease datasets. GenoMLizer was used to analyze whole genome sequence data, select informative genetic factors, and apply ML models to prioritize candidate variants and genes. GenoMLizer highlighted genes in the immune response, neuronal signaling, and calcium signaling pathways for the development of the loss of smell or taste with COVID-19, and collagen pathways for bicuspid aortic valve with Turner syndrome, aligning with previously proposed hypotheses. GenoMLizer results also outperformed the standard methods in the field. In a final, related study, ML methods achieved high performance and were able to prioritize diagnostic tests for differentiating neuroendocrine tumor subtypes. These studies increase our understanding of the development of loss of smell or taste with COVID-19 and bicuspid aortic valve with Turner syndrome and help improve patient care for neuroendocrine tumors. Finally, GenoMLizer provides a standardized whole genome ML workflow for the study of similar complex conditions with rare disorders.
- Academic Unit: Interdisciplinary Graduate Program in Genetics
- Record Identifier: 9984774456802771