Representative random sampling for feature engineering of -Omics data: using machine learning to identify biomarkers for head and neck squamous cell carcinoma

Michael C. Rendleman

doi:10.17077/etd.006280

Dissertation

Open access

University of Iowa

Doctor of Philosophy (PhD), University of Iowa

Autumn 2021

Files and links (1)

MCR Dissertation Final corrected again (PDF, 3.45 MB)

Free to read and download, Open Access

Abstract

High-dimensional cancer data can be burdensome to analyze, with complex relationships between molecular measurements, clinical diagnostics, and treatment outcomes. Data-driven computational approaches may be key to identifying research targets with potential clinical or research use, also known as biomarkers. To this end, we designed a framework for engineering and identifying biomarkers for survival model building, applying a variety of established and novel feature engineering methods on publicly available Head and Neck Squamous Cell Carcinoma (HNSCC) data. This dataset includes over 500 cases and spans numerous data types including clinical data, RNA sequencing, and tumor-normal DNA variation.
Given the limited size of the dataset, a specialized sampling technique was devised to increase reliability of performance estimation with less computation. Traditionally, resampling methods such as cross validation or repeated holdout have been used to estimate model performance, as they produce more robust estimates. Because exploratory evaluations in the feature selection framework required an intractable manual inspection and assessment process, we propose employing a novel holdout sampling procedure, Representative Random Sampling (RRS). RRS first quantizes the continuous outcome into equipopulous bins of minimum size and then selects the holdout set via stratified sampling. Utilizing thorough simulations on synthetic molecular data, we have determined that this approach yields at least modest reductions in error and bias when compared to standard holdout, though direct cross validation may still be significantly more effective at reducing error and bias. Additionally, model selection has a large effect on error and bias estimation: RRS produced the most consistent decreases in error and bias with random forest-based models.
Using RRS, a two-stage analysis framework enables evaluation and selection of prospective biomarker features which are then applied to survival modeling. Thousands of raw and processed molecular features were assessed on their ability to predict clinical diagnostics and patient survival, ultimately supporting a predictive survival model that outperformed corresponding clinical models. Model analysis demonstrated associations between patient outcomes and biological pathways and processes, several of which are the subject of recent and ongoing oncology research in HNSCC and other cancers. Additionally, unsupervised transformations of RNA expression data facilitated by denoising autoencoders (DAE) were found to strengthen prognostic models against overfitting and in predictive performance.
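The abstract describes RRS only in outline: quantize the continuous outcome into equal-population bins, then draw the holdout set by stratified sampling. The following is a minimal sketch of that idea, not the dissertation's implementation. The function name `rrs_holdout`, the `holdout_frac` and `seed` parameters, and the rule of using the smallest bin size that lets each bin contribute exactly one holdout case are all assumptions made for illustration.

```python
import random


def rrs_holdout(outcomes, holdout_frac=0.2, seed=0):
    """Split sample indices into (train, holdout) by rank-based stratification.

    Hypothetical sketch: bins hold consecutive outcome ranks, so every bin has
    the same population, and one holdout case is drawn at random per bin.
    """
    if not 0 < holdout_frac <= 0.5:
        raise ValueError("holdout_frac must be in (0, 0.5]")
    rng = random.Random(seed)

    # Sort indices by outcome so consecutive slices are equal-population bins.
    order = sorted(range(len(outcomes)), key=lambda i: outcomes[i])

    # Assumption: the smallest bin that still yields one holdout case per bin.
    bin_size = max(2, round(1 / holdout_frac))

    holdout = []
    for start in range(0, len(order), bin_size):
        bin_idx = order[start:start + bin_size]
        if len(bin_idx) == bin_size:
            holdout.append(rng.choice(bin_idx))
        elif rng.random() < len(bin_idx) / bin_size:
            # Trailing partial bin: sample in proportion to its size.
            holdout.append(rng.choice(bin_idx))

    held = set(holdout)
    train = [i for i in range(len(outcomes)) if i not in held]
    return train, sorted(holdout)
```

Called as `train_idx, holdout_idx = rrs_holdout(survival_days, holdout_frac=0.2)`, where `survival_days` is a hypothetical list of continuous outcomes, the holdout set covers the full outcome range rather than clustering by chance in one part of it, which is the property the abstract attributes to RRS.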

Machine Learning

Oncology

feature engineering

squamous cell carcinoma

stratified sampling

survival prediction

unsupervised transformations

Details

Title: Representative random sampling for feature engineering of -Omics data: using machine learning to identify biomarkers for head and neck squamous cell carcinoma
Creators: Michael C. Rendleman
Contributors: Thomas L Casavant (Advisor)
Terry A Braun (Committee Member)
John M Buatti (Committee Member)
Guadalupe Canahuate (Committee Member)
Brian J Smith (Committee Member)
Resource Type: Dissertation
Degree Awarded: Doctor of Philosophy (PhD), University of Iowa
Degree in: Electrical and Computer Engineering
Date degree season: Autumn 2021
DOI: 10.17077/etd.006280
Publisher: University of Iowa
Number of pages: xv, 93 pages
Language: English
Illustrations: illustrations (some color)
Bibliography: Includes bibliographical references (pages 85-88).
Public Abstract (ETD): Cancer is a class of diseases that 40% of people are diagnosed with at some point in their lives. In modern medicine, new technologies are changing how we study and understand these diseases. In this thesis, we focus on a class of cancers called head and neck squamous cell carcinoma (HNSCC). It ranks 6th in the world by prevalence and is associated with human papillomavirus (HPV) as well as the use of tobacco and alcohol.

Precision medicine is guiding the development of new tests and treatments. Cancer researchers are doing more data collection than ever before, including whole genome and tumor DNA sequencing. A challenge with this kind of data is that it can be voluminous, overwhelming, and difficult to interpret. To be able to make sense of this complex data, researchers use computers and statistics. In recent years, some have been able to apply newer tools such as machine learning to aid in their research.

We propose a new method that can produce more accurate results with limited data. By applying this new approach to publicly available cancer data, we compare different ways of using machine learning to study HNSCC. Results show that transforming the data with a trained neural network could improve prediction of treatment outcomes. Additionally, this kind of transformed data may prove useful in the diagnosis and treatment of HNSCC.
Academic Unit: Electrical and Computer Engineering
Record Identifier: 9984210527002771
