Penalized linear mixed models for structured genetic data

Anna C Reisetter

doi:10.17077/etd.005994

Dissertation

Open access

University of Iowa

Doctor of Philosophy (PhD), University of Iowa

Summer 2021

Files and links (1)

pdf

Reisetter_thesis212.08 MB

Free to read and download, Open Access

Abstract

Genetic association studies have enhanced our understanding of the genetic basis of quantitative traits and disease. To that end, accurately identifying genotype-phenotype associations is of critical importance. Such associations may be used in a wide range of medical and scientific research, including drug discovery, predictive models of disease, and the development of genetic risk scores. Penalized regression methods are a valuable tool for identifying such associations in high dimensions, a common feature of genetic data. However, these methods are based on a loss function motivated by independence among subjects. This assumption is often violated in genome-wide association study (GWAS) data due to the presence of population stratification, cryptic relatedness, and unobserved confounding effects. These factors result in complex sample structures, which, when unaccounted for, may result in biased estimates and spurious associations. Penalized linear mixed models (LMMs) have been developed to accurately identify genotype-phenotype associations in the presence of such structured samples. In spite of this, the statistical properties of these models are not well understood and their appropriate implementation has not been explicitly studied. In addition, there is a lack of available software for their use. The first objective of this dissertation is to provide a detailed review of penalized LMMs for the analysis of structured genetic data, while examining their statistical properties in the genetic association setting. Second, we evaluate the statistical properties of penalized LMMs in a general setting. We develop appropriate methods for centering and scaling data for penalized LMMs, and present an effective method of cross-validation. We compare the efficacy of this cross-validation method with that of information criteria recommended for use in penalized LMMs, and provide recommendations for data preprocessing and penalty parameter selection. We demonstrate the benefits of our recommendations using both a general simulation framework and one specific to genetic data. We conclude with a detailed analysis of a large, empirical GWAS data set that contains complex sample structure. We use this analysis to illustrate the benefits and potential pitfalls of penalized LMMs compared to traditional GWAS methods, and to demonstrate the utility of penalizedLMM, an R package we have developed for the flexible and user-friendly implementation of penalized LMMs.

Genetics

public abstract

Details

Title: Penalized linear mixed models for structured genetic data
Creators: Anna C Reisetter
Contributors: Patrick Breheny (Advisor)
Michael Jones (Committee Member)
Jacob Michaelson (Committee Member)
Kelli Ryckman (Committee Member)
Kai Wang (Committee Member)
Resource Type: Dissertation
Degree Awarded: Doctor of Philosophy (PhD), University of Iowa
Degree in: Biostatistics
Date degree season: Summer 2021
DOI: 10.17077/etd.005994
Publisher: University of Iowa
Number of pages: xi, 115 pages
Language: English
Description illustrations: illustrations (chiefly color)
Description bibliographic: Includes bibliographical references (pages 108-115)
Public Abstract (ETD): Genetic association studies have enhanced our understanding of the genetic basis of quantitative traits and disease. To that end, accurately identifying genotype-phenotype associations is of critical importance. Such associations may be used in a wide range of medical and scientific research, including drug discovery, predictive models of disease, and the development of genetic risk scores. Penalized regression methods are a valuable tool for identifying such associations when the number of variables exceeds the number of observations, as is common in genetic data. However, these methods face added complexity when applied to the analysis of genome-wide association study (GWAS) data, which are often subject to relatedness and unobserved environmental effects. These factors result in complex sample structures, which, when unaccounted for, hinder analysis.

Penalized linear mixed models (LMMs) have been developed to accurately identify genotype-phenotype associations in the presence of dependent samples. In spite of this, the statistical properties of these models are not well understood. In addition, there is a lack of available software for their implementation. The first objective of this dissertation is to provide a detailed review of penalized LMMs for the analysis of structured genetic data, while examining their statistical properties in the genetic association setting. Second, we consider the statistical properties of penalized LMMs in a general setting and provide recommendations for key components of their implementation, including appropriate data preprocessing. We demonstrate the benefits of our recommendations using both a general setting and one specific to genetic data. We conclude with a detailed analysis of a large, empirical GWAS data set that contains complex correlation among samples. We use this analysis to illustrate the benefits and potential pitfalls of penalized LMMs compared to traditional GWAS methods, and to demonstrate the utility of penalizedLMM, an R package we have developed for the flexible and user-friendly implementation of penalized LMMs.
Academic Unit: Biostatistics
Record Identifier: 9984124172902771
