A new class of information criteria for improved prediction in the presence of training/validation data heterogeneity
Abstract
Details
- Title
- A new class of information criteria for improved prediction in the presence of training/validation data heterogeneity
- Creators
- Javier E Flores
- Contributors
- Joseph E. Cavanaugh (Advisor)
- Andrew A. Neath (Committee Member)
- Gideon K. D. Zamba (Committee Member)
- Jacob Oleson (Committee Member)
- Hyunkeun Cho (Committee Member)
- Resource Type
- Dissertation
- Degree Awarded
- Doctor of Philosophy (PhD), University of Iowa
- Degree in
- Biostatistics
- Degree season
- Spring 2021
- DOI
- 10.17077/etd.006094
- Publisher
- University of Iowa
- Number of pages
- xix, 176 pages
- Copyright
- Copyright 2021 Javier E Flores
- Language
- English
- Description (illustrations)
- illustrations (some color)
- Description (bibliographic)
- Includes bibliographical references (pages 173-176).
- Public Abstract (ETD)
In an era where data-driven reasoning is paramount, predictive modeling is an important tool for leveraging data to extract the actionable insights that drive scientific innovation. However, given an abundance of data and modeling techniques, the enterprising analyst is often faced with the challenge of identifying the optimal way to model these data so as to distill the knowledge they convey. One answer to this challenge is found in model selection (i.e., information) criteria.
Model selection criteria allow for the rank-ordering of a candidate collection of models according to a joint measure of their complexity and fidelity to the underlying data. Generally speaking, predictive models that are overly complex yield predictions that are less systematically biased but highly imprecise, varying wildly with even the slightest changes to the data. Conversely, models that are too simplistic insufficiently characterize the underlying data and yield biased (but more precise) predictions that may be far from the truth. Thus, the use of model selection criteria allows one to identify the model that strikes the best balance along the complexity/simplicity spectrum, yielding predictions that are neither highly variable nor systematically too high or too low.
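The complexity/fidelity trade-off described above can be illustrated with a classical criterion such as AIC, which scores each candidate model by its lack of fit plus a penalty on its number of parameters. The sketch below is purely illustrative (it uses simulated data and standard AIC, not the new criteria this dissertation proposes):

```python
import numpy as np

def aic(n, rss, k):
    """AIC for a Gaussian linear model, up to an additive constant:
    n * log(RSS / n) + 2k, where k is the number of estimated parameters."""
    return n * np.log(rss / n) + 2 * k

# Simulated data from a quadratic trend with noise (illustrative only).
rng = np.random.default_rng(0)
n = 100
x = np.linspace(-2, 2, n)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)

# Fit polynomial models of increasing complexity and score each with AIC.
scores = {}
for degree in range(1, 7):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    scores[degree] = aic(n, rss, degree + 1)  # degree + 1 coefficients

# The lowest score balances fit against complexity: an underfit linear model
# is penalized for its large RSS, while high-degree models pay for extra terms.
best = min(scores, key=scores.get)
```

Here the quadratic model typically wins: the linear fit misses the curvature entirely, while higher-degree fits reduce the residual sum of squares only marginally, at the cost of the complexity penalty.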
This dissertation introduces a new class of selection criteria that improve upon the abilities of existing criteria to select models that best strike the bias/variability balance. Currently available criteria rely on the assumption that the target of prediction (i.e., the validation data) and the data used to construct each model (i.e., the training data) follow identical distributions. This assumption is clearly misaligned with the premise of prediction, where one desires to predict a set of new data that is likely characteristically different from the data at hand. Recognizing this disconnect, the class of model selection criteria introduced in this thesis leads to the selection of good predictive models regardless of the relationship between the validation and training datasets. We demonstrate the utility of our criteria across a variety of popular modeling frameworks and predictive scenarios, and we compare their performance to a subset of widely implemented selection criteria.
- Academic Unit
- Biostatistics
- Record Identifier
- 9984097169902771