Representative random sampling: an empirical evaluation of a novel bin stratification method for model performance estimation

Michael C. Rendleman; Brian J. Smith; Guadalupe Canahuate; Terry A. Braun; John M. Buatti; Thomas L. Casavant

doi:10.1007/s11222-022-10138-7

Journal article

Open access

Peer reviewed

Statistics and computing, Vol.32(6), 101

2022

Files and links (1)

https://doi.org/10.1007/s11222-022-10138-7

Published (Version of record) Open Access

Abstract

High-dimensional cancer data can be burdensome to analyze, with complex relationships between molecular measurements, clinical diagnostics, and treatment outcomes. Data-driven computational approaches may be key to identifying relationships with potential clinical or research use. To this end, reliable comparison of feature engineering approaches in their ability to support machine learning survival modeling is crucial. With the limited number of cases often present in multi-omics datasets (“big p, little n,” or many features, few subjects), a resampling approach such as cross validation (CV) would provide robust model performance estimates at the cost of flexibility in intermediate assessments and exploration in feature engineering approaches. A holdout (HO) estimation approach, however, would permit this flexibility at the expense of reliability. To provide more reliable HO-based model performance estimates, we propose a novel sampling procedure: representative random sampling (RRS). RRS is a special case of continuous bin stratification which minimizes significant relationships between random HO groupings (or CV folds) and a continuous outcome. Monte Carlo simulations used to evaluate RRS on synthetic molecular data indicated that RRS-based HO (RRHO) yields statistically significant reductions in error and bias when compared with standard HO. Similarly, more consistent reductions are observed with RRS-based CV. While resampling approaches are the ideal choice for performance estimation with limited data, RRHO can enable more reliable exploratory feature engineering than standard HO.
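To illustrate the general idea of continuous bin stratification mentioned in the abstract, the following is a minimal sketch, not the authors' RRS procedure, whose details are not given in this record. It assumes equal-count bins over the ranked outcome and a fixed holdout fraction drawn from each bin. The names `bin_stratified_holdout`, `mean_gap`, `test_fraction`, and `n_bins` are hypothetical and chosen only for this example.

```python
import random
import statistics


def bin_stratified_holdout(y, test_fraction=0.25, n_bins=4, seed=0):
    """Split indices of y into (train, test), drawing the holdout per outcome bin.

    Illustrative assumption: bins are equal-count slices of the ranked outcome.
    """
    rng = random.Random(seed)
    # Rank cases by outcome, then cut the ranking into n_bins contiguous bins.
    order = sorted(range(len(y)), key=lambda i: y[i])
    test = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        members = order[start:stop]
        # Draw the same fraction of holdout cases from every bin.
        n_test = round(test_fraction * len(members))
        test.extend(rng.sample(members, n_test))
    test_set = set(test)
    train = [i for i in range(len(y)) if i not in test_set]
    return train, sorted(test)


def mean_gap(y, train, test):
    """Absolute difference in mean outcome between groups, in SD units of y.

    Assumes y is not constant (the standard deviation must be nonzero).
    """
    sd = statistics.pstdev(y)
    gap = statistics.fmean(y[i] for i in test) - statistics.fmean(y[i] for i in train)
    return abs(gap) / sd


# Usage on a synthetic continuous outcome (made-up data for illustration).
rng = random.Random(42)
y = [rng.gauss(0.0, 1.0) for _ in range(200)]
train, test = bin_stratified_holdout(y, test_fraction=0.2, n_bins=4, seed=1)
print(len(train), len(test), round(mean_gap(y, train, test), 3))
```

Because every bin contributes the same share of cases to the holdout, the holdout and training groups cover the outcome range similarly, which is the property that stratifying on a continuous outcome is meant to provide. The `mean_gap` helper is only one rough way to inspect how strongly group membership relates to the outcome; the paper's own criterion may differ.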

Computer Science

Artificial Intelligence

Probability and Statistics in Computer Science

Statistical Theory and Methods

Statistics and Computing/Statistics Programs

Details

Title: Representative random sampling: an empirical evaluation of a novel bin stratification method for model performance estimation
Creators: Michael C. Rendleman - University of Iowa
Brian J. Smith - University of Iowa
Guadalupe Canahuate - University of Iowa
Terry A. Braun - University of Iowa
John M. Buatti - University of Iowa
Thomas L. Casavant - University of Iowa
Resource Type: Journal article
Publication Details: Statistics and computing, Vol.32(6), 101
DOI: 10.1007/s11222-022-10138-7
ISSN: 0960-3174
eISSN: 1573-1375
Publisher: Springer US
Language: English
Date published: 2022
Academic Unit: Roy J. Carver Department of Biomedical Engineering; Electrical and Computer Engineering; Biostatistics; Radiation Oncology; Neurosurgery; Otolaryngology; Holden Comprehensive Cancer Center
Record Identifier: 9984306832902771

Metrics

16 Record Views

6 Times Cited - Web of Science