Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting

Nathaniel MacNell; Lydia Feinstein; Jesse Wilkerson; Pӓivi M Salo; Samantha A Molsberry; Michael B Fessler; Peter S Thorne; Alison A Motsinger-Reif; Darryl C Zeldin

doi:10.1371/journal.pone.0280387

Journal article

Open access

Peer reviewed

PloS one, Vol.18(1), e0280387

2023

DOI: 10.1371/journal.pone.0280387

PMCID: PMC9838837

PMID: 36638125

Files and links

https://doi.org/10.1371/journal.pone.0280387

Published (Version of record) Open Access

Abstract

Despite the prominent use of complex survey data and the growing popularity of machine learning methods in epidemiologic research, few machine learning software implementations offer options for handling complex samples. A major challenge impeding the broader incorporation of machine learning into epidemiologic research is incomplete guidance for analyzing complex survey data, including the importance of sampling weights for valid prediction in target populations. Using data from 15,820 participants in the 1988-1994 National Health and Nutrition Examination Survey cohort, we determined whether ignoring weights in gradient boosting models of all-cause mortality affected prediction, as measured by the F1 score and corresponding 95% confidence intervals. In simulations, we additionally assessed the impact of sample size, weight variability, predictor strength, and model dimensionality. In the National Health and Nutrition Examination Survey data, unweighted model performance was inflated compared to the weighted model (F1 score 81.9% [95% confidence interval: 81.2%, 82.7%] vs 77.4% [95% confidence interval: 76.1%, 78.6%]). However, the error was mitigated if the F1 score was subsequently recalculated with observed outcomes from the weighted dataset (F1: 77.0%; 95% confidence interval: 75.7%, 78.4%). In simulations, this finding held in the largest sample size (N = 10,000) under all analytic conditions assessed. For sample sizes <5,000, sampling weights had little impact in simulations that more closely resembled a simple random sample (low weight variability) or in models with strong predictors, but findings were inconsistent under other analytic scenarios. Failing to account for sampling weights in gradient boosting models may limit generalizability for data from complex surveys, dependent on sample size and other analytic properties. In the absence of software for configuring weighted algorithms, post-hoc re-calculations of unweighted model performance using weighted observed outcomes may more accurately reflect model prediction in target populations than ignoring weights entirely.
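
The post-hoc recalculation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' exact procedure, so this is one plausible reading under stated assumptions: predictions come from a model fit without weights, and the F1 score is then recomputed so that each observation contributes in proportion to its sampling weight. The function name `weighted_f1` and the toy data are hypothetical and are not taken from the article.

```python
def weighted_f1(y_true, y_pred, weights=None):
    """F1 score in which each observation counts in proportion to its weight.

    Hypothetical helper for illustration; with weights=None it reduces to
    the ordinary (unweighted) F1 score for binary 0/1 labels.
    """
    if weights is None:
        weights = [1.0] * len(y_true)
    if not (len(y_true) == len(y_pred) == len(weights)):
        raise ValueError("y_true, y_pred and weights must have equal length")

    tp = fp = fn = 0.0
    for truth, pred, w in zip(y_true, y_pred, weights):
        if pred == 1 and truth == 1:
            tp += w  # weighted true positives
        elif pred == 1 and truth == 0:
            fp += w  # weighted false positives
        elif pred == 0 and truth == 1:
            fn += w  # weighted false negatives

    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Toy data (made up): the missed case at index 1 carries a large weight,
# i.e. it represents many people in the target population.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
sampling_weights = [1.0, 4.0, 1.0, 1.0, 1.0, 2.0]

unweighted = weighted_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred, sampling_weights)
print(f"unweighted F1 = {unweighted:.3f}, weighted F1 = {weighted:.3f}")
```

In this toy case the unweighted score is higher than the weighted one because the error falls on a heavily weighted observation, which mirrors the direction of the inflation reported in the abstract but says nothing about its size. Where scikit-learn is available, `sklearn.metrics.f1_score` accepts a `sample_weight` argument that serves the same purpose.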

Algorithms

Machine Learning

Software

Humans

Nutrition Surveys

Surveys and Questionnaires

Details

Title: Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting
Creators: Nathaniel MacNell - Social & Scientific Systems, a DLH Holdings Company, Durham, North Carolina, United States of America.
Lydia Feinstein - University of North Carolina at Chapel Hill
Jesse Wilkerson - Social & Scientific Systems, a DLH Holdings Company, Durham, North Carolina, United States of America.
Pӓivi M Salo - National Institute of Environmental Health Sciences
Samantha A Molsberry - Social & Scientific Systems, a DLH Holdings Company, Durham, North Carolina, United States of America.
Michael B Fessler - National Institute of Environmental Health Sciences
Peter S Thorne - University of Iowa
Alison A Motsinger-Reif - National Institute of Environmental Health Sciences
Darryl C Zeldin - National Institute of Environmental Health Sciences
Resource Type: Journal article
Publication Details: PloS one, Vol.18(1), e0280387
DOI: 10.1371/journal.pone.0280387
PMID: 36638125
PMCID: PMC9838837
NLM abbreviation: PLoS One
ISSN: 1932-6203
eISSN: 1932-6203
Grant note: DOI: 10.13039/100000066, name: National Institute of Environmental Health Sciences, award: Z01 ES025041; DOI: 10.13039/100000066, name: National Institute of Environmental Health Sciences, award: Z01 ES102005; DOI: 10.13039/100000066, name: National Institute of Environmental Health Sciences, award: HHSN273201600002I Social & Scientific Systems
Language: English
Date published: 2023
Academic Unit: Civil and Environmental Engineering; Occupational and Environmental Health
Record Identifier: 9984360034602771

Metrics

24 Times Cited - Web of Science