Journal article
An Empirical Evaluation of Large Language Models on Consumer Health Questions
BioMedInformatics, Vol.5(1), p.12
02/27/2025
DOI: 10.3390/biomedinformatics5010012
Abstract
Background: Large Language Models (LLMs) have demonstrated strong performance on clinical question-answering (QA) benchmarks, yet their effectiveness in addressing real-world consumer medical queries remains underexplored. This study evaluates the capabilities and limitations of LLMs in answering consumer health questions using the MedRedQA dataset, which consists of medical questions and answers by verified experts from the AskDocs subreddit. Methods: Five LLMs (GPT-4o mini, Llama 3.1-70B, Mistral-123B, Mistral-7B, and Gemini-Flash) were assessed using a cross-evaluation framework: each model generated responses to the consumer queries, and each model's outputs were then judged by all five models against the expert responses. Human evaluation was used to assess the reliability of the models as evaluators. Results: GPT-4o mini achieved the highest alignment with expert responses according to four of the five model judges, while Mistral-7B scored lowest according to three of the five. Overall, model responses showed low alignment with expert responses. Conclusions: Current small- and medium-sized LLMs struggle to provide accurate answers to consumer health questions and require significant improvement.
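The cross-evaluation framework described in the abstract can be sketched as follows. This is an illustrative outline, not the authors' code: the model names, the `generate` and `judge` callables, and the toy scoring function are hypothetical placeholders standing in for real model API calls.

```python
def cross_evaluate(models, questions, expert_answers, generate, judge):
    """Return scores[responder][evaluator] = mean alignment with expert answers.

    Sketch of the cross-evaluation setup: every model answers every question,
    and every model then judges each answer against the expert reference.
    """
    # Each responder model answers all consumer questions.
    answers = {m: [generate(m, q) for q in questions] for m in models}
    scores = {}
    for responder in models:
        scores[responder] = {}
        for evaluator in models:
            # Every model (including the responder itself) rates each
            # answer against the verified expert response.
            ratings = [
                judge(evaluator, ans, ref)
                for ans, ref in zip(answers[responder], expert_answers)
            ]
            scores[responder][evaluator] = sum(ratings) / len(ratings)
    return scores


# Toy stand-ins so the sketch runs; a real pipeline would call model APIs
# and prompt the evaluator model to rate alignment.
def toy_generate(model, question):
    return f"{model}: consult a clinician about {question}"


def toy_judge(evaluator, answer, reference):
    # Hypothetical judge: fraction of reference words present in the answer.
    ref_words = reference.lower().split()
    hits = sum(w in answer.lower() for w in ref_words)
    return hits / len(ref_words)


models = ["GPT-4o mini", "Llama 3.1-70B", "Mistral-7B"]
questions = ["this rash"]
experts = ["Please consult a clinician about this rash."]
scores = cross_evaluate(models, questions, experts, toy_generate, toy_judge)
```

The result is a square score matrix over (responder, evaluator) pairs, which is how claims like "highest alignment according to four of the five model judges" can be read off.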
Details
- Title
- An Empirical Evaluation of Large Language Models on Consumer Health Questions
- Creators
- Moaiz Abrar (University of Iowa, IIHR Hydroscience & Engineering, Iowa City, IA 52246, USA); Yusuf Sermet (University of Iowa); Ibrahim Demir (Tulane University)
- Resource Type
- Journal article
- Publication Details
- BioMedInformatics, Vol.5(1), p.12
- DOI
- 10.3390/biomedinformatics5010012
- ISSN
- 2673-7426
- eISSN
- 2673-7426
- Publisher
- MDPI
- Number of pages
- 16
- Language
- English
- Date published
- 02/27/2025
- Academic Unit
- Electrical and Computer Engineering; Civil and Environmental Engineering; IIHR--Hydroscience and Engineering; Injury Prevention Research Center
- Record Identifier
- 9985132189502771