An Empirical Evaluation of Large Language Models on Consumer Health Questions

Moaiz Abrar; Yusuf Sermet; Ibrahim Demir

doi:10.48550/arxiv.2501.00208

Preprint

Open access

arXiv.org

Cornell University

12/30/2024

URL: https://doi.org/10.48550/arxiv.2501.00208

Preprint (Author's original). This preprint has not been evaluated by subject experts through peer review. Preprints may undergo extensive changes and/or become peer-reviewed journal articles.

Abstract

This study evaluates the performance of several Large Language Models (LLMs) on MedRedQA, a dataset of consumer medical questions paired with answers from verified experts, extracted from the AskDocs subreddit. While LLMs have shown proficiency on clinical question answering (QA) benchmarks, their effectiveness on real-world, consumer-based medical questions remains less well understood. MedRedQA presents unique challenges, such as informal language and the need for precise responses suited to non-specialist queries. To assess model performance, responses were generated using five LLMs: GPT-4o mini, Llama 3.1: 70B, Mistral-123B, Mistral-7B, and Gemini-Flash. A cross-evaluation method was used, in which each model evaluated its own responses as well as those of the other models, to minimize bias. The results indicated that GPT-4o mini achieved the highest alignment with expert responses according to four of the five model judges, while Mistral-7B scored lowest according to three of the five. This study highlights the potential and limitations of current LLMs for consumer health question answering and indicates avenues for further development.
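The cross-evaluation described in the abstract can be pictured as a judge-by-responder score table, from which one counts how many judges rank each model highest or lowest. The following is a minimal sketch of that tallying step only. The function names, the placeholder model names, the nested-dictionary layout, and every score value are hypothetical and chosen for illustration; the abstract does not specify the scoring scale or the aggregation the authors used.

```python
from collections import Counter


def best_and_worst_per_judge(scores):
    """Return the top- and bottom-scored responder for every judge.

    scores[judge][responder] is assumed to hold the mean agreement score
    that `judge` assigned to `responder`'s answers (hypothetical layout).
    """
    best, worst = {}, {}
    for judge, row in scores.items():
        best[judge] = max(row, key=row.get)
        worst[judge] = min(row, key=row.get)
    return best, worst


def tally(picks):
    """Count how many judges selected each responder."""
    return Counter(picks.values())


# Hypothetical toy values, NOT results from the paper.
# Each model acts as both judge (outer key) and responder (inner key).
toy_scores = {
    "model_a": {"model_a": 0.6, "model_b": 0.4, "model_c": 0.2},
    "model_b": {"model_a": 0.5, "model_b": 0.7, "model_c": 0.3},
    "model_c": {"model_a": 0.8, "model_b": 0.5, "model_c": 0.6},
}

best, worst = best_and_worst_per_judge(toy_scores)
best_counts = tally(best)    # judges that ranked each responder highest
worst_counts = tally(worst)  # judges that ranked each responder lowest
```

A statement such as "highest according to four of the five judges" corresponds to reading one entry of `best_counts` in a table with five judges. Keeping the per-judge rows separate, rather than averaging across judges, is what lets a reader see whether a model favors its own responses.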

Computer Science - Artificial Intelligence

Computer Science - Computation and Language

Details

Title: An Empirical Evaluation of Large Language Models on Consumer Health Questions
Creators: Moaiz Abrar; Yusuf Sermet; Ibrahim Demir
Resource Type: Preprint
Publication Details: arXiv.org
DOI: 10.48550/arxiv.2501.00208
ISSN: 2331-8422
Publisher: Cornell University; Ithaca, New York
Language: English
Date posted: 12/30/2024
Academic Unit: Electrical and Computer Engineering; IIHR--Hydroscience and Engineering; Injury Prevention Research Center
Record Identifier: 9984770787202771
