Journal article
An Empirical Evaluation of Large Language Models on Consumer Health Questions
BioMedInformatics, Vol.5(1), p.12
02/27/2025
DOI: 10.3390/biomedinformatics5010012
Abstract
Background: Large Language Models (LLMs) have demonstrated strong performance on clinical question-answering (QA) benchmarks, yet their effectiveness in addressing real-world consumer medical queries remains underexplored. This study evaluates the capabilities and limitations of LLMs in answering consumer health questions using the MedRedQA dataset, which consists of medical questions and answers by verified experts from the AskDocs subreddit. Methods: Five LLMs (GPT-4o mini, Llama 3.1-70B, Mistral-123B, Mistral-7B, and Gemini-Flash) were assessed using a cross-evaluation framework: each model generated responses to the consumer queries, and each model's outputs were then judged by all five models against the expert responses. Human evaluation was used to assess the reliability of the models as evaluators. Results: GPT-4o mini achieved the highest alignment with expert responses according to four of the five model judges, while Mistral-7B scored lowest according to three of the five. Overall, model responses showed low alignment with expert responses. Conclusions: Current small- and medium-sized LLMs struggle to provide accurate answers to consumer health questions and require significant improvement.
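The cross-evaluation framework described in the abstract can be sketched as follows. This is an illustrative outline, not the authors' code: the model names, the `generate` and `judge` callables, and the toy scoring function are hypothetical placeholders standing in for real model API calls.

```python
def cross_evaluate(models, questions, expert_answers, generate, judge):
    """Return scores[responder][evaluator] = mean alignment with expert answers.

    Sketch of the cross-evaluation setup: every model answers every question,
    and every model then judges each answer against the expert reference.
    """
    # Each responder model answers all consumer questions.
    answers = {m: [generate(m, q) for q in questions] for m in models}
    scores = {}
    for responder in models:
        scores[responder] = {}
        for evaluator in models:
            # Every model (including the responder itself) rates each
            # answer against the verified expert response.
            ratings = [
                judge(evaluator, ans, ref)
                for ans, ref in zip(answers[responder], expert_answers)
            ]
            scores[responder][evaluator] = sum(ratings) / len(ratings)
    return scores


# Toy stand-ins so the sketch runs; a real pipeline would call model APIs
# and prompt the evaluator model to rate alignment.
def toy_generate(model, question):
    return f"{model}: consult a clinician about {question}"


def toy_judge(evaluator, answer, reference):
    # Hypothetical judge: fraction of reference words present in the answer.
    ref_words = reference.lower().split()
    hits = sum(w in answer.lower() for w in ref_words)
    return hits / len(ref_words)


models = ["GPT-4o mini", "Llama 3.1-70B", "Mistral-7B"]
questions = ["this rash"]
experts = ["Please consult a clinician about this rash."]
scores = cross_evaluate(models, questions, experts, toy_generate, toy_judge)
```

The result is a square score matrix over (responder, evaluator) pairs, which is how claims like "highest alignment according to four of the five model judges" can be read off.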
Details
- Title
- An Empirical Evaluation of Large Language Models on Consumer Health Questions
- Creators
- Moaiz Abrar (University of Iowa, IIHR Hydroscience & Engineering, Iowa City, IA 52246, USA); Yusuf Sermet (University of Iowa); Ibrahim Demir (Tulane University)
- Resource Type
- Journal article
- Publication Details
- BioMedInformatics, Vol.5(1), p.12
- DOI
- 10.3390/biomedinformatics5010012
- ISSN
- 2673-7426
- eISSN
- 2673-7426
- Publisher
- MDPI
- Number of pages
- 16
- Language
- English
- Date published
- 02/27/2025
- Academic Unit
- Electrical and Computer Engineering; Civil and Environmental Engineering; IIHR--Hydroscience and Engineering; Injury Prevention Research Center
- Record Identifier
- 9985132189502771