Journal article
Can Multimodal Large Language Models Diagnose Diabetic Retinopathy from Fundus Photos? A Quantitative Evaluation
Ophthalmology science (Online), Vol.6(1), 100911
01/2026
DOI: 10.1016/j.xops.2025.100911
PMCID: PMC12478077
PMID: 41030829
Abstract
Objective
To evaluate the diagnostic accuracy of four multimodal large language models (MLLMs) in detecting and grading diabetic retinopathy (DR) using their new image analysis features.
Design
Single-center retrospective study
Subjects
Patients diagnosed with pre-diabetes and diabetes
Methods
Ultrawide field (UWF) fundus images from patients seen at the University of California San Diego were graded for DR severity by three retina specialists using the Early Treatment Diabetic Retinopathy Study (ETDRS) classification system to establish ground truth. Four MLLMs (ChatGPT-4o, Claude 3.5 Sonnet, Google Gemini 1.5 Pro, and Perplexity Llama 3.1 Sonar/Default) were tested using four distinct prompts. The prompts assessed multiple-choice disease diagnosis, binary disease classification, and disease severity. MLLMs were assessed for accuracy, sensitivity, and specificity in identifying the presence or absence of DR and in grading relative disease severity.
Main Outcome Measures
Accuracy, sensitivity, and specificity of diagnosis
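As an illustration of how these three measures are conventionally defined for a binary task such as DR versus No DR, the following minimal sketch computes them from paired labels. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the study; the article's own analysis code is not part of this record.

```python
def binary_metrics(truth, pred):
    """Return accuracy, sensitivity, and specificity for paired binary labels.

    truth and pred are equal-length sequences of booleans (True = DR present).
    This is a generic textbook definition, not the study's actual pipeline.
    """
    if len(truth) != len(pred) or len(truth) == 0:
        raise ValueError("truth and pred must be non-empty and of equal length")

    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if t and p)          # DR present, called DR
    tn = sum(1 for t, p in pairs if not t and not p)  # no DR, called no DR
    fp = sum(1 for t, p in pairs if not t and p)      # no DR, called DR
    fn = sum(1 for t, p in pairs if t and not p)      # DR present, missed

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of true DR eyes that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of eyes without DR that were correctly cleared.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical toy labels for six eyes (not study data).
toy_truth = [True, True, True, True, False, False]
toy_pred = [True, True, False, False, False, True]
toy = binary_metrics(toy_truth, toy_pred)
```

With these toy labels, two of four DR eyes are detected and one of two eyes without DR is correctly cleared, so all three measures equal 0.5.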
Results
A total of 309 eyes from 188 patients were included in the study. Average patient age was 58.7 (56.7, 60.7) years, and 55.3% of patients were female. After specialist grading, 70.2% of eyes had DR of varying severity and 29.8% had no DR. For disease identification with multiple choices provided, Claude and ChatGPT scored significantly higher (P < 0.0006, per Bonferroni correction) than the other MLLMs for accuracy (0.608, 0.566) and sensitivity (0.618, 0.641). In binary DR versus No DR classification, accuracy was highest for ChatGPT (0.644) and Perplexity (0.602). Sensitivity varied [ChatGPT (0.539), Perplexity (0.488), Claude (0.179), and Gemini (0.042)], while specificity for all models was relatively high (range: 0.870 to 0.989). For the DR severity prompt with the best overall results (Prompt 3.1), no significant differences between models were found in accuracy [Perplexity (0.411), ChatGPT (0.395), Gemini (0.392), Claude (0.314)]. All models demonstrated low sensitivity [Perplexity (0.247), ChatGPT (0.229), Gemini (0.224), Claude (0.184)]. Specificity ranged from 0.840 to 0.866.
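The binary-classification figures above can be cross-checked with the identity accuracy = sensitivity × prevalence + specificity × (1 − prevalence). The sketch below solves that identity for specificity, assuming that all three measures were computed per eye over the same 309 eyes with a DR prevalence of 70.2%. That pooling assumption and the helper name `implied_specificity` are not stated in the abstract, and the inputs are rounded, so the outputs are approximate consistency checks rather than reported results.

```python
def implied_specificity(accuracy, sensitivity, prevalence):
    """Solve accuracy = sens * prev + spec * (1 - prev) for spec."""
    return (accuracy - sensitivity * prevalence) / (1.0 - prevalence)


# Share of eyes graded as having DR, from the abstract.
PREVALENCE = 0.702

# (accuracy, sensitivity) for the binary DR vs. No DR prompt, from the abstract.
reported = {
    "ChatGPT": (0.644, 0.539),
    "Perplexity": (0.602, 0.488),
}

# Assumption: each pair was computed over the same set of eyes.
implied = {
    name: implied_specificity(acc, sens, PREVALENCE)
    for name, (acc, sens) in reported.items()
}

for name, spec in implied.items():
    print(f"{name}: implied specificity of about {spec:.3f}")
```

Under these assumptions the implied specificities come out near 0.891 for ChatGPT and 0.871 for Perplexity, both inside the reported 0.870 to 0.989 range.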
Conclusion
MLLMs are powerful tools that may eventually assist retinal image analysis. Currently, however, the accuracy of their image analysis is variable, and diagnostic performance falls short of clinical standards for safe implementation in diabetic retinopathy diagnosis and grading. Further training and correction of common errors may enhance their clinical utility.
Details
- Title: Subtitle
- Can Multimodal Large Language Models Diagnose Diabetic Retinopathy from Fundus Photos? A Quantitative Evaluation
- Creators
- Jesse A. Most - University of California San Diego
- Evan H. Walker - University of California San Diego
- Nehal N. Mehta - University of California San Diego
- Ines D. Nagel - University of California San Diego
- Jimmy S. Chen - University of California San Diego
- Jonathan F. Russell - University of Iowa
- Nathan L. Scott - University of California San Diego
- Shyamanga Borooah - University of California San Diego
- Resource Type
- Journal article
- Publication Details
- Ophthalmology science (Online), Vol.6(1), 100911
- DOI
- 10.1016/j.xops.2025.100911
- PMID
- 41030829
- PMCID
- PMC12478077
- NLM abbreviation
- Ophthalmol Sci
- ISSN
- 2666-9145
- eISSN
- 2666-9145
- Publisher
- ELSEVIER
- Language
- English
- Electronic publication date
- 08/2025
- Date published
- 01/2026
- Academic Unit
- Ophthalmology and Visual Sciences
- Record Identifier
- 9984946696602771