Can Multimodal Large Language Models Diagnose Diabetic Retinopathy from Fundus Photos? A Quantitative Evaluation
Journal article   Open access   Peer reviewed

Jesse A. Most, Evan H. Walker, Nehal N. Mehta, Ines D. Nagel, Jimmy S. Chen, Jonathan F. Russell, Nathan L. Scott and Shyamanga Borooah
Ophthalmology science (Online), Vol.6(1), 100911
01/2026
DOI: 10.1016/j.xops.2025.100911
PMCID: PMC12478077
PMID: 41030829
URL: https://doi.org/10.1016/j.xops.2025.100911
Published (Version of record) Open Access

Abstract

Objective: To evaluate the diagnostic accuracy of four multimodal large language models (MLLMs) in detecting and grading diabetic retinopathy (DR) using their new image analysis features.

Design: Single-center retrospective study.

Subjects: Patients diagnosed with pre-diabetes and diabetes.

Methods: Ultrawide field (UWF) fundus images from patients seen at the University of California San Diego were graded for DR severity by three retina specialists using the Early Treatment Diabetic Retinopathy Study (ETDRS) classification system to establish ground truth. Four MLLMs (ChatGPT-4o, Claude 3.5 Sonnet, Google Gemini 1.5 Pro, and Perplexity Llama 3.1 Sonar/Default) were tested using four distinct prompts. These assessed multiple choice disease diagnosis, binary disease classification, and disease severity. MLLMs were assessed for accuracy, sensitivity, and specificity in identifying the presence or absence of DR, and relative disease severity.

Main Outcome Measures: Accuracy, sensitivity, and specificity of diagnosis.

Results: A total of 309 eyes from 188 patients were included in the study. Average patient age was 58.7 (56.7, 60.7) years, with 55.3% being female. After specialist grading, 70.2% of eyes had DR of varying severity and 29.8% had no DR. For disease identification with multiple choices provided, Claude and ChatGPT scored significantly higher (P < 0.0006, per Bonferroni correction) than other MLLMs for accuracy (0.608, 0.566) and sensitivity (0.618, 0.641). In binary DR versus No DR classification, accuracy was highest for ChatGPT (0.644) and Perplexity (0.602). Sensitivity varied [ChatGPT (0.539), Perplexity (0.488), Claude (0.179), and Gemini (0.042)], while specificity for all models was relatively high (range: 0.870 to 0.989). For the DR severity prompt with the best overall results (Prompt 3.1), no significant differences between models were found in accuracy [Perplexity (0.411), ChatGPT (0.395), Gemini (0.392), Claude (0.314)]. All models demonstrated low sensitivity [Perplexity (0.247), ChatGPT (0.229), Gemini (0.224), Claude (0.184)]. Specificity ranged from 0.840 to 0.866.

Conclusion: MLLMs are powerful tools which may eventually assist retinal image analysis. Currently, however, there is variability in the accuracy of image analysis, and diagnostic performance falls short of clinical standards for safe implementation in diabetic retinopathy diagnosis and grading. Further training and optimization of common errors may enhance their clinical utility.
Keywords: Diabetic retinopathy; Ultra-widefield fundus photography; Multimodal large language model; Artificial intelligence; Image analysis
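
For readers less familiar with the three outcome measures named in the abstract, the minimal Python sketch below shows how accuracy, sensitivity, and specificity are conventionally derived from a 2x2 confusion table for a binary DR versus No DR task. It is an illustration only and not the authors' analysis code: the class `BinaryCounts`, the function `diagnostic_metrics`, and the example counts are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryCounts:
    """Confusion-table counts for a binary DR vs. No DR task (hypothetical helper)."""
    tp: int  # eyes with DR that the model labeled as DR
    fn: int  # eyes with DR that the model labeled as No DR
    fp: int  # eyes without DR that the model labeled as DR
    tn: int  # eyes without DR that the model labeled as No DR


def diagnostic_metrics(c: BinaryCounts) -> dict:
    """Return accuracy, sensitivity, and specificity using their standard definitions."""
    total = c.tp + c.fn + c.fp + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total,      # all correct calls / all eyes
        "sensitivity": c.tp / (c.tp + c.fn),    # true positives / all eyes with DR
        "specificity": c.tn / (c.tn + c.fp),    # true negatives / all eyes without DR
    }


# Illustrative counts only (NOT data from the study).
example = BinaryCounts(tp=40, fn=60, fp=5, tn=95)
metrics = diagnostic_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, a model that rarely labels an eye as DR shows high specificity and low sensitivity, which is the general pattern the abstract reports for the binary classification prompt.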
