Performance of Multimodal Generative AI Models in Addressing Complex Dental Inquiries With Text, Images, and Analytical Data

Hang-Nga Mai; Du-Hyeong Lee; Jekita Kaenploy; Jong-Eun Kim; Seok-Hwan Cho

doi:10.1111/jerd.70064

Journal article

Open access

Peer reviewed

Journal of esthetic and restorative dentistry, Vol.38(1), pp.166-172

01/2026

DOI: 10.1111/jerd.70064

PMID: 41287924

Files and links (1)

https://doi.org/10.1111/jerd.70064

Published (Version of record) Open Access

Abstract

Multimodal large language models (LLMs) have the potential to transform dental learning and decision-making by addressing multimodal dental inquiries that integrate text, images, and analytical data. The purpose of this study was to evaluate the performance of various multimodal LLMs in responding to multimodal dental queries and to identify factors influencing their performance. Four multimodal LLMs (ChatGPT-4V, Claude 3 Sonnet, Microsoft 365 Copilot 2024, and Google Gemini 1.5 Pro) were evaluated based on their correct answers and passing margin for the Integrated National Board Dental Examination (INBDE) and the Advanced Dental Admission Test (ADAT). Descriptive statistics, χ² tests, Cohen's κ, Kruskal-Wallis tests, and Mann-Whitney U tests were used to analyze the performance across different question types, independent inputs, and picture types (α = 0.05). Claude 3 Sonnet outperformed the other models in both INBDE and ADAT exams, achieving the highest accuracy, followed by ChatGPT-4V, Microsoft 365 Copilot 2024, and Google Gemini 1.5 Pro. χ² tests revealed significant differences between chatbots in the ADAT exam, but not in the INBDE. Cohen's κ showed weak to moderate model agreement for INBDE and stronger agreement for ADAT, with the highest agreement between Claude 3 Sonnet and ChatGPT-4V (κ = 0.757) and the lowest between Google Gemini 1.5 Pro and Microsoft 365 Copilot 2024 (κ = 0.059). Model performance was influenced by question type (theoretical and clinical), with common errors including misinterpreting clinical scenarios, visual data difficulties, and dental terminology ambiguities. Multimodal LLMs show potential in answering multimodal dental inquiries, though performance varies across models, with challenges in interpreting clinical scenarios, visual data, and terminology ambiguity. Large language models can be applied not only to memorization-type but also interpretation and problem-solving cognitive questions in dentistry. To maximize the utility of these artificial intelligence models, users need both an understanding of their differences and the ability to manage complex clinical data.

exam

performance

dental inquiry

large language model

generative artificial intelligence

Details

Title: Performance of Multimodal Generative AI Models in Addressing Complex Dental Inquiries With Text, Images, and Analytical Data
Creators: Hang-Nga Mai - Kyungpook National University
Du-Hyeong Lee - University of Iowa
Jekita Kaenploy - University of Oklahoma Health Sciences Center
Jong-Eun Kim - Yonsei University College of Dentistry
Seok-Hwan Cho - University of Iowa, Prosthodontics
Resource Type: Journal article
Publication Details: Journal of esthetic and restorative dentistry, Vol.38(1), pp.166-172
DOI: 10.1111/jerd.70064
PMID: 41287924
NLM abbreviation: J Esthet Restor Dent
ISSN: 1708-8240
eISSN: 1708-8240
Publisher: Wiley
Language: English
Electronic publication date: 11/25/2025
Date published: 01/2026
Academic Unit: Prosthodontics
Record Identifier: 9985034032802771
