Evaluating Rater Effects of Large Language Models in Automated Essay Scoring: GPT, Claude, Gemini, and DeepSeek
Journal article   Peer reviewed

Hong Jiao, Dan Song and Won‐Chan Lee
Educational Measurement: Issues and Practice, Vol. 45(2), e70018
06/2026
DOI: 10.1111/emip.70018

Abstract

Large language models (LLMs) have been widely explored for automated scoring in educational assessment to facilitate learning and instruction. However, empirical evidence regarding which LLMs produce the most reliable scores and introduce the fewest rater effects remains limited. This study compared 10 LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. Their performance was evaluated in terms of score accuracy, intra‐rater consistency, and rater effects estimated using the Many‐Facet Rasch model. Although the results generally supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, which showed high scoring accuracy, better intra‐rater consistency, and fewer rater effects, the study is not intended to support substantive comparisons or rankings of LLMs or to identify a single "best" model, given the small sample size.
