Evaluating Rater Effects of Large Language Models in Automated Essay Scoring: GPT, Claude, Gemini, and DeepSeek
Journal article   Peer reviewed

Hong Jiao, Dan Song and Won‐Chan Lee
Educational Measurement: Issues and Practice, Vol. 45(2), e70018
06/2026
DOI: 10.1111/emip.70018

Abstract

Large language models (LLMs) have been widely explored for automated scoring in educational assessment to facilitate learning and instruction. However, empirical evidence regarding which LLMs produce the most reliable scores and introduce the fewest rater effects remains limited. This study compared 10 LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. Their performance was evaluated in terms of score accuracy, intra‐rater consistency, and rater effects estimated using the Many‐Facet Rasch model. Although the results generally supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, which showed high scoring accuracy, better intra‐rater consistency, and fewer rater effects, the study is not intended to support substantive comparisons or rankings of LLMs or to identify a single "best" model, given the small sample size.
