Developing a generic scorer for practice writing tests of statewide assessment essays with natural language processing transfer learning techniques

Yi Gui

doi:10.25820/etd.007612

Dissertation

Open access

University of Iowa

Doctor of Philosophy (PhD), University of Iowa

Summer 2024

Files and links (1)

PDF: YiGui_0715_final_rename (5.78 MB)

Free to read and download, Open Access

Abstract

This study explores using transfer learning for natural language processing (NLP) to build generic automated essay scoring (AES) models that provide instant online scoring for statewide writing assessments in K-12 education. The goal is an instant online scorer that generalizes to any prompt, addressing a current limitation of online writing practice tests for operational assessments, such as those for the statewide writing assessment (SWAS). The study leverages Google’s BERT, a state-of-the-art NLP transfer learning model, to train generic essay scoring models that are grounded statistically in ordinal logistic regression (OLR). Three groups were analyzed: a control group with no additional pre-training, a group further pre-trained on 12,970 ASAP essays (in-domain materials), and a group further pre-trained on 500 SWAS essays (within-task materials). Models were trained with 9th- and 11th-grade SWAS essays and evaluated on 10th-grade essays. Evaluation metrics included Quadratic Weighted Kappa (QWK), Mean Absolute Error (MAE), accuracy, precision, recall, and F1-score. Results indicated that further pre-training does not necessarily enhance scoring performance: the control group often matched or exceeded the SWAS pre-trained group. Across all three groups, the scoring patterns for the Language Use trait were consistent, while the ASAP pre-trained group excelled in scoring the Prompt Task trait. These findings highlight the importance of the quality of the pre-training materials and indicate that further pre-training does not necessarily improve model performance for distinct downstream tasks. Future research should use balanced datasets with more essay prompts and explore different experimental designs to identify the conditions under which further pre-training improves AES model performance.
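
For readers unfamiliar with the agreement metrics named in the abstract, the following is a minimal sketch of how Quadratic Weighted Kappa and Mean Absolute Error are commonly computed between human and model scores. It is not taken from the dissertation: the function names, the score range, and the sample scores are hypothetical and serve only as illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    k = max_score - min_score + 1  # number of score categories
    if k < 2:
        raise ValueError("need at least two score categories")
    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model)
    hist_h = Counter(human)                # marginal counts, human scores
    hist_m = Counter(model)                # marginal counts, model scores
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[(i, j)]
            den += weight * hist_h[i] * hist_m[j] / n
    if den == 0.0:
        # Kappa is undefined when every score is the same value;
        # returning 1.0 here is a convention assumed for this sketch.
        return 1.0
    return 1.0 - num / den


def mean_absolute_error(human, model):
    """Average absolute difference between paired scores."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)


if __name__ == "__main__":
    # Hypothetical scores on an assumed 1-4 rubric.
    human_scores = [1, 2, 3, 4]
    model_scores = [1, 3, 3, 2]
    print(quadratic_weighted_kappa(human_scores, model_scores, 1, 4))
    print(mean_absolute_error(human_scores, model_scores))
```

QWK penalizes a disagreement by the square of its distance on the score scale, so a model that is off by two points is penalized four times as heavily as one that is off by one. That property is why it is widely used for ordinal essay scores alongside plain accuracy.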

Machine Learning

Educational Evaluation

artificial intelligence

Automated Essay Scoring

BERT

natural language processing

transfer learning

Details

Title: Developing a generic scorer for practice writing tests of statewide assessment essays with natural language processing transfer learning techniques
Creators: Yi Gui
Contributors: Catherine Welch (Advisor)
Deborah Harris (Advisor)
Stephen Dunbar (Committee Member)
Chao Wang (Committee Member)
Resource Type: Dissertation
Degree Awarded: Doctor of Philosophy (PhD), University of Iowa
Degree in: Psychological and Quantitative Foundations (Educational Measurement and Statistics)
Date degree season: Summer 2024
Publisher: University of Iowa
DOI: 10.25820/etd.007612
Number of pages: ix, 143 pages
Language: English
Date submitted: 07/19/2024
Description illustrations: illustrations, tables, graphs
Description bibliographic: Includes bibliographical references (pages 132-143).
Public Abstract (ETD): This study focuses on creating a new tool for grading student essays automatically, aiming to provide instant scoring for practice tests in K-12 education. Traditionally, grading essays for state assessments takes a long time because a large number of essays are needed to train automated scoring systems accurately. This project uses advanced technology, including Google’s BERT, to build a more generalizable essay scoring tool that can work on essays written for any prompt. Three different approaches were tested: one with no extra training, one with further training using a very large set of essays (ASAP), and one with further training using a smaller set of state assessment essays (SWAS). These models were evaluated based on how accurately they could grade essays from students who were not part of the initial training. The results showed that extra training did not always improve the models. Surprisingly, the model without any extra training often performed as well as, if not better than, the other two. The essay trait related to Language Use was graded most consistently across all models. The model trained with ASAP essays excelled in grading how well an essay completes the task of its prompt, but the model trained with SWAS essays did not perform as well. This research highlights that the quality of the training material is crucial. Future studies should use more balanced datasets with a variety of prompts to further test and improve these automated grading systems. The ultimate goal of the study is to provide a reliable, instant scoring tool for students’ writing practice tests, enhancing their learning experience.
Academic Unit: Psychological and Quantitative Foundations
Record Identifier: 9984698053702771
