Using data preprocessing techniques and machine learning algorithms to explore predictors of word difficulty in English language assessment

Mingying Zheng

doi:10.25820/etd.007627

Dissertation

Open access

University of Iowa

Doctor of Philosophy (PhD), University of Iowa

Summer 2024

Files and links (1)

pdf

Submittable_Zheng_Dissertation_July05_2024 (1.96 MB)

Free to read and download, Open Access

Abstract

The digital transformation of educational assessment has produced large-scale data that offer new opportunities to improve language learning and testing through machine learning (ML) techniques. Drawing on data generated by online English language assessments, this dissertation investigates the efficacy of data preprocessing techniques and their impact on the performance of ten machine learning classifiers. Two preprocessing sequences were examined to improve data quality for supervised machine learning in English language assessment: Form A (data cleaning, data transformation, then data reduction) and Form B (data cleaning, data reduction, then data transformation). The study evaluated the accuracy, precision, recall, F1-score, and AUC of the ten classifiers in predicting word difficulty, using a dataset from large-scale English language assessments involving 3,918 test takers and 6,599 words characterized by 38 lexical and form-related features. Particular attention was given to eXtreme Gradient Boosting (XGB), Decision Tree, and Random Forest and to their capacity to generalize to new, unseen structured data. The results indicate that both preprocessing sequences improve classifier performance comparably, suggesting that the choice between the two sequences may depend on other factors, such as computational resources and desired interpretability. Among the ten classifiers, XGB consistently outperformed the others, indicating its robustness and suitability for processing large-scale educational data.
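As an illustration of four of the evaluation metrics named in the abstract, the following minimal sketch computes accuracy, precision, recall, and F1-score from predicted and true labels. It assumes a binary "difficult" (1) versus "not difficult" (0) labelling, which the abstract does not specify; the function name `binary_metrics` and the toy labels are hypothetical, and AUC is omitted because it requires predicted scores rather than hard labels.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for eight words (assumption: 1 = difficult, 0 = not difficult).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
```

Reporting precision and recall alongside accuracy matters when the classes are imbalanced, because accuracy alone can look high for a classifier that rarely predicts the minority class.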
A significant contribution of this study lies in identifying key lexical features that are predictive of word difficulty: word frequency, average lexical decision accuracy of all participants for a given word, standardized lexical decision accuracy reaction time across all participants for a given word, reported age of acquisition score, neighbors determined using phonological Levenshtein distance, raw corpus frequency, and dispersion for a given word. These findings matter for English as a second language (ESL) educational contexts, where they can inform the development of more effective teaching materials and assessments. The study advances educational data analytics by examining the intersection of data preprocessing and machine learning, and it lays the groundwork for future research to refine these approaches in language assessment.
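One of the listed features relies on phonological Levenshtein distance. The sketch below shows the standard Levenshtein (edit) distance applied to sequences of phoneme symbols rather than letters. It is only an illustration of the distance itself: how the dissertation derives its neighbor feature from these distances is not stated in the abstract, and the phoneme transcriptions used here are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the first i-1 items of a and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from the first i items of a to an empty sequence
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y (or match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical phoneme sequences (assumption: ARPAbet-like symbols).
cat = ["K", "AE", "T"]
bat = ["B", "AE", "T"]
cats = ["K", "AE", "T", "S"]

d_cat_bat = levenshtein(cat, bat)    # one substitution
d_cat_cats = levenshtein(cat, cats)  # one insertion
```

Because the function compares sequence items rather than characters, multi-character phoneme symbols such as "AE" count as a single unit, which is what distinguishes a phonological distance from an orthographic one.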

Machine Learning

Data preprocessing

Data quality

Educational assessment

Supervised machine learning algorithms

Word difficulty

Details

Title: Using data preprocessing techniques and machine learning algorithms to explore predictors of word difficulty in English language assessment
Creators: Mingying Zheng
Contributors: Jonathan Templin (Advisor)
Aloe Ariel (Committee Member)
Lesa Hoffman (Committee Member)
Wan-Chan Lee (Committee Member)
Resource Type: Dissertation
Degree Awarded: Doctor of Philosophy (PhD), University of Iowa
Degree in: Psychological and Quantitative Foundations (Educational Measurement and Statistics)
Date degree season: Summer 2024
Publisher: University of Iowa
DOI: 10.25820/etd.007627
Number of pages: xii, 123 pages
Language: English
Date submitted: 07/05/2024
Description illustrations: illustrations, tables, graphs
Description bibliographic: Includes bibliographical references (pages 102-112).
Public Abstract (ETD): In our digital age, English language tests are increasingly moving online, generating vast amounts of data. This study delves into the preprocessing of this raw data for further analyses, which is crucial for understanding and improving these tests. Imagine tidying up a room before you can appreciate its full potential; similarly, data must be cleaned and organized. This research explored two methods of doing so, one focusing on thorough cleaning and detailed organization, while the other took a more streamlined approach. The goal was to determine which method better enhances the data’s usefulness. Then, like picking the right tools to extract insights from the data, the study examined different automated learning systems (think of them as smart, self-learning algorithms) to find out which could most accurately predict the difficulty of English words for learners. The three top-performing systems were identified: one that showed consistent accuracy, and two that excelled at identifying correct answers. Furthermore, the study identified the most telling features—like word frequency and reported age of learning a given word—that indicate a word’s difficulty level. These insights could aid in designing better learning materials. In summary, this research showed that careful preparation of data leads to more accurate analyses, regardless of the method used. It also sheds light on how to effectively get data ready for study and which tools are best for extracting valuable insights. This helps not only in understanding language learning better but also in designing teaching tools that can adapt to different learners’ needs.
Academic Unit: Psychological and Quantitative Foundations
Record Identifier: 9984698250502771
