Dissertation
Using data preprocessing techniques and machine learning algorithms to explore predictors of word difficulty in English language assessment
University of Iowa
Doctor of Philosophy (PhD), University of Iowa
Summer 2024
DOI: 10.25820/etd.007627
Abstract
The digital transformation of educational assessment has led to a proliferation of large-scale data, offering unprecedented opportunities to enhance language learning and testing through machine learning (ML) techniques. Drawing on the extensive data generated by online English language assessments, this dissertation investigates the efficacy of data preprocessing techniques and their impact on the performance of ten machine learning classifiers. Two preprocessing sequences were examined, Form A (data cleaning, data transformation, then data reduction) and Form B (data cleaning, data reduction, then data transformation), with the aim of improving data quality for supervised machine learning algorithms applied to English language assessments.
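The difference between the two sequences can be illustrated with a minimal sketch. The abstract does not say which cleaning, transformation, or reduction techniques were used, so the three steps below (dropping rows with missing values, min-max scaling, and a variance threshold) are assumptions chosen only for illustration, as are the function names and the toy data.

```python
from statistics import pvariance

def clean(rows):
    """Data cleaning (assumed here: drop rows containing a missing value)."""
    return [r for r in rows if all(v is not None for v in r)]

def transform(rows):
    """Data transformation (assumed here: min-max scale each column to [0, 1])."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(r) for r in zip(*scaled_cols)]

def reduce_features(rows, min_variance=0.05):
    """Data reduction (assumed here: drop columns whose variance is below a threshold)."""
    kept = [col for col in zip(*rows) if pvariance(col) >= min_variance]
    return [list(r) for r in zip(*kept)]

def form_a(rows):
    # Form A: cleaning, then transformation, then reduction.
    return reduce_features(transform(clean(rows)))

def form_b(rows):
    # Form B: cleaning, then reduction, then transformation.
    return transform(reduce_features(clean(rows)))

# Hypothetical toy data: three numeric features, one row with a missing value.
rows = [
    [1.0, 100.0, 0.50],
    [2.0, 300.0, 0.51],
    [None, 200.0, 0.52],
    [3.0, 500.0, 0.49],
    [4.0, 400.0, 0.50],
]

print(len(form_a(rows)[0]), "features kept by Form A")
print(len(form_b(rows)[0]), "features kept by Form B")
```

In this toy case the third feature has a tiny raw variance, so it survives only when scaling happens before reduction. This shows why the order of steps is worth testing; it says nothing about the dissertation's actual data, where the two sequences performed comparably.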
The study evaluated ten machine learning classifiers on accuracy, precision, recall, F1-score, and AUC for their ability to predict word difficulty. The dataset came from large-scale English language assessments and covered 3,918 test takers and 6,599 words characterized by 38 lexical and form-related features. Particular attention was given to eXtreme Gradient Boosting (XGB), Decision Tree, and Random Forest, and to their capacity to generalize to new, unseen structured data.
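For readers unfamiliar with the five metrics named above, the following sketch computes them from first principles. It assumes a binary framing (1 = difficult word, 0 = easy word), which the abstract does not confirm. The labels, scores, and the 0.5 decision threshold are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Requires at least one positive and one negative.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores for eight words.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC is computed from the raw scores and summarizes ranking quality across all thresholds.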
The results indicate that both preprocessing sequences enhance supervised classifier performance comparably, suggesting that the choice between the two sequences may depend on other factors, such as computational resources and desired interpretability. Among the ten classifiers, XGB consistently outperformed the others, indicating its robustness and suitability for processing large-scale educational data.
A significant contribution of this research lies in identifying key lexical features that are predictive of word difficulty: word frequency, average lexical decision accuracy of all participants for a given word, standardized lexical decision accuracy reaction time across all participants for a given word, reported age of acquisition score, neighbors determined using phonological Levenshtein distance, raw corpus frequency, and dispersion for a given word. These findings are critical for English as a second language (ESL) educational contexts, where they can inform the development of more effective teaching materials and assessments.
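One of the features listed above relies on phonological Levenshtein distance, that is, edit distance computed over phoneme sequences rather than letters. The abstract does not describe how the neighbor measure was derived, so the sketch below shows only the underlying distance, plus a hypothetical helper that averages the distance to a word's closest neighbors. The tiny lexicon and the choice of `k` are illustrative assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def mean_distance_to_closest(word, lexicon, k):
    """Hypothetical helper: mean distance from `word` to its k closest neighbors."""
    distances = sorted(levenshtein(lexicon[word], phonemes)
                       for other, phonemes in lexicon.items() if other != word)
    closest = distances[:k]
    return sum(closest) / len(closest)

# Illustrative phoneme transcriptions (ARPAbet-style symbols, stress omitted).
lexicon = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "cast": ["K", "AE", "S", "T"],
    "strength": ["S", "T", "R", "EH", "NG", "K", "TH"],
}

print(levenshtein(lexicon["cat"], lexicon["bat"]))
print(mean_distance_to_closest("cat", lexicon, k=2))
print(mean_distance_to_closest("strength", lexicon, k=2))
```

A word with many close phonological neighbors gets a low mean distance, while a phonologically unusual word gets a high one.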
This study not only advances the field of educational data analytics by exploring the intersection of data preprocessing and machine learning but also lays the groundwork for future research to further refine these approaches in the context of language assessment.
Details
- Title: Subtitle
- Using data preprocessing techniques and machine learning algorithms to explore predictors of word difficulty in English language assessment
- Creators
- Mingying Zheng
- Contributors
- Jonathan Templin (Advisor); Aloe Ariel (Committee Member); Lesa Hoffman (Committee Member); Wan-Chan Lee (Committee Member)
- Resource Type
- Dissertation
- Degree Awarded
- Doctor of Philosophy (PhD), University of Iowa
- Degree in
- Psychological and Quantitative Foundations (Educational Measurement and Statistics)
- Date degree season
- Summer 2024
- Publisher
- University of Iowa
- DOI
- 10.25820/etd.007627
- Number of pages
- xii, 123 pages
- Copyright
- Copyright 2024 Mingying Zheng
- Language
- English
- Date submitted
- 07/05/2024
- Description illustrations
- illustrations, tables, graphs
- Description bibliographic
- Includes bibliographical references (pages 102-112).
- Public Abstract (ETD)
- In our digital age, English language tests are increasingly moving online, generating vast amounts of data. This study examines how this raw data is prepared for further analysis, a step that is crucial for understanding and improving these tests. Imagine tidying up a room before you can appreciate its full potential; similarly, data must be cleaned and organized before it can be used. This research explored two methods of doing so, one focusing on thorough cleaning and detailed organization, the other taking a more streamlined approach. The goal was to determine which method better enhances the data’s usefulness. Then, like picking the right tools to extract insights from the data, the study examined different automated learning systems (think of them as smart, self-learning algorithms) to find out which could most accurately predict the difficulty of English words for learners. Three top-performing systems were identified: one that showed consistent accuracy, and two that excelled at identifying correct answers. Furthermore, the study identified the most telling features (such as word frequency and the reported age at which a given word is learned) that indicate a word’s difficulty level. These insights could aid in designing better learning materials. In summary, this research showed that careful preparation of data leads to more accurate analyses, regardless of the method used. It also sheds light on how to get data ready for study effectively and which tools are best for extracting valuable insights. This helps not only in understanding language learning better but also in designing teaching tools that can adapt to different learners’ needs.
- Academic Unit
- Psychological and Quantitative Foundations
- Record Identifier
- 9984698250502771