Model selection for IRT equating of Testlet-based tests in the random groups design

Juan Chen

doi:10.17077/etd.9yfctk57


Dissertation

Open access


University of Iowa

Doctor of Philosophy (PhD), University of Iowa

Autumn 2014



Abstract

The use of testlets in a test can cause multidimensionality and local item dependence (LID), which can result in inaccurate estimation of item parameters and, in turn, compromise the quality of item response theory (IRT) true and observed score equating of testlet-based tests. Both unidimensional and multidimensional IRT models have been developed to control local item dependence caused by testlets. The purposes of the current study were to (1) investigate how different levels of LID affect IRT true and observed score equating of testlet-based tests when the traditional three-parameter logistic (3PL) IRT model is used for calibration, and (2) compare the performance of four IRT models, namely the 3PL IRT model, the graded response model (GRM), the testlet response theory (TRT) model, and the bifactor model, in IRT true and observed score equating of testlet-based tests with various levels of local item dependence.
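To illustrate how a testlet can induce LID, here is a minimal Python sketch (not the dissertation's code) that generates dichotomous responses under one common form of the 3PL TRT model, P = c + (1 - c) / (1 + exp(-a(theta - b - gamma))), where gamma is an examinee-specific effect shared by all items in the same testlet. The function name, the item parameter distributions, and the absence of a 1.7 scaling constant are assumptions made for illustration. A testlet effect variance of 0 reduces the model to the ordinary 3PL.

```python
import math
import random


def simulate_trt_3pl(n_examinees, n_testlets, testlet_length, testlet_var, seed=0):
    """Simulate 0/1 responses under a 3PL testlet response model (sketch).

    testlet_var is the variance of the examinee-by-testlet effect gamma;
    larger values mean stronger local item dependence within a testlet.
    """
    rng = random.Random(seed)
    n_items = n_testlets * testlet_length

    # Hypothetical generating distributions for item parameters (assumptions).
    items = [
        (
            rng.lognormvariate(0.0, 0.3),  # a: discrimination
            rng.gauss(0.0, 1.0),           # b: difficulty
            rng.uniform(0.1, 0.25),        # c: pseudo-guessing
        )
        for _ in range(n_items)
    ]

    sd = math.sqrt(testlet_var)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # general ability
        # One effect per testlet for this examinee; all zeros when sd == 0.
        gammas = [rng.gauss(0.0, sd) for _ in range(n_testlets)]
        row = []
        for i, (a, b, c) in enumerate(items):
            gamma = gammas[i // testlet_length]  # effect of the item's testlet
            p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b - gamma)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data
```

For example, `simulate_trt_3pl(2000, 4, 10, 1.0)` would return 2000 response vectors for a 40-item form made of four 10-item testlets; the sample size and number of testlets here are arbitrary choices, not values taken from the study.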

Both real and simulated data analyses were conducted in this study. Two testlet-based tests (i.e., Test A and Test B) that differed in subject, test length, and testlet length were used in the real data analysis. For the simulated data analysis, two main factors were investigated: (1) testlet length (5 or 10), and (2) LID level within testlets, defined by the testlet effect variance (0, 0.25, 0.5625, 0.75, 1, and 1.5). For the unidimensional IRT models (i.e., the 3PL IRT model and the GRM), the unidimensional IRT true score and observed score equating procedures explained in Kolen and Brennan (2004) were used. For the two multidimensional IRT models investigated (i.e., the 3PL TRT model and the bifactor model), the unidimensional approximation of the multidimensional item response theory (MIRT) true score equating procedure and the unidimensional approximation of the MIRT observed score equating procedure (Brossman & Lee, 2013) were applied. The traditional equipercentile equating method was used as the baseline for comparison in both the real and the simulated data analyses.
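The general idea of unidimensional IRT true score equating under the 3PL model can be sketched as follows: find the ability theta at which the Form X test characteristic curve equals a given true score, then evaluate the Form Y test characteristic curve at that theta. The Python below is a minimal illustration under stated assumptions, not the procedure as implemented in the study: the function names are hypothetical, item parameters are assumed to be on a common scale already, the logistic metric is used without a 1.7 scaling constant, and bisection stands in for whatever root-finding method a production program would use.

```python
import math


def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta.

    items is a list of (a, b, c) 3PL parameter tuples.
    """
    return sum(
        c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
        for a, b, c in items
    )


def true_score_equate(x, items_x, items_y, lo=-12.0, hi=12.0, tol=1e-8):
    """Return the Form Y true score equivalent of Form X true score x."""
    lower = sum(c for _, _, c in items_x)
    if not lower < x < len(items_x):
        # The TCC only takes values strictly between sum(c) and the test length.
        raise ValueError("x must lie between the sum of c parameters and the test length")
    # Bisection on theta: the TCC is increasing in theta when every a > 0.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid, items_x) < x:
            lo = mid
        else:
            hi = mid
    return tcc(0.5 * (lo + hi), items_y)
```

Scores at or below the sum of the c parameters have no corresponding theta and need a separate rule, which this sketch leaves out by raising an error instead.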

The study found that both testlet length and LID level affected the performance of the investigated models in IRT true and observed score equating of testlet-based tests. When the traditional 3PL IRT model was used for tests with long testlets, higher levels of local item dependence led to IRT equating results that deviated further from those obtained with the baseline method. However, the effect of local item dependence on IRT equating results was not prominent for tests with short testlets.

Moreover, for tests consisting of long testlets (e.g., a testlet length of 10 or more) and having a very low level of local item dependence (e.g., a LID level of 0.25 or lower), and for tests consisting of short testlets (e.g., a testlet length around 5), all four investigated IRT models worked well in IRT true and observed score equating. For tests with long testlets and a relatively high level of local item dependence (e.g., a LID level of 0.5625 or higher), the GRM, bifactor, and TRT models outperformed the traditional 3PL IRT model in IRT true and observed score equating of testlet-based tests.

The study suggested that models for IRT true and observed score equating of testlet-based tests should be selected with respect to the features of the tests and the groups of examinees from which the data are collected. It is hoped that this study encourages researchers to identify differences among existing models for IRT true and observed score equating of testlet-based tests with various features, and to develop new models that are appropriate for testlet-based tests and yield accurate IRT number-correct score equating results.

Educational Psychology


Details

Title: Model selection for IRT equating of Testlet-based tests in the random groups design
Creators: Juan Chen - University of Iowa
Contributors: Michael J. Kolen (Advisor)
Deborah J. Harris (Advisor)
Won-Chan Lee (Committee Member)
Catherine J. Welch (Committee Member)
Aixin Tan (Committee Member)
Resource Type: Dissertation
Degree Awarded: Doctor of Philosophy (PhD), University of Iowa
Degree in: Psychological and Quantitative Foundations
Date degree season: Autumn 2014
DOI: 10.17077/etd.9yfctk57
Publisher: University of Iowa
Number of pages: xiv, 150 pages
Language: English
Description illustrations: illustrations
Description bibliographic: Includes bibliographical references (pages 103-108).
Public Abstract (ETD): Unidimensional item response theory (IRT) equating methods (Kolen & Brennan, 2004) are often used in testing programs to adjust scores for differences in difficulty across multiple forms of a test. When test items are organized into testlets that share a common stimulus, multidimensionality and local item dependence (LID) might be present, resulting in a secondary dimension that is related to the stimulus. In this case, the testlet-based test might measure constructs in addition to examinees’ ability level.

This study compares the performance of four different models in IRT true and observed score equating of testlet-based tests with different testlet lengths and LID levels. These models include the three-parameter logistic (3PL) IRT model, graded response model (GRM), 3PL testlet response theory model (TRT), and bifactor model.

The study found that both testlet length and LID level affected the performance of the investigated IRT equating methods for testlet-based tests. For tests with long testlets, higher LID levels led to 3PL IRT equating results that deviated further from those obtained with the baseline method. However, this trend was not as evident for tests containing short testlets. Moreover, for tests with long testlets and a low LID level, and for tests with short testlets, all four investigated IRT models worked well in IRT true and observed score equating. For tests with long testlets and a relatively high LID level, the GRM, bifactor, and TRT models outperformed the traditional 3PL IRT model in IRT true and observed score equating of testlet-based tests.
Academic Unit: Psychological and Quantitative Foundations
Record Identifier: 9983776508602771
