The use of testlets in a test can cause multidimensionality and local item dependence (LID), which can lead to inaccurate estimation of item parameters and, in turn, compromise the quality of item response theory (IRT) true and observed score equating of testlet-based tests. Both unidimensional and multidimensional IRT models have been developed to control the LID caused by testlets. The purposes of the current study were to (1) investigate how different levels of LID affect IRT true and observed score equating of testlet-based tests when the traditional three-parameter logistic (3PL) IRT model is used for calibration, and (2) compare the performance of four IRT models, namely the 3PL IRT model, the graded response model (GRM), the testlet response theory (TRT) model, and the bifactor model, in IRT true and observed score equating of testlet-based tests with various levels of LID.
Both real and simulated data analyses were conducted. Two testlet-based tests (Test A and Test B) that differed in subject area, test length, and testlet length were used in the real data analysis. The simulated data analysis investigated two main factors: (1) testlet length (5 or 10 items), and (2) the LID level within testlets, defined by the testlet effect variance (0, 0.25, 0.5625, 0.75, 1, or 1.5). For the unidimensional IRT models (i.e., the 3PL IRT model and the GRM), the unidimensional IRT true score and observed score equating procedures described in Kolen and Brennan (2004) were used. For the two multidimensional IRT models (i.e., the 3PL TRT model and the bifactor model), the unidimensional approximation of the multidimensional item response theory (MIRT) true score equating procedure and the unidimensional approximation of the MIRT observed score equating procedure (Brossman & Lee, 2013) were applied. The traditional equipercentile equating method served as the baseline for comparison in both the real and simulated data analyses.
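To illustrate how a testlet effect variance induces LID, the following is a minimal sketch of generating responses under a 3PL testlet model, in which an examinee-by-testlet effect with the chosen variance is subtracted from ability inside the logistic term. It is not the study's actual simulation code. The function name `simulate_testlet_3pl`, the item parameter distributions, the sample size, and the number of testlets are all assumptions made for illustration; only the testlet lengths and variance levels come from the design described above.

```python
import math
import random


def simulate_testlet_3pl(n_examinees, n_testlets, testlet_length,
                         testlet_var, seed=0):
    """Simulate 0/1 responses under a 3PL testlet model.

    P(correct) = c + (1 - c) * logistic(a * (theta - b - gamma)),
    where gamma is an examinee-by-testlet effect drawn from
    N(0, testlet_var). A variance of 0 reduces to the ordinary 3PL model.
    """
    rng = random.Random(seed)
    n_items = n_testlets * testlet_length

    # Hypothetical item parameter distributions (NOT taken from the study).
    a = [rng.lognormvariate(0.0, 0.3) for _ in range(n_items)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    c = [rng.uniform(0.10, 0.25) for _ in range(n_items)]

    sd = math.sqrt(testlet_var)
    responses = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # examinee ability
        # One testlet effect per testlet, shared by all items within it.
        gamma = [rng.gauss(0.0, sd) for _ in range(n_testlets)]
        row = []
        for i in range(n_items):
            d = i // testlet_length  # index of the testlet containing item i
            z = a[i] * (theta - b[i] - gamma[d])
            p = c[i] + (1.0 - c[i]) / (1.0 + math.exp(-z))
            row.append(1 if rng.random() < p else 0)
        responses.append(row)
    return responses


if __name__ == "__main__":
    # LID levels from the simulation design; other settings are assumed.
    for var in (0, 0.25, 0.5625, 0.75, 1, 1.5):
        data = simulate_testlet_3pl(500, 4, 10, var)
        mean_score = sum(sum(row) for row in data) / len(data)
        print(f"testlet variance {var}: mean number-correct {mean_score:.2f}")
```

Because items in the same testlet share one effect per examinee, a larger variance makes their responses more strongly correlated than ability alone would imply, which is the dependence a unidimensional 3PL calibration ignores.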
The study found that both testlet length and LID level affected the performance of the investigated models in IRT true and observed score equating of testlet-based tests. When the traditional 3PL IRT model was used for tests with long testlets, higher levels of LID led to IRT equating results that deviated further from those obtained with the baseline method. However, the effect of LID on IRT equating results was not prominent for tests with short testlets.
Moreover, all four investigated IRT models worked well in IRT true and observed score equating for two kinds of tests: those consisting of long testlets (e.g., a testlet length of 10 or more) with a very low level of LID (e.g., an LID level of 0.25 or lower), and those consisting of short testlets (e.g., a testlet length around 5). For tests with long testlets and a relatively high level of LID (e.g., an LID level of 0.5625 or higher), the GRM, bifactor, and TRT models outperformed the traditional 3PL IRT model in IRT true and observed score equating.
The study suggests that the selection of models for IRT true and observed score equating of testlet-based tests should take into account the features of the tests and the groups of examinees from which the data are collected. It is hoped that this study encourages researchers to identify differences among existing models for IRT true and observed score equating of testlet-based tests with various features, and to develop new models that are appropriate for testlet-based tests and yield accurate IRT number-correct score equating results.