Simulated Equating Using Several Item Response Curves

Author(s):
Boldt, Robert F.
Publication Year:
1994
Report Number:
RR-93-57, TOEFL-TR-08
Subject/Key Words:
Equated scores, item response models, simulation

Abstract

Previous research has shown that a single factor, in the factor-analytic sense, gave a very good account of item covariances within TOEFL® sections. This result is consistent with the assumption that the product of a person parameter and an item parameter models the probability that the person would pass the item. The assumption forms a simple item response model. A subsequent cross-validation study using this model supported the efficacy of that assumption by predicting item success as accurately as did the three-parameter logistic model (3PL) and a modified Rasch model. The purpose of the current study was to extend the comparison of models to an equating context. "Equating" is a statistical process that identifies comparable scores from parallel tests administered to different populations. In an operational context, equating serves to facilitate comparison of scores generated on different forms of a test. The present study consisted of simulation trials designed to "equate the test to itself." That is, equating sample data were generated from administration of identical item sets. This is a useful test of model validity, because if the same item sets are used to equate, an accurate equating would identify equal scores as comparable. Discrepancies between comparable scores signify model misfit or random error. Equatings that used procedures based on each model were carried out under several conditions, and the results were compared. The conditions varied by sample size, anchor test difficulty, and the TOEFL section equated. To compound the difficulty of the equating task, results were based on equating samples that were mismatched in performance on a correlated measure. Most discrepancies between comparable scores were largest at the extremes.
The largest discrepancies between scores identified as comparable occurred for the 3PL and modified Rasch models at the lower extreme scores, and for the simple models at the upper extreme score. For the 1,000-case sample, most were in fractions of score points. As expected, 3PL equatings exhibited the largest discrepancies for the 100-case sample. The simple item response model yielded the most discrepancies that were in excess of the standard error of measurement, in part because with that model the maximum discrepancies occurred at the top of the score range, where the standard errors of measurement approach zero. Imposing an upper bound on the probability of correct response in the simple model markedly reduced its errors. TOEFL scores are used for educational decisions. If it is true that most institutions' cut-scores occur in the mid-score ranges, the present study suggests that 3PL should not be used if equating samples are substantially reduced from the present size. The other models are promising for small-sample equating, with the one-parameter logistic models being most promising.
