
An Investigation of the Use of Simplified IRT Models for Scaling and Equating the TOEFL Test: Technical Report IRT TOEFL

Way, Walter D.; Reese, Clyde M.
Publication Year:
Report Number:
RR-90-29, TOEFL-TR-02
ETS Research Report
Document Type:
Page Count:
Subject/Key Words:
Equated Scores, Estimation (Mathematics), Item Response Theory (IRT), Mathematical Models, Scaling, Statistical Analysis, Test of English as a Foreign Language (TOEFL)


The purpose of this study was to explore the use of two alternative item response theory (IRT) estimation models in the scaling and equating of the TOEFL® test, a modified one-parameter model (M1PL) and a modified two-parameter model (M2PL), and to compare item scaling and test equating results based on these two models with results based on the three-parameter model (3PL) that is currently being used to scale and equate the TOEFL test. The study employed a design in which a typical TOEFL equating was simulated using artificial data. The simulated equatings were compared on three criteria: correlations between estimated and generating parameters, model-data fit, and concordance of simulated score conversions with conversions based on the generating parameters. The results clearly indicated that the 3PL model performed better than the M1PL and M2PL models on each of the evaluation criteria. There was also evidence that the M2PL model performed better than the M1PL model, particularly in model-data fit and in the weighted root mean square difference statistics used to evaluate the simulated score conversions. Discrepancies between score conversions based on the M1PL and M2PL models and those based on the 3PL model tended to occur at the lower and upper ends of the score scales. Finally, the results for the 3PL model indicated that, while correlations between item parameter estimates and generating parameters tended to be affected by sample size, neither the quality of model-data fit nor the quality of the simulated equatings appeared to be sensitive to the different sample sizes used in the study.
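
For readers unfamiliar with the quantities named in the abstract, the following is a minimal illustrative sketch, not the report's own procedure. It shows the standard 3PL item response function and a generic weighted root mean square difference between two score conversions. The function names, the scaling constant D = 1.7, and the choice of weights are assumptions made for illustration; the abstract does not state how the modified models were parameterized or how the statistic was weighted.

```python
import math


def p_3pl(theta, a, b, c, D=1.7):
    """Standard 3PL probability of a correct response.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing). D = 1.7 is the conventional
    scaling constant and is an assumption here.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def weighted_rmsd(conv_x, conv_y, weights):
    """Weighted root mean square difference between two score conversions.

    conv_x, conv_y: converted scores at the same raw-score points.
    weights: hypothetical nonnegative weights (for example, raw-score
    frequencies); the report's actual weighting is not given in the abstract.
    """
    total = sum(weights)
    squared = sum(w * (x - y) ** 2 for x, y, w in zip(conv_x, conv_y, weights))
    return math.sqrt(squared / total)
```

With c = 0 the function reduces to the ordinary two-parameter logistic form, which is why a nonzero lower asymptote mainly changes predicted performance for low-ability examinees.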
