The Effect of Small Calibration Sample Sizes on TOEFL® IRT-Based Equating

Tang, K. Linda; Way, Walter D.; Carey, Patricia A.
Publication Year:
Report Number:
Document Type:
Subject/Key Words:
BILOG, equated scores, item response theory, LOGIST, sample size, scaling


The present study compared the performance of LOGIST and BILOG on TOEFL® IRT-based scaling and equating, using both real and simulated data and two calibration structures. Applications of IRT for the TOEFL program are based on the three-parameter logistic (3PL) model. The results show that item parameter estimates obtained from the smaller real-data samples were more consistent with the larger-sample estimates when based on BILOG than when based on LOGIST. In addition, the root mean squared error statistics suggest that the BILOG estimates of the item parameters and item characteristic curves were closer in magnitude to the "true" parameter values than were the LOGIST estimates. The equating results based on the parameter estimates suggest that the rule-of-thumb recommendation that pretest sample sizes be at least 1000 for LOGIST should be retained if at all possible.
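
For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of a 3PL item characteristic curve and a root mean squared error between an estimated curve and a "true" curve. It is an illustration only, not the procedure used in the report: the function names, the ability grid, the example parameter values, and the scaling constant D = 1.7 are all assumptions made here.

```python
import math

# Conventional logistic scaling constant; assumed for this sketch,
# not taken from the report.
D = 1.7


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def rmse(estimates, truths):
    """Root mean squared error between two equal-length sequences."""
    if len(estimates) != len(truths):
        raise ValueError("sequences must have the same length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


def icc_rmse(est_params, true_params, thetas):
    """RMSE between an estimated and a true ICC over a grid of abilities.

    est_params and true_params are (a, b, c) tuples; thetas is a
    hypothetical grid of ability values chosen by the caller.
    """
    est_curve = [icc_3pl(t, *est_params) for t in thetas]
    true_curve = [icc_3pl(t, *true_params) for t in thetas]
    return rmse(est_curve, true_curve)


# Hypothetical example values, for illustration only.
theta_grid = [x / 10.0 for x in range(-30, 31)]
example_error = icc_rmse((1.1, 0.2, 0.18), (1.0, 0.0, 0.20), theta_grid)
```

A smaller value of `example_error` means the estimated curve tracks the true curve more closely across the chosen ability grid, which is the sense in which the abstract compares the two programs' estimates.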
