
A Comparative Study of Methods of Equating TOEFL Test Scores

Hicks, Marilyn M.
Publication Year:
Report Number:
ETS Research Report
Document Type:
Page Count: 67
Subject/Key Words:
Equated Scores, Item Response Theory (IRT), Statistical Analysis, Test of English as a Foreign Language (TOEFL)


Six methods of equating TOEFL test scores were evaluated in terms of scale stability, using two kinds of samples: the usual groups of examinees tested at each TOEFL administration, and groups of examinees controlled for native-language representation. The equating methods included three IRT variants (fixed b's scaling; a one-parameter model in which the a- and c-parameters were fixed at constant values; and a model in which all three parameters were re-estimated) and three conventional equating methods (Tucker, Levine, and equipercentile). The methods were applied to Section II, Structure and Written Expression, and Section III, Reading Comprehension and Vocabulary. For the regular group of examinees, fixed b's IRT equating exhibited the greatest scale stability for both sections, followed by the one-parameter IRT model and then Tucker linear equating. For most equating methods, controlling for native language increased scale stability relative to the regular group for Section II but produced more error for Section III. This interaction between Section III and the controlled group may be related to the differential performance of language groups on Section III observed in previous studies. The results of this study supported the continued use of fixed b's scaling for TOEFL data, using a random sample of examinees from the total testing group. (67pp.)
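The three IRT variants differ in which item parameters (a: discrimination, b: difficulty, c: lower asymptote) are estimated. The sketch below is a minimal illustration that assumes the standard three-parameter logistic (3PL) model with scaling constant D = 1.7. The abstract does not state the exact parameterization used in the report. The function names and the constants a = 1.0 and c = 0.2 in the constrained variant are hypothetical placeholders, not values taken from the report. Parameter calibration is not shown, including holding the b's of previously scaled items fixed.

```python
import math

# Assumed scaling constant that makes the logistic curve approximate the
# normal ogive; the report's actual choice is not stated in the abstract.
D = 1.7


def p_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: lower asymptote ("guessing").
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_constrained(theta: float, b: float,
                  a_fixed: float = 1.0, c_fixed: float = 0.2) -> float:
    """Constrained variant: a and c are held at constants, so only b
    differs across items. The defaults are hypothetical placeholders."""
    return p_3pl(theta, a_fixed, b, c_fixed)


def expected_number_right(theta: float, items) -> float:
    """Test characteristic function: expected number-right score at
    ability theta, summed over (a, b, c) triples."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)


if __name__ == "__main__":
    # At theta == b the 3PL probability is c + (1 - c) / 2.
    print(p_3pl(0.5, a=1.2, b=0.5, c=0.2))
    print(p_constrained(0.0, b=0.0))
    print(expected_number_right(0.0, [(1.0, -0.5, 0.2), (0.8, 0.3, 0.15)]))
```

In the fully re-estimated variant all three parameters are free for every item. In the constrained variant only b is free. Under fixed b's scaling, the difficulty values of items already on the scale are carried over from earlier calibrations, which anchors the new form to the existing scale.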
