
Reader Calibration and Its Potential Role in Equating for the Test of Written English

Linacre, J. Michael; Marr, Diana; Myford, Carol M.
Report Number: RR-95-40, TOEFL-RR-52
ETS Research Report
Subject/Key Words: Equated Scores, Essay Tests, FACETS, Interrater Reliability, Item Response Theory (IRT), Performance Assessment, Rasch Model, Test of Written English (TWE), Writing Evaluation


When judges use a rating scale to rate performances, some judges may rate more severely than others, giving lower ratings. Judges may also differ in the consistency with which they apply rating criteria. In this study, we pilot-tested a quality control procedure that provides a means for monitoring and adjusting for differences in reader performance. We employed FACETS, a Rasch-based rating scale analysis procedure, to calibrate readers within and across two TWE® (Test of Written English™) administrations. Our study had four goals: (a) to determine the extent to which individual readers can be considered interchangeable, both within and across TWE administrations; (b) to investigate reader characteristics and their relationships to the volume and quality of ratings; (c) to examine the effectiveness of third readings to adjudicate rating discrepancies; and (d) to make a preliminary determination of the feasibility of using FACETS Reader Severity Measures as a first step toward equating TWE scores across different topics and readers.
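The idea of a reader severity measure can be illustrated with a minimal sketch of a rating-scale many-facet Rasch model, the general family of models that FACETS estimates. The abstract does not give the exact model specification used in the study, so everything below is an assumption for illustration: the function names, the six-category scale, and all numeric values are hypothetical and are not taken from the report. In this form, the log-odds of receiving category k rather than k-1 equal examinee ability minus reader severity minus the step difficulty for category k.

```python
import math


def category_probabilities(ability, severity, thresholds):
    """Rating-category probabilities under a rating-scale many-facet
    Rasch model (hypothetical helper; all quantities in logits).

    thresholds[k-1] is the step difficulty of moving from category
    k-1 to category k, so len(thresholds) + 1 categories exist.
    """
    # Log-numerator of the lowest category is 0; each higher category
    # adds (ability - severity - step difficulty).
    log_numerators = [0.0]
    for step in thresholds:
        log_numerators.append(log_numerators[-1] + ability - severity - step)

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_rating(ability, severity, thresholds, lowest=1):
    """Model-expected rating for one examinee scored by one reader."""
    probs = category_probabilities(ability, severity, thresholds)
    return sum((lowest + k) * p for k, p in enumerate(probs))


# Illustrative values only (assumed, not from the report):
# six rating categories, hence five step difficulties.
thresholds = [-2.0, -1.0, 0.0, 1.0, 2.0]

# The same essay writer (ability 0.5 logits) scored by two readers.
lenient = expected_rating(0.5, -0.5, thresholds)  # lenient reader
severe = expected_rating(0.5, 0.5, thresholds)    # severe reader
print(f"lenient reader: {lenient:.2f}, severe reader: {severe:.2f}")
```

Under these assumptions the severe reader yields a lower expected rating for the same examinee. Once severity has been estimated, that is the kind of difference an adjustment for reader severity would aim to remove.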
