Reader Calibration and Its Potential Role in Equating for the Test of Written English

Myford, Carol M.; Marr, Diane; Linacre, J. Michael
Report Number: RR-95-40, TOEFL-RR-52
Subject/Key Words: Equated scores; essay tests; FACETS; interrater reliability; item response theory (IRT); performance assessment; Rasch model; writing evaluation


When judges use a rating scale to rate performances, some judges may rate more severely than others, systematically assigning lower ratings to work of the same quality. Judges may also differ in how consistently they apply the rating criteria. In this study, we pilot-tested a quality control procedure for monitoring and adjusting for differences in reader performance. We used FACETS, a Rasch-based rating scale analysis procedure, to calibrate readers within and across two TWE® (Test of Written English™) administrations. The study had four goals: (a) to determine the extent to which individual readers can be considered interchangeable, both within and across TWE administrations; (b) to investigate reader characteristics and their relationships to the volume and quality of ratings; (c) to examine the effectiveness of third readings in adjudicating rating discrepancies; and (d) to make a preliminary determination of whether FACETS reader severity measures could feasibly serve as a first step toward equating TWE scores across different topics and readers.
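The abstract does not state the model FACETS fits. As a hedged sketch, assuming the rating-scale form of the many-facet Rasch model, log(P_k / P_(k-1)) = B_n - D_i - C_j - F_k, where B_n is the writer's ability, D_i the topic's difficulty, C_j the reader's severity, and F_k the difficulty of the step from category k-1 to k, the code below shows how a more severe reader lowers the expected rating for the same essay. The function names, threshold values, and severities are hypothetical illustrations, not estimates from this study.

```python
import math


def category_probabilities(ability, difficulty, severity, thresholds):
    """Category probabilities under a rating-scale many-facet Rasch model (sketch).

    thresholds[k-1] plays the role of F_k, the step from category k-1 to k.
    Returns probabilities for categories 0..len(thresholds).
    """
    # The log-numerator of category k is the cumulative sum of (B - D - C - F_j)
    # for j = 1..k; category 0 has log-numerator 0 by convention.
    log_numerators = [0.0]
    running = 0.0
    for step in thresholds:
        running += ability - difficulty - severity - step
        log_numerators.append(running)
    # Normalize with a numerically stable softmax.
    top = max(log_numerators)
    weights = [math.exp(x - top) for x in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rating(ability, difficulty, severity, thresholds, lowest=1):
    """Expected rating on a scale whose lowest category is `lowest` (hypothetical helper)."""
    probs = category_probabilities(ability, difficulty, severity, thresholds)
    return sum((lowest + k) * p for k, p in enumerate(probs))


if __name__ == "__main__":
    # Hypothetical step thresholds for a 6-point scale (5 steps, 6 categories).
    steps = [-2.5, -1.0, 0.0, 1.0, 2.5]
    essay_ability, topic_difficulty = 0.0, 0.0
    for label, severity in [("lenient reader", -0.5), ("severe reader", 0.5)]:
        print(label, round(expected_rating(essay_ability, topic_difficulty, severity, steps), 2))
    # Setting severity to 0 gives an illustrative severity-adjusted expectation.
    print("adjusted", round(expected_rating(essay_ability, topic_difficulty, 0.0, steps), 2))
```

Because reader severity enters the model as an additive term on the same logit scale as writer ability and topic difficulty, an estimated severity can, in principle, be removed from a rating; this is the sense in which calibrated reader severity measures might support score adjustment or equating.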
