When judges use a rating scale to rate performances, some judges may rate more severely than others, awarding systematically lower ratings. Judges may also differ in the consistency with which they apply the rating criteria. In this study, we pilot-tested a quality control procedure for monitoring and adjusting for such differences in reader performance. We employed FACETS, a Rasch-based rating scale analysis procedure, to calibrate readers within and across two TWE® (Test of Written English™) administrations. Our study had four goals: (a) to determine the extent to which individual readers can be considered interchangeable, both within and across TWE administrations; (b) to investigate reader characteristics and their relationship to the volume and quality of ratings; (c) to examine the effectiveness of third readings in adjudicating rating discrepancies; and (d) to make a preliminary determination of the feasibility of using FACETS Reader Severity Measures as a first step toward equating TWE scores across different topics and readers.
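The abstract does not state the model itself. For readers unfamiliar with FACETS, the many-facet Rasch rating scale model it implements can be sketched as follows; this is the standard Linacre formulation, and the exact parameterization used in the study is an assumption:

```latex
% Many-facet Rasch rating scale model (standard FACETS formulation;
% whether the study used this exact variant is an assumption).
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
% where
%   P_{nijk} : probability that examinee n, on topic i, is awarded
%              category k by reader j
%   B_n      : writing proficiency of examinee n
%   D_i      : difficulty of writing topic i
%   C_j      : severity of reader j (the "Reader Severity Measure")
%   F_k      : difficulty of the step from category k-1 to category k
```

Under this kind of model, the severity parameter for each reader is estimated jointly with examinee proficiency and topic difficulty, which is what makes it possible in principle to adjust scores for differences in reader severity across topics and administrations.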