
Strengthening the Ties That Bind: Improving the Linking Network in Sparsely Connected Rating Designs

Myford, Carol M.
Publication Year:
Report Number: RR-00-09, TOEFL-TR-15
ETS Research Report
Document Type:
Page Count:
Subject/Key Words: Test of English as a Foreign Language (TOEFL), Speaking Assessment, Item Response Theory (IRT), Performance Assessment, Quality Control, Rater Performance, Oral Language, Rasch Measurement, FACETS, Test of Spoken English (TSE)


The purpose of this study was to evaluate the effectiveness of a strategy for linking raters when large numbers of raters are involved in a scoring session and the overlap among raters is minimal. In sparsely connected rating designs, the number of examinees that any given pair of raters has scored in common is very limited, so connections between raters may be weak and tentative at best. The linking strategy we employed involved having all raters in a Test of Spoken English (TSE) scoring session rate a small set of six benchmark audiotapes, in addition to the examinee tapes that each rater scored as part of his or her normal workload. Using output from Facets analyses of the rating data, we looked at the effects of embedding blocks of ratings from various smaller sets of these benchmark tapes on key indicators of rating quality. We found that all of our benchmark sets were effective for establishing at least the minimal connectivity needed in the rating design to allow placement of all raters and all examinees on a single scale. When benchmark sets were used, the highest scoring benchmarks (i.e., those from examinees who scored 50s and 60s across the items) produced the highest quality linking (i.e., the most stable linking). The least consistent benchmark sets (i.e., those that were somewhat harder to rate because an examinee's performance varied across items) tended to provide fairly stable links. The most consistent benchmarks (i.e., those that were somewhat easier to rate because an examinee's performance was similar across items) and the middle scoring benchmarks (i.e., those from examinees who scored 30s and 40s across the items) tended to provide less stable linking. Low scoring benchmark sets provided the least stable linking. When a single benchmark tape was used, the highest scoring single tape provided higher quality linking than either the least consistent or the most consistent benchmark tape.
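
The idea of "minimal connectivity" can be illustrated with a small sketch. It is not the Facets program's own procedure, and the rater and examinee labels and the function name `connected_subsets` are hypothetical. The sketch treats each rating as a link between a rater and an examinee, and counts how many disjoint subsets result. Assuming a design in which no two raters share an examinee, adding one benchmark tape rated by every rater merges the subsets into one. It says nothing about how stable the resulting links are, which is the question the study addresses.

```python
from collections import defaultdict


def connected_subsets(ratings):
    """Group raters and examinees into linked subsets.

    ratings: iterable of (rater, examinee) pairs (hypothetical labels).
    Returns a list of sets; one set means the design is fully connected.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for rater, examinee in ratings:
        # Tag nodes so a rater and an examinee with the same label stay distinct.
        root_r = find(("rater", rater))
        root_e = find(("examinee", examinee))
        if root_r != root_e:
            parent[root_r] = root_e

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Illustrative data only: three raters with no examinees in common.
workload = [("R1", "E1"), ("R1", "E2"), ("R2", "E3"), ("R3", "E4")]
# One benchmark tape "B1" rated by every rater.
benchmark = [("R1", "B1"), ("R2", "B1"), ("R3", "B1")]

subsets_without = connected_subsets(workload)
subsets_with = connected_subsets(workload + benchmark)
print(len(subsets_without), len(subsets_with))
```

In this toy design the normal workload alone leaves three disjoint subsets, so raters in different subsets cannot be compared on a common scale; the shared benchmark tape joins them into a single subset.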
