
Investigating Constructed-Response Scoring Over Time: The Effects of Study Design on Trend Rescore Statistics

Donoghue, John R.; McClellan, Catherine; Hess, Melinda R.
Publication Year:
Report Number:
ETS Research Report
Document Type:
Page Count:
Subject/Key Words:
Large-Scale Assessment, Constructed Response, Scoring, Trend Scoring Method, Score Comparison, Human Rater, Rater Monitoring, Interrater Reliability, Automated Scoring, Reweighting, Statistics, Rescoring


When constructed-response items are administered a second time, it is necessary to evaluate whether the raters at the current administration (Time B) have drifted from the scoring of the original administration (Time A). To study this, Time A papers are sampled and rescored by Time B raters. The scores are commonly compared using the proportion of exact agreement across times and/or t-statistics comparing Time A means to Time B means. These rescores are often analyzed with procedures that assume a multinomial sampling model, which is incorrect. The correct model is product-multinomial, which reflects the stratification of the sample by Time A score. Using direct computation, the research report demonstrates that both the proportion of exact agreement and the t-statistic can deviate substantially from their expected behavior and so give misleading results. Reweighting the rescore table gives each statistic the correct expected value but does not guarantee that the usual sampling distributions hold. The results also apply to a wider class of situations in which a set of papers is scored by one group of raters or scoring engine and a sample is then selected for evaluation by a different group of raters or scoring engine.
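The effect of stratification on the two statistics can be illustrated with a small numerical sketch. It is not the report's own procedure: the function names, the 3-level score scale, the rescore counts, and the Time A population proportions are all hypothetical, chosen only to show how a table sampled with equal numbers per Time A score can differ from the same table reweighted to the Time A score distribution. Reweighting here means rescaling each Time A row to its assumed population proportion, which is one plausible reading of "reweighting the rescore table".

```python
import numpy as np

def exact_agreement(table):
    """Proportion of papers on the diagonal (Time A score == Time B score)."""
    table = np.asarray(table, dtype=float)
    return np.trace(table) / table.sum()

def mean_difference(table):
    """Time B mean minus Time A mean, with score levels coded 0..k-1 (assumed)."""
    table = np.asarray(table, dtype=float)
    scores = np.arange(table.shape[0], dtype=float)
    total = table.sum()
    mean_a = table.sum(axis=1) @ scores / total  # rows index Time A scores
    mean_b = table.sum(axis=0) @ scores / total  # columns index Time B scores
    return mean_b - mean_a

def reweight_rows(table, pop_props):
    """Rescale each Time A row so row totals match the population proportions.

    Assumes every row has at least one rescored paper.
    """
    table = np.asarray(table, dtype=float)
    pop_props = np.asarray(pop_props, dtype=float)
    conditional = table / table.sum(axis=1, keepdims=True)  # P(B = j | A = i)
    return conditional * pop_props[:, None]

# Hypothetical rescore table: 100 papers sampled at each Time A score level.
sample = np.array([
    [90, 10,  0],
    [10, 80, 10],
    [ 0, 20, 80],
])
# Hypothetical Time A score distribution in the full set of papers.
pop_props = np.array([0.6, 0.3, 0.1])

reweighted = reweight_rows(sample, pop_props)

print("raw agreement:       ", round(exact_agreement(sample), 4))
print("reweighted agreement:", round(exact_agreement(reweighted), 4))
print("raw mean diff:       ", round(mean_difference(sample), 4))
print("reweighted mean diff:", round(mean_difference(reweighted), 4))
```

With these made-up numbers the raw exact agreement is about 0.833 and the reweighted value is 0.86, while the mean difference changes sign (about -0.033 raw versus 0.04 reweighted). The sketch addresses only expected values; as the abstract notes, reweighting does not guarantee that the usual sampling distributions hold.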
