
Exploration of the Proportional Reduction in Mean-Squared Error for Evaluating Automated Scores

Author(s):
Casabianca, Jodi M.; McCaffrey, Daniel F.; Johnson, Matthew; Ricker-Pedley, Kathryn L.; Rotou, Ourania; Martineau, Joseph
Publication Year:
2023
Report Number:
RM-23-01
Source:
ETS Research Memorandum
Document Type:
Report
Page Count:
29
Subject/Key Words:
Proportional Reduction in Mean-Squared Error (PRMSE), Artificial Intelligence and Automated Scoring Technologies (AIAST), Human Raters, Quadratic Weighted Kappa (QWK), Evaluation Criteria, English Language Assessment (ELA), Item Response Theory (IRT), Test Characteristic Curve (TCC) Linking, Item Scores

Abstract

A key challenge to the use of automated scoring is developing evidence that the resulting scores support the goals and claims of the assessment. When the item being scored was designed to be evaluated by a human rater and the automated score is a prediction of a human rating, the concordance between the automated scores and the human ratings is a critical piece of evidence. Current practice for evaluating automated scoring models typically uses the quadratic weighted kappa (QWK) and the correlation between human and automated scores. This study explores the proportional reduction in mean-squared error (PRMSE) as an alternative to those standard concordance measures. To date, no rules of thumb or thresholds have been established for applying the PRMSE in model evaluation. Using empirical data, we explore various conditions by manipulating automated scores to induce changes in the PRMSE and in the properties of the final reportable scores, which allows us to link variation in the PRMSE to its impact on reported scores. Based on the results, we establish possible guidelines for using the PRMSE in practice, given several factors including the length and stakes of the test.
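To illustrate the kind of comparison the abstract describes, the sketch below shows one common way PRMSE and QWK can be computed for item scores when each response has two human ratings. This is not code from the report; it follows the widely used Haberman-style estimator of PRMSE with respect to the latent true score, and all variable names (h1, h2, machine) and the simulated data are purely illustrative.

```python
# Minimal sketch (assumptions noted above): PRMSE of machine scores against the
# true score, estimated from double-scored human ratings, alongside QWK.
import numpy as np
from sklearn.metrics import cohen_kappa_score


def prmse_true_score(h1, h2, machine):
    """PRMSE of machine scores as predictors of the latent true score,
    estimated from two human ratings per response (Haberman-style estimator;
    assumes the two raters have comparable error variance)."""
    h1, h2, machine = map(np.asarray, (h1, h2, machine))
    h_bar = (h1 + h2) / 2.0                                  # mean human score per response
    var_err = np.mean((h1 - h2) ** 2) / 2.0                  # single-rater error variance
    var_true = np.var(h_bar, ddof=1) - var_err / 2.0         # true-score variance
    mse_true = np.mean((machine - h_bar) ** 2) - var_err / 2.0  # MSE vs. true score
    return 1.0 - mse_true / var_true


# Purely illustrative simulated integer item scores on a 0-4 scale.
rng = np.random.default_rng(0)
true = rng.integers(0, 5, size=500)
h1 = np.clip(true + rng.integers(-1, 2, size=500), 0, 4)
h2 = np.clip(true + rng.integers(-1, 2, size=500), 0, 4)
machine = np.clip(true + rng.integers(-1, 2, size=500), 0, 4)

print("PRMSE:", prmse_true_score(h1, h2, machine))
print("QWK  :", cohen_kappa_score(h1, machine, weights="quadratic"))
```

Note that, unlike QWK, this PRMSE estimate is defined relative to the true score rather than to a single human rating, which is why it requires an estimate of rater error variance from double-scored responses.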
