
Automated Essay Scoring at Scale: A Case Study in Switzerland and Germany

Author(s):
Rupp, Andre A.; Casabianca, Jodi; Kruger, Maleika; Keller, Stefan; Koller, Olaf
Publication Year:
2019
Report Number:
RR-19-12, TOEFL-RR-86
Source:
ETS Research Report
Document Type:
Report
Page Count:
23
Subject/Key Words:
Test of English as a Foreign Language (TOEFL), Automated Essay Scoring (AES), Human Scoring, e-rater, Germany, Switzerland, Generic Scoring Model, Prompt-Specific Scoring Model, Essay Prompts, Proportional Reduction in Mean-Squared Error, Quadratic Weighted Kappa (QWK), Scoring Reliability, Writing Tasks, Writing Assessment, High School Students, Secondary Education

Abstract

In this research report, we describe the design and empirical findings for a large‐scale study of essay writing ability with approximately 2,500 high school students in Germany and Switzerland on the basis of 2 tasks with 2 associated prompts each, drawn from a standardized writing assessment whose scoring involved both human and automated components. For the human scoring aspect, we describe the methodology for training and monitoring human raters as well as for collecting their ratings within a customized platform. For the automated scoring aspect, we describe the methodology for training, evaluating, and selecting appropriate automated scoring models as well as correlational patterns of resulting task scores with scores from secondary measures. Analyses show that the human ratings were highly reliable and that effective prompt‐specific automated scoring models could be built with state‐of‐the‐art features and machine learning methods, which resulted in correlational patterns with secondary measures that were in line with general expectations. In closing, we discuss the methodological implications for conducting this kind of work at scale in the future.
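
The key words list quadratic weighted kappa (QWK) among the statistics associated with evaluating scoring models. As a rough illustration of what that statistic measures, the following is a minimal sketch of the standard textbook definition of QWK for two sets of integer scores. It is not taken from the report: the function name, the score range, and the sample scores are hypothetical and chosen only for illustration.

```python
def quadratic_weighted_kappa(scores_a, scores_b, min_score, max_score):
    """Textbook quadratic weighted kappa for two lists of integer scores.

    Hypothetical helper for illustration; not the report's implementation.
    """
    n_cat = max_score - min_score + 1
    if n_cat < 2:
        raise ValueError("need at least two score categories")
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    for s in list(scores_a) + list(scores_b):
        if not min_score <= s <= max_score:
            raise ValueError(f"score {s} is outside the declared range")

    n = len(scores_a)

    # Observed confusion matrix: rows index scores_a, columns index scores_b.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(scores_a, scores_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal score counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            # Disagreement weight grows with the squared score distance.
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all scores fall in one category")
    return 1.0 - weighted_observed / weighted_expected


# Made-up scores on a 1-5 scale, for demonstration only.
human = [1, 2, 3, 4, 5]
machine = [1, 2, 3, 4, 5]
print(quadratic_weighted_kappa(human, machine, 1, 5))
```

Values near 1 indicate close agreement between the two sets of scores, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.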
