
Evaluation of the e-rater Scoring Engine for the TOEFL Independent and Integrated Prompts

Authors: Ramineni, Chaitanya; Trapani, Catherine S.; Williamson, David M.; Davey, Tim; Bridgeman, Brent
Document Type: ETS Research Report
Subject/Key Words: Automated Essay Scoring (AES); Automated Scoring and Natural Language Processing; Automated Scoring Models; Electronic Essay Rater (E-rater); Natural Language Processing (NLP); NLP-Related Measurement Research; Test of English as a Foreign Language (TOEFL); Writing


Scoring models for the e-rater system were built and evaluated for the independent and integrated writing prompts of the TOEFL exam. Both prompt-specific and generic scoring models were built, and evaluation statistics, such as weighted kappas, Pearson correlations, standardized differences in mean scores, and correlations with external measures, were examined to evaluate e-rater model performance against human scores. Performance was also evaluated across demographic subgroups. Additional analyses were performed to establish appropriate human–e-rater agreement thresholds for flagging unusual essays and to assess the impact of using e-rater on operational scores. Generic e-rater scoring models were recommended for operational use for both independent and integrated writing tasks. The two automated scoring models were recommended for operational use to produce contributory scores within a discrepancy threshold of 1.5 and 1.0 of a human score for the independent and integrated prompts, respectively.
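The evaluation statistics named above can be illustrated with a short sketch. The function and variable names below (`quadratic_weighted_kappa`, `standardized_mean_difference`, `needs_adjudication`, and the toy score vectors) are illustrative assumptions, not the report's actual code; in particular, quadratic weighting is assumed here as one common choice of weighted kappa, and the report's own weighting scheme may differ.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Weighted kappa between two integer rating vectors,
    using quadratic disagreement weights (an assumed choice)."""
    a, b = np.asarray(a), np.asarray(b)
    n = max_rating - min_rating + 1
    # Observed joint distribution of (human, machine) ratings
    observed = np.zeros((n, n))
    for x, y in zip(a, b):
        observed[x - min_rating, y - min_rating] += 1
    observed /= observed.sum()
    # Expected joint distribution under independent marginals
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic penalty grows with the squared rating distance
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

def standardized_mean_difference(human, machine):
    """Mean-score difference divided by the pooled standard deviation."""
    human = np.asarray(human, dtype=float)
    machine = np.asarray(machine, dtype=float)
    pooled_sd = np.sqrt((human.var(ddof=1) + machine.var(ddof=1)) / 2)
    return (machine.mean() - human.mean()) / pooled_sd

def needs_adjudication(human_score, machine_score, threshold):
    """Flag an essay when the human/e-rater discrepancy exceeds the
    threshold (e.g. 1.5 for independent, 1.0 for integrated prompts)."""
    return abs(human_score - machine_score) > threshold
```

For example, a pair of identical rating vectors yields a kappa of 1.0 and a standardized mean difference of 0, while `needs_adjudication(5, 3, 1.5)` returns `True`, routing that essay for additional human review.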
