Comparing the Validity of Automated and Human Essay Scoring
- Powers, Donald E.; Burstein, Jill C.; Chodorow, Martin; Fowles, Mary E.; Kukich, Karen
- Subject/Key Words: writing assessment; writing skills; Graduate Record Examinations (GRE); validity; automated scoring; essay scoring
This study sought to provide further evidence of the validity of automated, or computer-based, scoring of complex performance assessments, such as direct tests of writing skill that require examinees to construct responses rather than select them from multiple-choice options. Although several studies have examined agreement between human raters and automated scoring systems, only a few have related automated scores to other, independent indicators of writing skill. This study examined the relationships of two sets of Graduate Record Examinations® (GRE®) Writing Assessment scores, those assigned by human raters and those generated by e-rater (the system being researched for possible application in a variety of assessments that require natural language responses), to several independent, nontest indicators of writing skill, such as academic, outside, and perceived success with writing. Analyses revealed significant but modest correlations between the nontest indicators and each of the two scoring methods. That automated and human scores exhibited reasonably similar relations with the nontest indicators was taken as evidence that the two methods of scoring reflect similar aspects of writing proficiency. These relations were, however, somewhat weaker for automated scores than for scores awarded by humans.
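The validation logic described above can be sketched in code. The snippet below is an illustrative example only, using hypothetical scores rather than the study's actual data or analysis software: it computes the Pearson correlation of human-assigned and e-rater-assigned essay scores with a single nontest indicator, so the two correlations can be compared as the study did.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for six examinees on a 1-6 essay scale, plus a
# hypothetical nontest indicator (e.g., self-reported success with writing).
human   = [3, 4, 5, 2, 6, 4]
erater  = [3, 4, 4, 2, 5, 5]
nontest = [2.5, 3.0, 4.5, 2.0, 5.0, 3.5]

r_human  = pearson_r(human, nontest)
r_erater = pearson_r(erater, nontest)
print(f"human  vs nontest: r = {r_human:.2f}")
print(f"e-rater vs nontest: r = {r_erater:.2f}")
```

Under this kind of analysis, similar correlation magnitudes for the two scoring methods would be read, as in the study, as evidence that they tap similar aspects of writing proficiency.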