Comparison of Human and Machine Scoring of Essays: Differences by Gender, Ethnicity, and Country
- Author(s): Bridgeman, Brent; Trapani, Catherine S.; Attali, Yigal
- Publication Year: 2014
- Source: Wendler, Cathy; Bridgeman, Brent (Eds.), with assistance from Chelsea Ezzo. The Research Foundation for the GRE revised General Test: A Compendium of Studies. Princeton, NJ: Educational Testing Service, 2014, pp. 4.8.1-4.8.3
- Document Type: Chapter
- Page Count: 3
- Subject/Key Words: Group Identity, Graduate Record Examination (GRE), Revised GRE, Cultural Pluralism, Ethnic Groups, Standard Deviations, Multiculturalism, Automated Scoring and Natural Language Processing, NLP-Related Measurement Research
Abstract
Describes a study that investigated differences between the scores produced by human raters and by e-rater across gender, ethnicity, and country of origin. For most groups studied, the average scores produced by e-rater and by human raters were nearly identical. A notable exception was essays from mainland China, which received substantially higher ratings from e-rater than from human raters. This finding was especially puzzling because no such differences were observed for other Asian countries, including Taiwan, which shares a language with mainland China. The study provided additional support for using a check-score model for the GRE Analytical Writing measure rather than averaging the human and e-rater scores.