Comparison of Human and Machine Scoring of Essays: Differences by Gender, Ethnicity, and Country

Author(s):
Bridgeman, Brent; Trapani, Catherine S.; Attali, Yigal
Publication Year:
2014
Source:
Wendler, Cathy; Bridgeman, Brent (eds.) with assistance from Chelsea Ezzo. The Research Foundation for the GRE revised General Test: A Compendium of Studies. Princeton, NJ: Educational Testing Service, 2014, pp. 4.8.1-4.8.3
Document Type:
Chapter
Page Count:
3
Subject/Key Words:
Group Identity, Graduate Record Examination (GRE), Revised GRE, Cultural Pluralism, Ethnic Groups, Standard Deviations, Multiculturalism, Automated Scoring and Natural Language Processing, NLP-Related Measurement Research

Abstract

Describes a study that investigated differences in scores produced by human raters and by e-rater across gender, ethnicity, and country of origin. For most groups studied, the average scores produced by e-rater and by human raters were nearly identical. A notable exception was essays from mainland China, which received much higher ratings from e-rater than from human raters. This finding was especially puzzling because similar differences were not observed in other Asian countries, not even in Taiwan, which shares the same language as mainland China. The study provided additional support for using a check-score model for the GRE Analytical Writing measure rather than averaging the human and e-rater scores.
