Automated scoring engines extract features from examinee responses that represent important aspects of performance on constructed-response tasks. ETS conducts research into methods for estimating scores based on these features, while ensuring that psychometric standards are maintained. Examples of such research include:
- Researching different methods of calibrating response features to represent the important constructs for an assessment
- Calibrating and evaluating different statistical and/or heuristic models for each scoring engine (a simplified calibration sketch follows this list)
- Conducting research into establishing and enhancing the reliability and validity of automated scores
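As a rough illustration of what such feature calibration involves (a generic sketch, not a description of any ETS engine's actual method), the snippet below fits a simple linear model that maps response features to human holistic scores; the feature names and data are invented for the example.

```python
# Minimal sketch: calibrating a feature-based scoring model (illustrative only;
# the feature names and data are hypothetical, not from any ETS engine).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical extracted features for five responses:
# [grammar_errors_per_100_words, avg_word_length, essay_length_in_words]
features = np.array([
    [2.1, 4.8, 310],
    [0.9, 5.2, 450],
    [3.5, 4.3, 220],
    [1.4, 5.0, 390],
    [0.5, 5.6, 520],
])
human_scores = np.array([3, 5, 2, 4, 6])  # holistic scores from trained raters

# Fit a linear calibration model that predicts human scores from the features.
model = LinearRegression().fit(features, human_scores)
print("feature weights:", model.coef_)
print("predicted scores:", model.predict(features).round(2))
```

Operational models are typically trained on much larger samples of human-scored responses and evaluated on held-out data before any automated scores are reported.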
ETS has conducted extensive research to establish best practices for applying automated scoring engines, and this work allows ETS to apply those practices and processes to operational assessments with confidence.
The automated scoring engines calibrated and evaluated in this research include the e-rater®, SpeechRater®, c-rater™ and m-rater engines, as well as hybrid systems designed to address the needs of specific scoring programs. This is an applied field of research, since it aims to evaluate automated scoring engines in operational practice for both low- and high-stakes assessments.
Topics on the ETS Measurement research agenda, both now and going forward, include:
- Developing and evaluating new methods for training automated scoring models, including rule-based, machine learning and statistical methods
- Evaluating the quality of automated scores and automated scoring processes, including the effects of different training methods, gaming strategies and criterion measures (a sketch of common agreement statistics follows this list)
- Establishing and enhancing the reliability and validity of automated scores and scoring processes, including exploring external validity criteria, fairness investigations, evaluation criteria for smaller samples and sampling variations for training sets
- Evaluating automated scoring for new low- and high-stakes assessments and applications, including graduate admissions, placement and screening
- Investigating the vulnerability of scoring engines to construct-irrelevant response strategies that aim to artificially inflate scores, and enhancing scoring engines so that they are resistant to such strategies
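To make the evaluation criteria concrete, the sketch below computes agreement statistics commonly reported when comparing automated and human scores, such as quadratic weighted kappa, correlation, and exact agreement. The paired scores are hypothetical, and the snippet is an illustration rather than a description of any ETS procedure.

```python
# Minimal sketch: common human-machine agreement statistics used when evaluating
# automated scores (illustrative only; the score data are hypothetical).
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 4, 2, 5, 4, 3, 6, 2, 4, 5])    # scores from human raters
machine = np.array([3, 4, 3, 5, 4, 2, 5, 2, 4, 5])  # scores from an automated engine

# Quadratic weighted kappa: chance-corrected agreement that penalizes
# large score discrepancies more heavily than adjacent ones.
qwk = cohen_kappa_score(human, machine, weights="quadratic")

# Pearson correlation and exact-agreement rate as complementary views.
r, _ = pearsonr(human, machine)
exact_agreement = np.mean(human == machine)

print(f"quadratic weighted kappa: {qwk:.3f}")
print(f"pearson r: {r:.3f}")
print(f"exact agreement: {exact_agreement:.2%}")
```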
Featured Publications
2015
- Reliability-Based Feature Weighting for Automated Essay Scoring
  Y. Attali
  Applied Psychological Measurement, Vol. 39, No. 4, pp. 303–313
  This article presents alternatives to weighting schemes based on predicting human scores in automated essay scoring.
- Evaluation of e-rater® for the Praxis I® Writing Test
  C. Ramineni, C. S. Trapani, & D. M. Williamson
  ETS Research Report No. RR-15-03
  This report presents a study of automated scoring models for the essay task on the Praxis I writing test. Several techniques were used to compare the performance of the e-rater model with human scores, and the scoring model was also evaluated across various demographic subgroups.
- Validating Automated Essay Scoring: A (Modest) Refinement of the “Gold Standard”
  D. E. Powers, D. S. Escoffery, & M. P. Duchnowski
  Applied Measurement in Education, Vol. 28, No. 2, pp. 130–142
  The authors discuss a way to improve methods of validating automated essay scores that rely on comparisons with scores assigned by human raters.
- Evaluating the Detection of Aberrant Responses in Automated Essay Scoring
  M. Zhang, J. Chen, & C. Ruan
  In L. A. van der Ark, D. M. Bolt, W.-C. Wang, J. A. Douglas, & S.-M. Chow (Eds.), Quantitative Psychology Research: The 79th Annual Meeting of the Psychometric Society (Springer Proceedings in Mathematics & Statistics, Vol. 140, pp. 191–208). Springer.
  In this chapter, the authors consider how automated scoring systems can identify aberrant responses, that is, responses with characteristics that the scoring system cannot process.
2012
- Comparison of Human and Machine Scoring of Essays: Differences by Gender, Ethnicity, and Country
  B. Bridgeman, P. Trapani, & Y. Attali
  Applied Measurement in Education, Vol. 25, pp. 27–40
  This article describes a study that used essay data from two large-scale testing programs to explore possible differences between gender, ethnic or country groups when comparing human and machine scores.
- Evaluation of e-rater® for the GRE® Issue and Argument Prompts
  C. Ramineni, C. Trapani, D. M. Williamson, T. Davey, & B. Bridgeman
  ETS Research Report No. RR-12-02
  For this study, researchers built automated scoring models for the e-rater® scoring engine and evaluated them on argument and issue writing tasks for the GRE® General Test. This report describes the models that were selected for operational use.
- Evaluation of the e-rater® Scoring Engine for the TOEFL® Independent and Integrated Prompts
  C. Ramineni, C. Trapani, D. M. Williamson, T. Davey, & B. Bridgeman
  ETS Research Report No. RR-12-06
  In this study, researchers built scoring models for the e-rater scoring engine and evaluated them for use on the TOEFL® test's independent and integrated writing prompts. This report describes the models that were recommended for operational use.
- A Framework for Evaluation and Use of Automated Scoring
  D. M. Williamson, X. Xi, & F. J. Breyer
  Educational Measurement: Issues and Practice, Vol. 31, No. 1, pp. 2–13
  This article provides a framework for evaluating and using automated scoring for constructed-response tasks.
2011
- A Validity-Based Approach to Quality Control and Assurance of Automated Scoring
  I. Bejar
  Assessment in Education: Principles, Policy & Practice, Vol. 18, No. 3, pp. 319–341
  In this article, the author proposes arguments for viewing validity and quality control in automated scoring as part of the same process of application design.
Find More Articles
View more research publications on NLP-related measurement.