NLP-related Measurement Research
Automated scoring engines extract features from examinee responses that represent important aspects of performance on constructed-response tasks. ETS conducts research into methods for estimating scores based on these features, while ensuring that psychometric standards are maintained. Examples of such research include:
- Researching different methods of calibrating response features to represent the important constructs for an assessment
- Calibrating and evaluating different statistical and/or heuristic models for each scoring engine (a simplified calibration sketch follows this list)
- Conducting research into establishing and enhancing the reliability and validity of automated scores
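As a purely illustrative sketch (not an ETS model), the snippet below shows one simple way such a calibration can work: extracted response features are regressed against human scores, and the fitted weights are then used to score a new response. The feature names, data and score scale are invented.

```python
# Illustrative sketch only: the features, data and regression approach are
# hypothetical simplifications, not an ETS scoring model.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features extracted from each response by a scoring engine.
feature_names = ["grammar", "usage", "mechanics", "organization", "development"]
X = np.array([
    [0.82, 0.74, 0.91, 0.63, 0.55],
    [0.41, 0.52, 0.60, 0.35, 0.38],
    [0.93, 0.85, 0.88, 0.81, 0.72],
    [0.25, 0.31, 0.44, 0.22, 0.30],
    [0.67, 0.59, 0.71, 0.58, 0.49],
    [0.55, 0.48, 0.66, 0.47, 0.41],
])
human_scores = np.array([5, 3, 6, 2, 4, 3])  # human ratings on a 1-6 scale

# "Calibration" here means estimating feature weights that best reproduce
# the human scores on a training sample.
model = LinearRegression().fit(X, human_scores)

# Score a new response from its extracted features, clipped to the score scale.
new_response = np.array([[0.60, 0.62, 0.70, 0.50, 0.52]])
predicted = float(np.clip(model.predict(new_response)[0], 1, 6))
print(dict(zip(feature_names, np.round(model.coef_, 2))), round(predicted, 2))
```

In practice the model family (regression, rule-based or machine-learned), the feature set and the criterion scores all vary by program, which is exactly what the calibration research above investigates.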
ETS has conducted extensive research to establish best practices for applying automated scoring engines, which allows ETS to bring those practices to operational assessments with confidence.
The automated scoring engines calibrated and evaluated in this research include the e-rater®, SpeechRater℠, c-rater™ and m-rater engines, as well as hybrid systems designed to address the needs of specific scoring programs. This is an applied field of research, since it aims to evaluate automated scoring engines in operational practice for both low- and high-stakes assessments.
Topics on the ETS Measurement research agenda, both current and planned, include:
- Developing and evaluating new methods for training automated scoring models, including rule-based, machine learning and statistical methods
- Evaluating the quality of automated scores and automated scoring processes, including the effects of different training methods, gaming strategies and criterion measures (typical agreement statistics are illustrated after this list)
- Establishing and enhancing reliability and validity of automated scores and scoring processes, including exploring external validity criteria, fairness investigations, evaluation criteria for smaller samples and sampling variations for training sets
- Evaluating automated scoring for new low- and high-stakes assessments and applications, including graduate admissions, placement and screening
- Investigating the vulnerability of scoring engines to construct-irrelevant response strategies that aim to artificially inflate scores, and enhancing the engines' robustness to such strategies
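As a simplified illustration of the evaluation work above, the snippet below computes three agreement statistics commonly reported when comparing automated scores with human scores. The score vectors are invented, and the statistics are typical of the field rather than a prescribed ETS procedure.

```python
# Illustrative sketch only: the score vectors are invented and these are
# common agreement measures, not a prescribed ETS evaluation protocol.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human   = np.array([4, 3, 5, 2, 4, 3, 5, 4, 2, 3])  # hypothetical human scores
machine = np.array([4, 3, 4, 2, 5, 3, 5, 4, 3, 3])  # hypothetical automated scores

qwk = cohen_kappa_score(human, machine, weights="quadratic")  # quadratic weighted kappa
r, _ = pearsonr(human, machine)                               # human-machine correlation
smd = (machine.mean() - human.mean()) / np.sqrt(
    (machine.var(ddof=1) + human.var(ddof=1)) / 2)            # standardized mean difference

print(f"QWK = {qwk:.2f}, r = {r:.2f}, SMD = {smd:.2f}")
```

Agreement statistics of this kind, together with subgroup and external-validity analyses, are among the kinds of criteria discussed in the evaluation framework listed under Featured Publications below.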
Featured Publications
2012
- Comparison of Human and Machine Scoring of Essays: Differences by Gender, Ethnicity, and Country
  B. Bridgeman, C. Trapani, & Y. Attali
  Applied Measurement in Education, Vol. 25, pp. 27–40
  This article describes a study that used essay data from two large-scale testing programs to explore possible differences between human and machine scores across gender, ethnic and country groups.
- Evaluation of e-rater® for the GRE® Issue and Argument Prompts
  C. Ramineni, C. Trapani, D. M. Williamson, T. Davey, & B. Bridgeman
  ETS Research Report No. RR-12-02
  For this study, researchers built automated scoring models for the e-rater® scoring engine and evaluated them on the issue and argument writing tasks of the GRE® General Test. This report describes the models that were selected for operational use.
- Evaluation of the e-rater® Scoring Engine for the TOEFL® Independent and Integrated Prompts
  C. Ramineni, C. Trapani, D. M. Williamson, T. Davey, & B. Bridgeman
  ETS Research Report No. RR-12-06
  In this study, researchers built scoring models for the e-rater® scoring engine and evaluated them for use on the TOEFL® test's independent and integrated writing prompts. This report describes the models that were recommended for operational use.
- A Framework for Evaluation and Use of Automated Scoring
  D. M. Williamson, X. Xi, & F. J. Breyer
  Educational Measurement: Issues and Practice, Vol. 31, No. 1, pp. 2–13
  This article provides a framework for evaluating and using automated scoring for constructed-response tasks.
2011
- A Validity-Based Approach to Quality Control and Assurance of Automated Scoring
  I. Bejar
  Assessment in Education: Principles, Policy & Practice, Vol. 18, No. 3, pp. 319–341
  In this article, the author proposes arguments for viewing validity and quality control in automated scoring as part of the same process of application design.