
NLP-related Measurement Research

Automated scoring engines extract features from examinee responses that represent important aspects of performance on constructed-response tasks. ETS conducts research into methods for estimating scores based on these features, while ensuring that psychometric standards are maintained. Examples of such research include:

  • Researching different methods of calibrating response features to represent the important constructs for an assessment
  • Calibrating and evaluating different statistical and/or heuristic models for each scoring engine
  • Conducting research into establishing and enhancing the reliability and validity of automated scores
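As a purely illustrative sketch of the calibration step described above, the snippet below fits per-feature weights against human scores by ordinary least squares. The feature set and data are invented for the example, and operational scoring models are considerably more sophisticated than this.

```python
# Hypothetical sketch: calibrate a scoring model that maps extracted
# response features to the score scale by least-squares fit against
# human scores. Features and data here are invented for illustration.

def fit_weights(features, human_scores):
    """Fit per-feature weights plus an intercept by ordinary least
    squares, solving the normal equations with Gaussian elimination."""
    n = len(features)
    k = len(features[0])
    # Design matrix with a leading 1 for the intercept term.
    X = [[1.0] + list(row) for row in features]
    m = k + 1
    # Build X^T X and X^T y.
    xtx = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(m)]
           for i in range(m)]
    xty = [sum(X[r][i] * human_scores[r] for r in range(n)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for r in range(col + 1, m):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, m):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back substitution.
    weights = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = xty[r] - sum(xtx[r][c] * weights[c] for c in range(r + 1, m))
        weights[r] = s / xtx[r][r]
    return weights  # [intercept, w1, ..., wk]

def predict(weights, feature_row):
    """Score a new response from its feature values."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], feature_row))
```

In practice, calibration must also handle feature selection, score-scale constraints, and subgroup fairness checks, which a plain least-squares fit does not address.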

ETS has conducted extensive research to establish best practices in the application of automated scoring engines. This research allows ETS to confidently apply these standards and processes to operational assessments.

The automated scoring engines calibrated and evaluated in this research include the e-rater®, SpeechRater®, c-rater and m-rater engines, as well as hybrid systems designed to address the needs of specific scoring programs. This is an applied field of research, since it aims to evaluate automated scoring engines in operational practice for both low- and high-stakes assessments.
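A standard agreement statistic in this literature is quadratically weighted kappa, which compares automated scores with human scores while penalizing large discrepancies more heavily than small ones. The sketch below is a minimal, self-contained implementation for illustration only, not ETS's operational evaluation procedure.

```python
# Illustrative implementation of quadratically weighted kappa, a common
# measure of human-machine score agreement on an integer score scale.

def quadratic_weighted_kappa(human, machine, min_score, max_score):
    cats = max_score - min_score + 1
    n = len(human)
    # Observed joint score distribution.
    observed = [[0.0] * cats for _ in range(cats)]
    for h, m in zip(human, machine):
        observed[h - min_score][m - min_score] += 1
    # Expected distribution under independence of the two marginals.
    h_marg = [sum(row) for row in observed]
    m_marg = [sum(observed[i][j] for i in range(cats)) for j in range(cats)]
    expected = [[h_marg[i] * m_marg[j] / n for j in range(cats)]
                for i in range(cats)]
    # Quadratic disagreement weights: zero on the diagonal, growing with
    # the squared distance between the two scores.
    w = [[(i - j) ** 2 / (cats - 1) ** 2 for j in range(cats)]
         for i in range(cats)]
    num = sum(w[i][j] * observed[i][j]
              for i in range(cats) for j in range(cats))
    den = sum(w[i][j] * expected[i][j]
              for i in range(cats) for j in range(cats))
    return 1.0 - num / den
```

A value of 1.0 indicates perfect agreement, 0.0 chance-level agreement, and negative values systematic disagreement; operational evaluations typically supplement kappa with exact-agreement rates, correlations, and subgroup comparisons.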

Topics on the ETS measurement research agenda, both present and planned, include:

  • Developing and evaluating new methods for training automated scoring models, including rule-based, machine learning and statistical methods
  • Evaluating the quality of automated scores and automated scoring processes, including the effects of different training methods, gaming strategies and criterion measures
  • Establishing and enhancing the reliability and validity of automated scores and scoring processes, including exploring external validity criteria, fairness investigations, evaluation criteria for smaller samples and sampling variations for training sets
  • Evaluating automated scoring for new low- and high-stakes assessments and applications, including graduate admissions, placement and screening
  • Investigating the vulnerability of scoring engines to construct-irrelevant response strategies that aim to artificially inflate scores, and making the engines more robust to such strategies
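To illustrate the last point, the hypothetical check below flags one crude construct-irrelevant strategy, heavy word repetition, using a type-token ratio. The method and threshold are invented for this sketch and bear no relation to any operational detector.

```python
# Hypothetical advisory flag for one construct-irrelevant response
# strategy: padding a response by repeating the same words. The
# type-token ratio threshold below is an invented illustration.

def repetition_flag(text, min_words=30, ttr_threshold=0.3):
    """Flag a response whose type-token ratio (distinct words divided
    by total words) falls below a threshold, a crude repetition signal."""
    words = text.lower().split()
    if len(words) < min_words:
        return False  # too short to judge reliably
    ttr = len(set(words)) / len(words)
    return ttr < ttr_threshold
```

Operational gaming detection combines many such signals (off-topic detection, unusual length, copied prompt text) and routes flagged responses to human review rather than scoring them automatically.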

Featured Publications




Find More Articles

View more research publications on NLP-related measurement.