
Automated Scoring of Written Content

Many assessments are designed to measure test takers’ English writing skills — for example, whether they can organize and develop an argument and write fluently with no grammatical errors. However, in some scenarios, assessments use open-ended tasks not to measure the quality of writing, but rather as a way of gathering evidence of what test takers know, have learned or can do in a specific subject area.

Multiple-choice questions can provide one way to assess understanding of content, but they may not always provide the most complete picture of a test taker's knowledge. This is because multiple-choice questions measure, in part, a test taker's ability to recognize and select an answer from a list of options. It is hard to say with certainty whether the test taker would have produced the correct answer if it had not been included as one of the multiple-choice options.

Allowing test takers to instead write a free response can provide a more complete assessment of their understanding. The choice to use multiple-choice questions is usually a practical decision motivated by the time and costs associated with having humans grade open-ended responses. Controlling for these factors, open-ended written answers are often preferable to multiple-choice responses when it comes to assessing content knowledge.

At ETS, we have been conducting significant research on accurately scoring the content of written responses for more than a decade. In that time, our approach has evolved. We previously used natural language processing techniques to assess whether a given response contained text that corresponds to the concepts listed in the rubric for an item, or test question.
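That earlier, rubric-driven approach can be pictured roughly as follows. This is an illustrative sketch only, not ETS's system: the item, concepts and patterns are invented, and a real scorer would use much richer NLP than keyword matching.

```python
# Sketch of rubric-based concept detection (hypothetical item and rubric):
# each key concept from the rubric maps to hand-written patterns that a
# correct response is expected to contain.
import re

rubric = {
    "light_energy": [r"\blight\b", r"\bsunlight\b"],
    "carbon_dioxide": [r"\bcarbon dioxide\b", r"\bco2\b"],
    "produces_glucose": [r"\bglucose\b", r"\bsugar\b"],
}

def concepts_found(response, rubric):
    """Return the set of rubric concepts mentioned in the response."""
    text = response.lower()
    return {concept for concept, patterns in rubric.items()
            if any(re.search(p, text) for p in patterns)}

print(sorted(concepts_found("Plants use sunlight and CO2 to make sugar.", rubric)))
# → ['carbon_dioxide', 'light_energy', 'produces_glucose']
```

The human effort in this design lies in writing and maintaining the patterns for every concept of every item, which is what motivated the shift described next.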

This approach requires significant human effort to describe the key concepts that the automated scoring system should find in a correct response to each item. Our more recent approaches use machine learning techniques, which do not require someone to manually enter all possible correct responses into the system. Instead, they simply require an appropriate set of responses to the item that have already been holistically scored by trained raters.
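As a rough illustration of learning from already-scored responses, the toy model below predicts a score for a new response from the human-assigned scores of the most similar previously scored responses. It is a minimal nearest-neighbor sketch with an invented item and data, not ETS's actual machine learning models, which use far richer features.

```python
# Sketch: score a new response from responses human raters already scored
# holistically (a simple nearest-neighbor stand-in for a trained model).
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words counts; a real system would use richer NLP features."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical training set: responses with human-assigned holistic scores.
scored_responses = [
    ("plants use sunlight water and carbon dioxide to make glucose", 3),
    ("photosynthesis turns light energy into chemical energy", 3),
    ("plants eat dirt to get food", 1),
    ("the sun is hot", 0),
]

def predict_score(response, k=2):
    """Average the scores of the k most similar already-scored responses."""
    vec = vectorize(response)
    sims = sorted(((cosine(vec, vectorize(r)), s) for r, s in scored_responses),
                  reverse=True)
    return sum(s for _, s in sims[:k]) / k

print(predict_score("plants make glucose from sunlight and carbon dioxide"))
# → 2.0
```

The key property this sketch shares with the real approach is that no one enumerates correct answers in advance; the scored examples themselves carry that information.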

This approach represents the state of the art in computational linguistics and related fields, and it draws from extensive research that ETS has conducted on automated content scoring. Prototype systems using this approach have demonstrated excellent performance in public competitions and shared tasks (e.g., the Automated Student Assessment Prize [ASAP] in 2012 sponsored by the Hewlett Foundation, and the Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge in 2013; Heilman & Madnani, 2013).

It has also been validated on responses from different content areas, including science, reading comprehension and mathematics.

In addition to assessing whether test takers understand a concept, content scoring may be used to evaluate whether a writer has successfully used source material — for example, test questions that require students to read one or more passages and include relevant information from these sources in an effective response.

ETS has also designed an algorithm that can quantify the extent to which test takers make appropriate use of the given sources in their responses. For example, it can quantify not only how much of the information from a specific source was used in the response but also the importance of that information (Beigman Klebanov et al., 2014).
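One way to picture such a measure is sketched below. This is a minimal illustration under assumed importance weights, not the algorithm of Beigman Klebanov et al.: the source vocabulary and weights are invented, and a real system would derive importance from the passage itself.

```python
# Sketch: measure how much of a source passage's information a response
# uses, weighting each source word by a (hypothetical) importance score.

def source_use_score(response, source_words):
    """Fraction of total source importance covered by the response.

    source_words maps content words from the source passage to importance
    weights (assigned by hand here for illustration).
    """
    used = set(response.lower().split())
    total = sum(source_words.values())
    covered = sum(w for word, w in source_words.items() if word in used)
    return covered / total if total else 0.0

# Hypothetical source-passage vocabulary with importance weights.
source_words = {"erosion": 3.0, "sediment": 2.0, "river": 1.0, "water": 0.5}

print(round(source_use_score("The river carries sediment downstream", source_words), 3))
# → 0.462
```

A response that picks up the most important source material scores higher than one that reuses only peripheral details, matching the intuition that both the amount and the importance of borrowed information matter.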

Our current research also focuses on extending the range of applications for automated content assessment. For example, we are investigating how best to use automated systems to provide feedback — for example, on content knowledge or use of sources — in the classroom or in online classes.

Featured Publications

Below are some recent or significant publications that our researchers have authored on the subject of automated scoring of written content.

  • Towards Effective Tutorial Feedback for Explanation Questions: A Dataset and Baselines
    M. O. Dzikovska, R. D. Nielsen, & C. Brew
    Paper in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 200–210

    The authors propose a new shared task for grading student answers, where the goal is to enable targeted and flexible feedback in the form of a tutorial dialogue. They suggest that this corpus will be of interest to researchers working in textual entailment and will stimulate new developments both in natural language processing for tutorial dialogue systems and in textual entailment, contradiction detection and other techniques of interest for a variety of computational linguistics tasks.

  • Measuring the Use of Factual Information in Test-Taker Essays
    B. Beigman Klebanov & D. Higgins
    Paper in Proceedings of the 7th Workshop on the Innovative Use of NLP for Building Educational Applications, pp. 63–72

    The authors studied how to measure the use of factual information in test-taker essays and how to assess its effectiveness when predicting essay scores. The article also discusses implications for the development of automated essay scoring systems.
