Many assessments are designed to measure test takers’ English writing skills — for example, whether they can organize and develop an argument and write fluently with no grammatical errors. However, in some scenarios, assessments use open-ended tasks not to measure the quality of writing, but rather as a way of gathering evidence of what test takers know, have learned or can do in a specific subject area.
Multiple-choice questions are one way to assess understanding of content, but they may not always provide the most complete picture of a test taker’s knowledge. This is because multiple-choice questions measure, in part, a test taker’s ability to recognize and select an answer from a list of options. It is hard to say with certainty whether the test taker would have produced the correct answer if it had not been included as one of the multiple-choice options.
Allowing test takers to write a free response instead can provide a more complete assessment of their understanding. The choice to use multiple-choice questions is usually a practical one, driven by the time and cost of having humans grade open-ended responses. Setting those practical constraints aside, open-ended written answers are often preferable to multiple-choice responses for assessing content knowledge.
At ETS, we have been conducting significant research on accurately scoring the content of written responses for more than a decade. In that time, our approach has evolved. We previously used natural language processing techniques to assess whether a given response contained text that corresponded to the concepts listed in the rubric for an item, or test question.
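As a rough illustration of that earlier style of concept matching, the sketch below hand-lists rubric concepts and simple text patterns for a single invented science item. The item, the patterns and the one-point-per-concept rule are assumptions made only for illustration; they are not an ETS rubric or system.

```python
import re

# Hypothetical rubric for a single invented science item: each key concept
# is paired with hand-written patterns that signal its presence. In a real
# concept-based system these inventories are far richer (paraphrases,
# synonyms, parse-based rules), which is where the human effort goes.
RUBRIC_CONCEPTS = {
    "evaporation": [r"\bevaporat\w*", r"\bturns? into (water )?vapor\b"],
    "condensation": [r"\bcondens\w*", r"\bvapor\b.*\bcools?\b"],
}


def matched_concepts(response: str) -> set:
    """Return the rubric concepts whose patterns appear in the response."""
    text = response.lower()
    return {
        concept
        for concept, patterns in RUBRIC_CONCEPTS.items()
        if any(re.search(pattern, text) for pattern in patterns)
    }


def score(response: str) -> int:
    """Award one point per key concept found (a deliberately simple rule)."""
    return len(matched_concepts(response))


print(score("The water evaporates, then the vapor cools and condenses."))  # -> 2
```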
Describing the key concepts that the automated scoring system should find in a correct response to each item requires significant human effort. Our more recent approaches use machine learning techniques, which do not require someone to manually enter all possible correct responses into the system. Instead, they require only an appropriate set of responses to the item that have already been holistically scored by trained raters.
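A minimal sketch of what such a supervised setup can look like is shown below. It assumes scikit-learn and a tiny invented set of human-scored responses; it is a stand-in for, not a description of, ETS's production scoring models, which use far more data and richer features.

```python
# A tiny, invented example of training a content-scoring model directly from
# responses that trained raters have already scored holistically.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# (response text, human-assigned score) pairs -- invented for illustration.
scored_responses = [
    ("Water evaporates, the vapor cools, and it condenses into clouds.", 2),
    ("The water goes up into the sky and makes clouds.", 1),
    ("I am not sure what happens to the water.", 0),
]
texts = [text for text, _ in scored_responses]
scores = [label for _, label in scored_responses]

# Word n-gram features stand in for the richer linguistic features a real
# system would extract; the regressor learns the mapping to human scores
# instead of matching hand-listed concepts.
model = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    Ridge(alpha=1.0),
)
model.fit(texts, scores)

# Predict a score for a new, unseen response.
print(model.predict(["The vapor condenses and forms clouds."]))
```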
This machine learning approach represents the state of the art in computational linguistics and related fields, and it draws on extensive research that ETS has conducted on automated content scoring. Prototype systems using this approach have demonstrated excellent performance in public competitions and shared tasks, such as the Automated Student Assessment Prize (ASAP) sponsored by the Hewlett Foundation in 2012 and the Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge in 2013 (Heilman & Madnani, 2013).
It has also been validated on responses from different content areas, including science, reading comprehension and mathematics.
In addition to assessing whether test takers understand a concept, content scoring may be used to evaluate whether a writer has successfully used source material — for example, test questions that require students to read one or more passages and include relevant information from these sources in an effective response.
ETS has also designed an algorithm that can quantify the extent to which test takers make appropriate use of the given sources in their responses. For example, it can quantify not only how much of the information from a specific source was used in the response but also the importance of that information (Beigman Klebanov et al., 2014).
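To make the idea concrete, here is a much-simplified sketch of importance-weighted source use. The source sentences, the importance weights and the word-overlap measure are invented stand-ins, not the models described in Beigman Klebanov et al. (2014).

```python
import re

# Much-simplified sketch of importance-weighted source use. The source
# sentences and their importance weights are invented; in practice the
# weights would come from a trained content-importance model.
SOURCE_SENTENCES = {
    "Greenhouse gases trap heat in the atmosphere.": 0.9,  # central point
    "The report was released in the spring.": 0.1,         # peripheral detail
}


def tokens(text: str) -> set:
    """Lowercase word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z']+", text.lower()))


def overlap(source_sentence: str, response: str) -> float:
    """Fraction of the source sentence's words that reappear in the response."""
    source_words = tokens(source_sentence)
    return len(source_words & tokens(response)) / len(source_words)


def source_use(response: str) -> float:
    """Importance-weighted average of how much each source sentence is reused."""
    total_weight = sum(SOURCE_SENTENCES.values())
    weighted = sum(
        weight * overlap(sentence, response)
        for sentence, weight in SOURCE_SENTENCES.items()
    )
    return weighted / total_weight


print(round(source_use("Greenhouse gases trap heat, which warms the planet."), 2))
```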
Our current research also focuses on extending the range of applications for automated content assessment. For example, we are investigating how best to use automated systems to provide feedback — for example, on content knowledge or use of sources — in the classroom or in online classes.
Featured Publications
Below are some recent or significant publications that our researchers have authored on the subject of automated scoring of written content.
2018
- Validation of Automated Scoring for a Formative Assessment That Employs Scientific Argumentation
L. Mao, O. L. Liu, K. Roohr, V. Belur, M. Mulholland, H.-S. Lee, & A. Pallant
Journal of Educational Assessment, Vol. 23, No. 2, pp. 121–138
This paper presents preliminary validity evidence supporting the use of automated scoring in a computer-based formative assessment designed to support students' construction and revision of scientific arguments.
2017
- Investigating Neural Architectures for Short Answer Scoring
B. Riordan, A. Horbach, A. Cahill, T. Zesch, & C. M. Lee
Paper in Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 159–168
This paper investigates how several neural network approaches similar to those used for automated essay scoring perform on short-answer scoring.
- A Large Scale Quantitative Exploration of Modeling Strategies for Content Scoring
N. Madnani, A. Loukina, & A. Cahill
Paper in Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 457–467
The authors explore supervised learning strategies for automated scoring of content knowledge on a large corpus of 130 different content-based questions spanning four subject areas (science, math, English language arts, and social studies) and containing over 230,000 responses scored by human raters.
- Investigating the Impact of Automated Feedback on Students’ Scientific Argumentation
M. Zhu, H.-S. Lee, T. Wang, O. L. Liu, V. Belur, & A. Pallant
International Journal of Science Education, Vol. 39, No. 12, pp. 1648–1668
This paper investigates the role of automated scoring and feedback in supporting students' construction of written scientific arguments while learning about factors that affect climate change in the classroom.
2016
- Automatically Scoring Tests of Proficiency in Music Instruction
N. Madnani, A. Cahill, & B. Riordan
Paper in Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 217–222
This paper presents preliminary work on automatically scoring constructed responses elicited as part of a certification test designed to measure the effectiveness of the test taker as a K-12 music teacher.
- Use of Automated Scoring and Feedback in Online Interactive Earth Science Tasks
M. Zhu, O. L. Liu, L. Mao, & A. Pallant
Paper in Proceedings of the 2016 IEEE Integrated STEM Education Conference, pp. 224–230
In this paper, the authors analyzed log data to examine the granularity of students' interactions with automated scores and feedback and investigated the association between various student behaviors and science performance.
- Validation of Automated Scoring of Science Assessments
O. L. Liu, J. A. Rios, M. Heilman, L. Gerard, & M. C. Linn
Journal of Research in Science Teaching
This paper presents results on automated scoring of eight science items that require students to use evidence to explain complex phenomena.
2015
- Effective Feature Integration for Automated Short Answer Scoring
K. Sakaguchi, M. Heilman, & N. Madnani
Paper in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
This paper explores methods for combining scoring guidelines and exemplars with labeled responses for automated scoring of short-answer responses.
- The Impact of Training Data on Automated Short Answer Scoring Performance
M. Heilman & N. Madnani
Paper in Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications
This paper describes a suite of experiments designed to improve our understanding of how training-set size and other factors relate to system performance in short-answer scoring.
- Reducing Annotation Efforts in Supervised Short Answer Scoring
T. Zesch, M. Heilman, & A. Cahill
Paper in Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications
This paper investigates the potential of semi-supervised learning based on clustering to reduce the cost of gathering the labeled data typically required to build supervised models for short-answer scoring.
2014
- Content Importance Models for Scoring Writing From Sources
B. Beigman Klebanov, N. Madnani, J. Burstein, & S. Somasundaran
Paper in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 247–252
The authors describe an integrative summarization task used in an assessment of English proficiency for nonnative speakers applying to higher education institutions in the United States. They evaluate a variety of models for capturing the use of source materials in the summaries, which helps predict success on the task.
- Automated Scoring of Constructed-Response Science Items: Prospects and Obstacles
O. L. Liu, C. Brew, J. Blackmore, L. Gerard, J. Madhok, & M. C. Linn
Educational Measurement: Issues and Practice, Vol. 33, No. 2, pp. 19–28
This study tested a concept-based scoring tool for automated content-based scoring.
2013
- Automated Short Answer Scoring
C. Brew & C. Leacock
In Handbook of Automated Essay Evaluation: Current Applications and New Directions. New York, NY: Routledge
This book chapter introduces the technology used for short-answer scoring and explains some of its uses.
- HENRY-CORE: Domain Adaptation and Stacking for Text Similarity
M. Heilman & N. Madnani
Paper in Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity
This paper describes a system for automatically measuring the semantic similarity between two texts, which was the aim of the 2013 Semantic Textual Similarity (STS) task.
- ETS: Domain Adaptation and Stacking for Short-Answer Scoring
M. Heilman & N. Madnani
Paper in Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pp. 275–279
This paper describes a system for automated scoring of short answers, which was the aim of the 2013 Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge.
- Automated Scoring of a Summary Writing Task Designed to Measure Reading Comprehension
N. Madnani, J. Burstein, J. Sabatini, & T. O'Reilly
Paper in Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 163–168
In this paper, the authors introduce a cognitive framework for measuring reading comprehension that includes the use of novel summary writing tasks.
2012
- Towards Effective Tutorial Feedback for Explanation Questions: A Dataset and Baselines
M. O. Dzikovska, R. D. Nielsen, & C. Brew
Paper in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 200–210
The authors propose a new shared task for grading student answers, where the goal is to enable targeted and flexible feedback in the form of a tutorial dialogue. They suggest that the accompanying corpus will be of interest to researchers working on textual entailment and will stimulate new developments in natural language processing for tutorial dialogue systems as well as in textual entailment, contradiction detection, and other techniques relevant to a variety of computational linguistics tasks.
- Measuring the Use of Factual Information in Test-Taker Essays
B. Beigman Klebanov & D. Higgins
Paper in Proceedings of the 7th Workshop on the Innovative Use of NLP for Building Educational Applications, pp. 63–72
The authors studied how to measure the use of factual information in test-taker essays and how to assess its effectiveness in predicting essay scores. The paper also discusses implications for the development of automated essay scoring systems.
2010
- Building a Textual Entailment Suite for Evaluating Content Scoring Technologies
J. Sukkarieh & E. Bolge
Paper in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta
This paper presents a new data set designed to serve as a basis for comparative evaluations of short-answer scoring systems.
2009
- c-rater™: Automatic Content Scoring for Short Constructed Responses
J. Z. Sukkarieh & J. Blackmore
Paper in Proceedings of the 22nd International FLAIRS Conference, Association for the Advancement of Artificial Intelligence, pp. 290–295
This paper describes some of the recent major developments in c-rater, an ETS technology used for automatic content scoring of short, free-text responses.
2008
- Effect of Immediate Feedback and Revision on Psychometric Properties of Open-Ended GRE® Subject Test Items
Y. Attali & D. Powers
ETS Research Report No. RR-08-21
In this study, registered examinees for the GRE Subject Tests in Biology and Psychology participated in a web-based experiment in which they answered open-ended questions that were automatically scored by the c-rater scoring engine. Study participants received immediate feedback and an opportunity to revise their answers.
- Leveraging c-rater's Automated Scoring Capability for Providing Instructional Feedback for Short Constructed Responses
J. Z. Sukkarieh & E. Bolge
Lecture Notes in Computer Science: Vol. 5091. Proceedings of the 9th International Conference on Intelligent Tutoring Systems (ITS 2008), pp. 779–783. New York, NY: Springer
This paper describes ETS's c-rater engine and considers its potential as an instructional tool.
2003
- c-rater: Scoring of Short-Answer Questions
C. Leacock & M. Chodorow
Computers and the Humanities, Vol. 37, pp. 389–405
In this article, the authors describe the c-rater engine's use in two studies, one involving the National Assessment of Educational Progress (NAEP) and the other a statewide assessment in Indiana.
Find More Articles
View more research publications related to automated scoring of written content.