Automated Scoring of Written Content
When test takers answer multiple-choice questions correctly, whether the questions gauge their grasp of a concept or of ideas in a passage they have read, they have shown that they can recognize and select the answer from a list of options. It is hard to say with certainty, however, whether they could have produced the correct answer had it not been among the options. If we instead allow students to write a sentence or to "fill in the blank," their understanding can be tested more directly. So, if the costs of grading and the time needed to return results can be controlled, short written answers consisting of a few sentences of text would often be preferable to multiple-choice responses.
ETS has developed a technology, the c-rater™ automated scoring engine, that can accurately score the content of short written responses. The c-rater engine has been validated on responses from multiple testing programs and in many different content areas, including science, reading comprehension and history.
The c-rater engine's technology uses natural language processing to assess whether a student response contains text that corresponds to the concepts listed in the rubric for an item. To identify these concepts, the c-rater engine applies a sequence of natural language processing steps, including:
- correcting students' spelling
- determining the grammatical structure of each sentence
- resolving pronoun reference
- reasoning about words and their senses
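As a rough illustration only, the following Python sketch imitates two of these ideas in a much simplified form: spelling correction against a known vocabulary, and checking a response for rubric concepts. It is not the c-rater engine's implementation. The vocabulary, the rubric, the function names and the similarity cutoff are all hypothetical, and the grammatical analysis and pronoun resolution steps are left out entirely.

```python
import difflib
import re

# Hypothetical vocabulary and rubric, invented for this illustration.
VOCABULARY = ["plants", "absorb", "sunlight", "water", "produce", "oxygen", "energy"]
RUBRIC_CONCEPTS = {
    "uses_sunlight": {"sunlight"},
    "produces_oxygen": {"produce", "oxygen"},
}


def correct_spelling(token, vocabulary=VOCABULARY):
    """Return the closest vocabulary word, or the token itself if none is close."""
    # The 0.75 cutoff is an arbitrary choice made for this sketch.
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=0.75)
    return matches[0] if matches else token


def matched_concepts(response):
    """Return the names of rubric concepts whose keywords all occur in the response."""
    tokens = re.findall(r"[a-z]+", response.lower())
    corrected = {correct_spelling(token) for token in tokens}
    return {
        name
        for name, required in RUBRIC_CONCEPTS.items()
        if required <= corrected  # every required keyword is present
    }


found = matched_concepts("Plants use sunlite and prodce oxygen")
```

Here the misspelled "sunlite" and "prodce" are mapped to "sunlight" and "produce" before matching, so both hypothetical concepts are found. A keyword check of this kind ignores sentence structure, which is the gap the remaining steps are meant to close.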
The c-rater engine's use of deep linguistic analysis allows it to avoid being misled by responses that use the right words in the wrong context, a type of response that students commonly produce. Purely statistical approaches based on words, such as latent semantic analysis, do not have access to the grammatical information that is needed, so they will frequently be misled by such responses.
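The limitation of word-based approaches can be seen in a small Python sketch. The two sentences are invented examples, and the word-count representation stands in for the kind of input that word-based statistical methods work from. It is not a reproduction of any particular system.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Count each word in the text, ignoring case and word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def ordered_tokens(text):
    """Return the words of the text in their original order."""
    return re.findall(r"[a-z]+", text.lower())


correct = "The predator eats the prey"
reversed_roles = "The prey eats the predator"

# The word counts are identical even though the meanings are opposite.
same_bag = bag_of_words(correct) == bag_of_words(reversed_roles)

# The difference only becomes visible once word order is taken into account.
same_order = ordered_tokens(correct) == ordered_tokens(reversed_roles)
```

Both sentences contain exactly the same words, so a model that sees only word counts cannot tell which animal is doing the eating. Telling the two apart requires at least word order, and in general the grammatical structure of the sentence.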
Current research focuses on extending the range of applications for automated short-answer assessment. In the classroom, or in online classes, the computer can be set up to provide not only a score but also feedback on particular aspects of the student’s performance. Because the feedback will not always be perfect, it must be used judiciously, but studies have shown that automated feedback can be a valuable complement to the work of a human instructor.
Below are some recent or significant publications that our researchers have authored on the subject of automated scoring of written content.
Content Importance Models for Scoring Writing From Sources
B. Beigman Klebanov, N. Madnani, J. Burstein, & S. Somasundaran
Paper in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 247–252
The authors describe an integrative summarization task used in an assessment of English proficiency for nonnative speakers applying to higher education institutions in the United States. They evaluated a variety of models for capturing the use of source materials in the summaries that help predict success on the task. View citation record >
Automated Measures of Specific Vocabulary Knowledge from Constructed Responses ("Use These Words to Write a Sentence Based on this Picture")
S. Somasundaran & M. Chodorow
Paper in Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 1–11
This paper describes a system for automatically scoring a vocabulary item type that asks test-takers to use two specific words in writing a sentence based on a picture. The system focuses on vocabulary in English proficiency tests for nonnative speakers. View citation record >
Word Association Profiles and their Use for Automated Scoring of Essays
B. Beigman Klebanov & M. Flor
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: Sofia, Bulgaria, Aug. 4–9, 2013, pp. 1148–1158
The authors describe a new representation of the content vocabulary of a text (here called word association profile) that captures the proportions of highly associated, mildly associated, unassociated, and disassociated pairs of words that co-exist in the given text. Higher-scoring essays (from a sample written by college graduates on a number of general topics) tend to have higher percentages of both highly associated and disassociated pairs, and lower percentages of mildly associated pairs of words. View citation record >
Holistic Annotation of Discourse Coherence Quality in Noisy Essay Writing
J. Burstein, J. Tetreault, & M. Chodorow
In the Special issue of Dialogue and Discourse on: Beyond semantics: the challenges of annotating pragmatic and discourse phenomena (Eds. S. Dipper, H. Zinsmeister, & B. Webber). Dialogue & Discourse (2), 34–52.
This paper reviews annotation schemes used for labeling discourse coherence in well-formed and noisy (essay) data, and it describes a system that has been developed for automated holistic scoring of essay coherence. View citation record >
Using Pivot-based Paraphrasing and Sentiment Profiles to Improve a Subjectivity Lexicon for Essay Data
B. Beigman Klebanov, N. Madnani, & J. Burstein
Transactions of the Association for Computational Linguistics 1: 99–110
This paper describes a method of improving a seed sentiment lexicon developed on essay data by using a pivot-based paraphrasing system for lexical expansion coupled with sentiment profile enrichment using crowdsourcing. View citation record >
Automated Evaluation of Discourse Coherence Quality in Essay Writing
J. Burstein, J. Tetreault, M. Chodorow, D. Blanchard, & S. Andreyev
In M. D. Shermis, & J. Burstein (Eds.), Handbook of Automated Essay Scoring: Current Applications and Future Directions. New York: Routledge.
In this handbook chapter, the evaluation of discourse coherence is examined. Topics discussed include: different perspectives that aim to define text coherence; the tangible criteria that illustrate discourse coherence quality in scoring rubrics; linguistic properties in essay data that contribute to discourse coherence; and a description of how these features can be modeled to build coherence evaluation systems. View citation record >
HENRY-CORE: Domain Adaptation and Stacking for Text Similarity
M. Heilman & N. Madnani
Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity.
This paper describes a system for automatically measuring the semantic similarity between two texts, which was the aim of the 2013 Semantic Textual Similarity (STS) task. View citation record >
Argumentation-Relevant Metaphors in Test-Taker Essays
B. Beigman Klebanov & M. Flor
Proceedings of the First Workshop on Metaphor in NLP, pp. 11–20, Atlanta, Ga. Association for Computational Linguistics
This article discusses metaphor annotation in a corpus of argumentative essays. A metaphor annotation protocol that targets metaphors that are relevant for the writer’s arguments is described, as are findings regarding the potential of using metaphor identification in automated essay scoring. View citation record >
Lexical Tightness and Text Complexity
M. Flor, B. Beigman Klebanov, & K. M. Sheehan
Proceedings of the Second Workshop on Natural Language Processing for Improving Textual Accessibility (NLP4ITA), pp. 49–58, Atlanta, Ga. Association for Computational Linguistics.
This paper presents a computational notion of Lexical Tightness that measures global cohesion of content words in a text. Lexical tightness represents the degree to which a text tends to use words that are highly inter-associated in the language. View citation record >
Towards Effective Tutorial Feedback for Explanation Questions: A Dataset and Baselines
M. O. Dzikovska, R. D. Nielsen, & C. Brew
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 200–210. Association for Computational Linguistics
The authors propose a new shared task for grading student answers, where the goal is to enable targeted and flexible feedback in the form of a tutorial dialogue. They suggest that the corpus will be of interest to researchers working on textual entailment and will stimulate new developments in natural language processing for tutorial dialogue systems, as well as in textual entailment, contradiction detection and other techniques of interest for a variety of computational linguistics tasks.
Identifying High-Level Organizational Elements in Argumentative Discourse
N. Madnani, M. Heilman, J. Tetreault, & M. Chodorow (2012)
Proceedings of the 2012 Meeting of the North American Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)
Publisher: Association for Computational Linguistics
This paper discusses argumentative discourse and the benefit of differentiating between language that expresses claims and evidence and language that is used to organize those claims and pieces of evidence. The authors propose an automated approach to detecting high-level organizational elements in argumentative discourse that combines a rule-based system with a probabilistic sequence model. View citation record >
Measuring the Use of Factual Information in Test-Taker Essays
B. Beigman Klebanov & D. Higgins
Proceedings of the 7th Workshop on the Innovative Use of NLP for Building Educational Applications, pp. 63–72
The authors studied how to measure the use of factual information in test-taker essays and how to assess its effectiveness when predicting essay scores. The article also discusses implications for development of automated essay scoring systems. View citation record >
c-rater™: Automatic Content Scoring for Short Constructed Responses
J. Z. Sukkarieh & J. Blackmore
Proceedings of the 22nd International FLAIRS Conference. Association for the Advancement of Artificial Intelligence, 2009, pp. 290–295
This paper describes some of the recent major developments made in c-rater, a technology at Educational Testing Service (ETS) used for automatic content scoring of short, free-text responses. View citation record >
Effect of Immediate Feedback and Revision on Psychometric Properties of Open-Ended GRE® Subject Test Items
Y. Attali & D. Powers
ETS Research Report No. RR-08-21
In this study, registered examinees for the GRE Subject Tests in Biology and Psychology participated in a web-based experiment where they answered open-ended questions that were automatically scored by the c-rater scoring engine. Study participants received immediate feedback and an opportunity to revise their answers. View citation record >
Leveraging c-rater's Automated Scoring Capability for Providing Instructional Feedback for Short Constructed Responses
J. Z. Sukkarieh & E. Bolge
Lecture Notes in Computer Science: Vol. 5091. Proceedings of the 9th International Conference on Intelligent Tutoring Systems, ITS 2008, pp. 779–783
This paper describes ETS's c-rater engine and considers its potential as an instructional tool. View citation record >
c-rater: Scoring of Short-Answer Questions
C. Leacock & M. Chodorow
Computers and the Humanities, Vol. 37, pp. 389–405
In this article, the authors describe the c-rater engine's use in two studies, one involving the National Assessment of Educational Progress (NAEP) and the other a statewide assessment in Indiana. View citation record >