Automated Scoring of Written Content

When test takers respond correctly to multiple-choice questions designed to gauge whether they understand a concept or the ideas in a passage they have read, they demonstrate the ability to recognize and select the answer from a list of options. It is hard to say with certainty, however, whether they could have produced the correct answer if it had not been offered as one of the options. Asking students to write a sentence or to "fill in the blank" tests their understanding more directly. So, if the costs of grading and the time needed to return results can be controlled, short written answers of a few sentences would often be preferable to multiple-choice responses.

ETS has developed a technology, the c-rater™ automated scoring engine, that can accurately score the content of short written responses. The c-rater engine has been validated on responses from multiple testing programs and in many different content areas, including science, reading comprehension and history.

The c-rater engine uses natural language processing to assess whether a student response contains text that corresponds to the concepts listed in the rubric for an item. To identify these concepts, the engine applies a sequence of natural language processing steps (a simplified sketch follows the list below), including:

  • correcting students' spelling
  • determining the grammatical structure of each sentence
  • resolving pronoun reference
  • reasoning about words and their senses
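
The following Python sketch is a deliberately simplified illustration of how such a pipeline might be organized; it is not the c-rater engine's implementation. The rubric, the lookup tables, and the match_concepts function are invented for this example, and each step stands in for a much richer statistical or linguistic model.

```python
# A toy concept-matching pipeline for a short-answer item about evaporation.
# Every component here is a simplified stand-in: a real scoring engine uses
# trained models for spelling correction, parsing, coreference resolution,
# and paraphrase recognition rather than small lookup tables.

import re

# Hypothetical rubric: each concept is a set of lemmas that must all appear
# in the normalized response for the concept to count as present.
RUBRIC = {
    "heat_source": {"sun", "heat"},
    "state_change": {"water", "vapor"},
}

SPELLING_FIXES = {"watter": "water", "heet": "heat", "vapour": "vapor"}
SYNONYMS = {"sunlight": "sun", "warmth": "heat", "steam": "vapor"}


def correct_spelling(tokens):
    """Toy spelling correction via a small lookup table."""
    return [SPELLING_FIXES.get(t, t) for t in tokens]


def resolve_pronouns(tokens, topic):
    """Toy pronoun resolution: treat 'it' as referring to the item's topic."""
    return [topic if t == "it" else t for t in tokens]


def normalize_senses(tokens):
    """Toy stand-in for reasoning about words and their senses."""
    return [SYNONYMS.get(t, t) for t in tokens]


def match_concepts(response, topic="water"):
    """Return the rubric concepts covered by a student response."""
    tokens = re.findall(r"[a-z]+", response.lower())
    tokens = correct_spelling(tokens)
    tokens = resolve_pronouns(tokens, topic)
    tokens = normalize_senses(tokens)
    present = set(tokens)
    return {name for name, words in RUBRIC.items() if words <= present}


print(match_concepts("The sunlight heet the watter and it turns into steam."))
# -> both 'heat_source' and 'state_change' are detected
```

In a real engine, each of these steps draws on trained models and hand-built linguistic resources rather than the small tables used here.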

Because the c-rater engine performs deep linguistic analysis, it can avoid being misled by responses that use the right words in the wrong context, a kind of response that students commonly produce. Purely statistical, word-based approaches such as latent semantic analysis do not have access to the grammatical information needed to make this distinction, so they are frequently misled.
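
To make the limitation concrete, the short example below (an invented predator-prey item, using plain bag-of-words cosine similarity rather than latent semantic analysis itself) compares two answers that contain exactly the same words but reverse the stated relationship; any method that looks only at which words occur assigns them a perfect match.

```python
# Why a purely word-based similarity measure can be misled: these two
# answers contain exactly the same words, but only one of them states
# the relationship in the right direction.
from collections import Counter
from math import sqrt

key_answer = "the predator population falls because the prey population falls"
student = "the prey population falls because the predator population falls"

def cosine(a, b):
    """Cosine similarity between simple bag-of-words count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm

print(round(cosine(key_answer, student), 6))  # 1.0: the word counts are identical,
                                              # even though the causal direction is reversed
```

An approach that parses each sentence can recover which noun phrase is the subject of each clause, and therefore which direction the claimed relationship runs.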

Current research focuses on extending the range of applications for automated short-answer assessment. In the classroom or in online courses, the computer can provide not only a score but also feedback on particular aspects of a student's performance. Because the feedback will not always be perfect, it must be used judiciously, but studies have shown that automated feedback can be a valuable complement to the work of a human instructor.

Featured Publications

Below are some recent or significant publications that our researchers have authored on the subject of automated scoring of written content.

2013

  • Holistic Annotation of Discourse Coherence Quality in Noisy Essay Writing
    J. Burstein, J. Tetreault, & M. Chodorow
    Dialogue & Discourse (2), 34–52. Special issue: Beyond Semantics: The Challenges of Annotating Pragmatic and Discourse Phenomena (S. Dipper, H. Zinsmeister, & B. Webber, Eds.).

    This paper reviews annotation schemes used for labeling discourse coherence in well-formed and noisy (essay) data, and it describes a system that has been developed for automated holistic scoring of essay coherence.

  • Using Pivot-based Paraphrasing and Sentiment Profiles to Improve a Subjectivity Lexicon for Essay Data
    B. Beigman Klebanov, N. Madnani, & J. Burstein
    Transactions of the Association for Computational Linguistics 1: 99–110

    This paper describes a method of improving a seed sentiment lexicon developed on essay data by using a pivot-based paraphrasing system for lexical expansion coupled with sentiment profile enrichment using crowdsourcing.

  • Automated Evaluation of Discourse Coherence Quality in Essay Writing
    J. Burstein, J. Tetreault, M. Chodorow, D. Blanchard, & S. Andreyev
    In M. D. Shermis, & J. Burstein (Eds.), Handbook of Automated Essay Scoring: Current Applications and Future Directions. New York: Routledge.

    This handbook chapter examines the evaluation of discourse coherence. Topics discussed include: different perspectives that aim to define text coherence; the tangible criteria that illustrate discourse coherence quality in scoring rubrics; linguistic properties in essay data that contribute to discourse coherence; and how these features can be modeled to build coherence evaluation systems.

  • HENRY-CORE: Domain Adaptation and Stacking for Text Similarity
    M. Heilman & N. Madnani
    Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity.

    This paper describes a system for automatically measuring the semantic similarity between two texts, which was the aim of the 2013 Semantic Textual Similarity (STS) task.

  • Argumentation-Relevant Metaphors in Test-Taker Essays
    B. Beigman Klebanov & M. Flor
    Proceedings of the First Workshop on Metaphor in NLP, pp. 11–20, Atlanta, Ga. Association for Computational Linguistics.

    This article discusses metaphor annotation in a corpus of argumentative essays. It describes a metaphor annotation protocol that targets metaphors relevant to the writer's arguments and reports findings on the potential of metaphor identification for automated essay scoring.

  • Lexical Tightness and Text Complexity
    M. Flor, B. Beigman Klebanov, & K. M. Sheehan
    Proceedings of the Second Workshop of Natural Language Processing for Improving Textual Accessibility (NLP4ITA), pp. 49–58, Atlanta, Ga. Association for Computational Linguistics.

    This paper presents a computational notion of Lexical Tightness that measures global cohesion of content words in a text. Lexical tightness represents the degree to which a text tends to use words that are highly inter-associated in the language.

2012

  • Towards Effective Tutorial Feedback for Explanation Questions: A Dataset and Baselines
    M. O. Dzikovska, R. D. Nielsen, & C. Brew
    Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 200–210
    Association for Computational Linguistics

    The authors propose a new shared task for grading student answers, where the goal is to enable targeted and flexible feedback in the form of a tutorial dialogue. They suggest that this corpus will be of interest to researchers working on textual entailment and will stimulate new developments in natural language processing for tutorial dialogue systems as well as in textual entailment, contradiction detection, and other techniques of interest for a variety of computational linguistics tasks.

  • Identifying High-Level Organizational Elements in Argumentative Discourse
    N. Madnani, M. Heilman, J. Tetreault, & M. Chodorow
    Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)
    Association for Computational Linguistics

    This paper discusses argumentative discourse and the benefit of differentiating between language that expresses claims and evidence and language that organizes such claims and pieces of evidence. The authors present an approach that automatically detects high-level organizational elements in argumentative discourse by combining a rule-based system with a probabilistic sequence model.

  • Measuring the Use of Factual Information in Test-Taker Essays
    B. Beigman Klebanov & D. Higgins
    Proceedings of the 7th Workshop on the Innovative Use of NLP for Building Educational Applications, pp. 63–72

    The authors studied how to measure the use of factual information in test-taker essays and how to assess its effectiveness in predicting essay scores. The article also discusses implications for the development of automated essay scoring systems.

2009

  • c-rater™: Automatic Content Scoring for Short Constructed Responses
    J. Z. Sukkarieh & J. Blackmore
    Proceedings of the 22nd International FLAIRS Conference. Association for the Advancement of Artificial Intelligence, 2009, pp. 290–295

    This paper describes some of the recent major developments made in c-rater™, a technology at Educational Testing Service (ETS) used for automatic content scoring of short, free-text responses.

2003

  • c-rater: Automated Scoring of Short-Answer Questions
    C. Leacock & M. Chodorow
    Computers and the Humanities, Vol. 37, pp. 389–405

    In this article, the authors describe the c-rater engine's use in two studies, one involving the National Assessment of Educational Progress (NAEP) and the other a statewide assessment in Indiana.
