Automated Scoring of Speech

ETS's SpeechRater engine is the world's most advanced spoken-response scoring application designed to score spontaneous responses, in which the range of valid responses is open-ended rather than narrowly determined by the item stimulus. Test takers preparing for the TOEFL® test have had their responses scored by the SpeechRater engine as part of the TOEFL Practice Online (TPO™) practice tests since 2006. Competing capabilities focus on assessing low-level aspects of speech production, such as pronunciation, and use restricted tasks to increase reliability. The SpeechRater engine, by contrast, is based on a broad conception of the construct of English-speaking proficiency, encompassing aspects of speech delivery (such as pronunciation and fluency), grammatical facility and higher-level abilities related to topical coherence and the progression of ideas.

The SpeechRater engine processes each response with an automated speech recognition system specially adapted for use with nonnative English. Based on the output of this system, natural language processing (NLP) and speech-processing algorithms are used to calculate a set of features that define a "profile" of the speech on a number of linguistic dimensions, including fluency, pronunciation, vocabulary usage, grammatical complexity and prosody. A model of speaking proficiency is then applied to these features in order to assign a final score to the response. While this model is trained on previously observed data scored by human raters, it is also reviewed by content experts to maximize its validity. Furthermore, if the response is found to be unscorable due to audio quality or other issues, the SpeechRater engine can set it aside for special processing.
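
The processing steps described above can be illustrated with a minimal Python sketch. It is not the SpeechRater implementation: the `Word` record, the four features, the weights and thresholds, and the 1 to 4 score range are all hypothetical, chosen only to show the general shape of a pipeline that turns recognizer output into features, sets aside unscorable responses and applies a linear scoring model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Word:
    """One recognized word with timing (hypothetical ASR output format)."""
    text: str
    start: float       # seconds from the start of the recording
    end: float
    confidence: float  # recognizer confidence in [0, 1]


def extract_features(words: List[Word]) -> Dict[str, float]:
    """Compute a small, illustrative feature profile from timed words."""
    duration = words[-1].end - words[0].start
    pauses = [nxt.start - cur.end for cur, nxt in zip(words, words[1:])]
    return {
        "speaking_rate": len(words) / duration,                      # fluency
        "long_pause_count": float(sum(1 for p in pauses if p > 0.5)),
        "type_token_ratio": len({w.text.lower() for w in words}) / len(words),
        "mean_confidence": sum(w.confidence for w in words) / len(words),
    }


# Made-up weights standing in for a model fit to human-rated responses.
WEIGHTS = {
    "speaking_rate": 0.6,
    "long_pause_count": -0.2,
    "type_token_ratio": 1.0,
    "mean_confidence": 0.5,
}
INTERCEPT = 0.5


def score_response(words: List[Word],
                   min_words: int = 5,
                   min_confidence: float = 0.3) -> Optional[float]:
    """Return a score on a hypothetical 1-4 scale, or None if unscorable."""
    # Too little speech or unusable timing: set the response aside.
    if not words or len(words) < min_words or words[-1].end <= words[0].start:
        return None
    features = extract_features(words)
    # Very low recognizer confidence is used here as a proxy for bad audio.
    if features["mean_confidence"] < min_confidence:
        return None
    raw = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return max(1.0, min(4.0, raw))
```

In this sketch a return value of `None` marks a response that would be routed to special processing instead of receiving a score.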

ETS's research agenda related to automated scoring of speech includes the development of more extensive NLP features to represent pragmatic competencies and the discourse structure of spoken responses. The core capability has also been extended to apply across a range of item types used in different assessments of English proficiency, from very restricted item types (such as passage read-alouds) to less restricted items (such as summarization tasks).

Featured Publications

Below are some recent or significant publications that our researchers have authored on the subject of automated scoring of speech, spoken dialog systems, and multimodal assessments.




  • Automated Scoring Across Different Modalities
    A. Loukina & A. Cahill
    Paper in Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 130–135

    This article investigates the application of automated scoring systems that were originally developed for scoring text (short answers and essays) to non-native spoken English in combination with the SpeechRater automated scoring service.

  • Self-Adaptive DNN for Improving Spoken Language Proficiency Assessment
    Y. Qian, X. Wang, K. Evanini, & D. Suendermann-Oeft
    Paper in Proceedings of the 17th Annual Conference of the International Speech Communication Association (INTERSPEECH 2016), pp. 3122–3126

    This article presents how a self-adaptive DNN trained with i-vectors on a corpus of non-native speech can improve the performance of an automated speech recognizer and an automated speech scoring system.





  • A Comparison of Two Scoring Methods for an Automated Speech Scoring System
    X. Xi, D. Higgins, K. Zechner, & D. Williamson
    Language Testing, Vol. 29, No. 3, pp. 371–394

    In this paper, researchers compare two alternative scoring methods for an automated scoring system for speech. The authors discuss tradeoffs between multiple regression and classification tree models.

  • Exploring Content Features for Automated Speech Scoring
    S. Xie, K. Evanini, & K. Zechner
    Paper in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103–111

    Researchers explore content features for automated speech scoring in this paper about automated scoring of unrestricted spontaneous speech. The paper compares content features based on three similarity measures in order to understand how well content features represent the accuracy of the content of a spoken response.

  • Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
    S. Yoon & S. Bhat
    Paper in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 600–608

    In this paper, researchers present a method that measures English language learners' syntactic competence for automated speech scoring systems. The authors discuss the advantage of the current measures, which are based on natural language processing techniques and corpora, over conventional ELL measures.



Spoken Dialog Systems




Multimodal Assessments



  • MAP: Multimodal Assessment Platform for Interactive Communication Competency
    S. Khan, D. Suendermann-Oeft, K. Evanini, D. Williamson, S. Paris, Y. Qian, Y. Huang, P. Bosch, S. D'Mello, A. Loukina, & L. Davis
    In S. Shehata & J. P.-L. Tan (Eds.), Practitioner Track Proceedings of the 7th International Learning Analytics & Knowledge Conference, pp. 6–12

    In this paper, we describe a prototype system for automated, interactive human communication assessment. The system processes multimodal data captured in a variety of human-human and human-computer interactions, integrates speech and face recognition-based biometric capabilities, and ranks and indexes large collections of assessment content.

  • Crowdsourcing Ratings of Caller Engagement in Thin-Slice Videos of Human-Machine Dialog: Benefits and Pitfalls
    V. Ramanarayanan, C. Leong, D. Suendermann-Oeft, & K. Evanini
    Paper in Proceedings of ICMI 2017, 19th ACM International Conference on Multimodal Interaction, pp. 281–287

    We analyze the efficacy of different crowds of naïve human raters in rating engagement during human–machine dialog interactions. Each rater viewed multiple 10-second, thin-slice videos of native and non-native English speakers interacting with a computer-assisted language learning (CALL) system and rated how engaged or disengaged those callers were while interacting with the automated agent.

  • Crowdsourcing Multimodal Dialog Interactions: Lessons Learned From the HALEF Case
    V. Ramanarayanan, D. Suendermann-Oeft, H. Molloy, E. Tsuprun, P. Lange, & K. Evanini
    Paper in Proceedings of the Workshop on Crowdsourcing, Deep Learning and Artificial Intelligence Agents at the Thirty-First AAAI Conference on Artificial Intelligence, pp. 423–431

    We present a retrospective on collecting data of human interactions with multimodal dialog systems (“dialog data”) using crowdsourcing techniques. This is largely based on our experience using the HALEF multimodal dialog system to deploy education-domain conversational applications on the Amazon Mechanical Turk crowdsourcing platform.

  • An Open-Source Dialog System With Real-Time Engagement Tracking for Job Interview Training Applications
    Z. Yu, V. Ramanarayanan, P. Lange, & D. Suendermann-Oeft
    Paper in Proceedings of IWSDS 2017, International Workshop on Spoken Dialog Systems, pp. 1–9

    We designed and implemented a dialog system that tracks and reacts to a user’s state, such as engagement, in real time. We designed and implemented a conversational job interview task based on the proposed framework. The system acts as an interviewer and reacts to the user’s disengagement in real time with positive feedback strategies designed to re-engage the user in the job interview process.


  • Assembling the Jigsaw: How Multiple Open Standards Are Synergistically Combined in the HALEF Multimodal Dialog System
    V. Ramanarayanan, D. Suendermann-Oeft, P. Lange, R. Mundkowsky, A. Ivanov, Z. Yu, Y. Qian, & K. Evanini
    In D. Dahl (Ed.), Multimodal Interaction With W3C Standards: Towards Natural User Interfaces to Everything, Springer, pp. 295–310

    In this chapter, we examine how an open-source, modular, multimodal dialog system, HALEF, can be seamlessly assembled, much like a jigsaw puzzle, by putting together multiple distributed components that are compliant with the W3C recommendations or other open industry standards.

  • Multimodal HALEF: An Open-Source Modular Web-Based Multimodal Dialog Framework
    Z. Yu, V. Ramanarayanan, R. Mundkowsky, P. Lange, A. Ivanov, A. Black, & D. Suendermann-Oeft
    Paper in Proceedings of IWSDS 2016, International Workshop on Spoken Dialog Systems, pp. 1–11

    We describe recent developments and preliminary research results on extending the HALEF spoken dialog system to other modalities, in particular the capture of video feeds in web browser sessions. This technology enables the roll-out of multimodal dialog systems to a massive user base, as exemplified by the use of Amazon Mechanical Turk for data collection using Multimodal HALEF.

