By: Hans Sandberg
How do you find out how well a language learner can speak English? One way is to give a test and then have a human expert evaluate the learner’s performance, but this method can be expensive if the number of test takers is large. Another way is to digitally record the spoken answers and let computers score them. Focus on ETS R&D asked ETS scientists Keelan Evanini, Spiros Papageorgiou and Klaus Zechner to share some of their thoughts on automated assessment of spoken English.
Why use computers to score spoken English? How is it done and how can the quality be ensured?
Klaus Zechner: The main reason is that computers can provide fast, cost-effective and consistent scores for spoken responses on a large scale. How is it done? Most current automated speech scoring systems use multiple components, including automatic speech recognition, extraction and computation of specific linguistic measures ("speech features") and a scoring model, which predicts how proficient a test taker is, expressed as a score. Scoring models use the extracted features for their predictions and are typically trained on previously scored responses.
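The pipeline described here (recognize speech, extract features, apply a scoring model) can be sketched in a few lines of Python. The feature names, weights and linear model below are hypothetical stand-ins for components that a real system would train on human-scored responses:

```python
# Minimal sketch of a scoring pipeline: ASR transcript in, score out.
# Feature names, weights and bias are illustrative placeholders; a real
# scoring model is trained on previously human-scored responses.

def extract_features(transcript, duration_sec):
    """Compute simple "speech features" from an ASR transcript."""
    words = transcript.split()
    return {
        "speaking_rate": len(words) / duration_sec,        # words per second
        "type_token_ratio": len(set(words)) / len(words),  # vocabulary diversity
    }

WEIGHTS = {"speaking_rate": 1.2, "type_token_ratio": 2.5}  # hypothetical
BIAS = 0.5

def predict_score(features):
    """Linear scoring model over the extracted features."""
    return BIAS + sum(WEIGHTS[name] * value for name, value in features.items())

features = extract_features("the quick brown fox jumps over the lazy dog", 4.0)
score = predict_score(features)  # a proficiency score on an arbitrary scale
```

In practice the scoring model would be fit to human ratings rather than hand-weighted, and the feature set would be far richer.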
How do we ensure the quality of the automated scores? We look at how closely scores by machines match those by human raters and the extent to which the features used in the scoring engine represent the aspects of speaking proficiency that a test measures. We also try to make sure that the scores are equally accurate for different groups of people; for example, groups of test takers with different native languages.
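Human-machine agreement of this kind is commonly summarized with a chance-corrected statistic such as quadratic weighted kappa. A minimal pure-Python sketch, assuming integer scores on a known scale (the example data are made up):

```python
def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Chance-corrected agreement between two raters on an ordinal scale.

    Returns 1.0 for perfect agreement, 0.0 for chance-level agreement;
    larger disagreements are penalized quadratically.
    """
    n = max_score - min_score + 1
    observed = [[0.0] * n for _ in range(n)]
    for h, m in zip(human, machine):
        observed[h - min_score][m - min_score] += 1
    total = len(human)
    hist_h = [sum(row) for row in observed]                           # human marginals
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]  # machine marginals
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_h[i] * hist_m[j] / total  # agreement expected by chance
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den
```

On a 1-4 scale, identical score lists yield 1.0, and a single adjacent disagreement in four responses yields 0.875.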
In addition to the scores for spoken responses, computers can also provide detailed feedback on aspects of speaking proficiency, such as fluency, vocabulary diversity, grammatical accuracy and content. That would be difficult and costly if we had a lot of test takers and only used human raters.
Spiros Papageorgiou: The use of computers goes beyond simply providing a test score. Computer-delivered tools and applications have been used for a while in the classroom to give students a chance to hear authentic language when working on language learning tasks. This is especially important for students from an environment where English is not commonly spoken. However, students need more than access to computer-delivered language tasks to improve their English-language proficiency. They need detailed and targeted feedback on their strengths and weaknesses, and guidance on how to improve their proficiency. At the same time, teachers are often too busy during class to provide each individual student with detailed feedback. Computers are becoming an invaluable tool in the classroom, as they can deliver personalized feedback on critical aspects of oral production and interaction.
Keelan Evanini: Automated speech scoring can benefit language learners across a range of contexts. For beginning learners, it can provide targeted feedback on the pronunciation of specific words; for developing learners, it can give feedback on task completion in interactive, goal-oriented, simulated conversational tasks; and for advanced learners, it can contribute to a score, in combination with human ratings, in a high-stakes assessment of academic English speaking proficiency. In these and other use cases, automated scoring can provide language learners and test takers with feedback and deliver scores much more quickly than if we had to rely solely on human raters. In addition, researchers can analyze learners' repeated engagement with automated scoring systems to determine which aspects of a learner's speaking proficiency are in the greatest need of improvement, which can lead to more personalized feedback and curricula.
However, for any application of automated speech scoring, it is also important to demonstrate that the system is valid for its intended use. This can be done by including measures of all aspects of the speaking construct targeted by the specific speaking tasks. If demonstrating a test taker's proficiency requires analysis of a spoken response that is beyond current state-of-the-art automated scoring capabilities, such as evaluating the appropriate use of argumentation strategies, then we may have to use both automated and human speech scoring to cover the construct fully.
What kind of research is ETS doing on automated speech scoring?
Klaus Zechner: We focus on the conceptualization, development and evaluation of a wide array of "speech features." These measure various aspects of spoken proficiency (e.g., speaking rate, length of sustained speech, diversity of vocabulary and cohesion of spoken discourse). Different tests, and different items within tests, may target different aspects of speaking proficiency. So we should ideally have a close match between aspects of speech that we measure with an automated system (by means of speech features) and those we consider important, relevant and necessary for a particular language test or test item. For instance, the speaking section of the TOEFL iBT® test evaluates fluency, pronunciation, intonation, vocabulary range, grammar diversity, content and discourse coherence. Over the years, ETS has developed features that can be automatically extracted from a spoken response in all of these areas.
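As an illustration of such features, fluency measures like the length of sustained speech can be derived from word-level ASR timestamps. This is a simplified sketch; the pause threshold and feature definitions are assumptions for illustration, not ETS's actual features:

```python
# Hypothetical fluency features computed from word-level ASR timestamps,
# given as (start_sec, end_sec) pairs. The 0.5 s pause threshold is an
# illustrative assumption, not an actual ETS parameter.

def fluency_features(word_times, pause_threshold=0.5):
    # Gaps between consecutive words that are long enough to count as pauses.
    pauses = [s2 - e1
              for (_, e1), (s2, _) in zip(word_times, word_times[1:])
              if s2 - e1 >= pause_threshold]
    total = word_times[-1][1] - word_times[0][0]  # total response duration
    return {
        "num_long_pauses": len(pauses),
        "mean_pause_len": sum(pauses) / len(pauses) if pauses else 0.0,
        "phonation_ratio": 1 - sum(pauses) / total,  # share of time spent speaking
    }

# Three words; only the gap before the last word counts as a long pause.
feats = fluency_features([(0.0, 0.3), (0.4, 0.8), (1.6, 2.0)])
```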
Other important areas of research in this domain are: improving the automated speech recognition component so that the system can transcribe spoken responses more accurately; developing strategies for feature selection and building of scoring models to generate scores that better match human reference scores; and identifying responses that should not be scored automatically due to various deficiencies, such as no audible sound, too much background noise or off-topic responses.
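The filtering of non-scorable responses can be illustrated with two toy checks: an energy (RMS) test for inaudible audio and a word-overlap test for off-topic responses. Real systems use far more sophisticated detectors; every threshold below is a made-up placeholder:

```python
import math

# Toy checks for responses that should not be scored automatically.
# Thresholds and the overlap heuristic are illustrative placeholders.

def flag_response(samples, transcript, prompt_words,
                  min_rms=0.01, min_overlap=0.1):
    # Energy check: near-silent audio means there is nothing to score.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < min_rms:
        return "flag: no audible speech"
    # Crude topicality check: share of response words related to the prompt.
    response_words = set(transcript.lower().split())
    overlap = (len(response_words & prompt_words) / len(response_words)
               if response_words else 0.0)
    if overlap < min_overlap:
        return "flag: possibly off-topic"
    return "ok"
```

A flagged response would then be routed to a human rater instead of being scored automatically.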
Spiros Papageorgiou: Our automated speech scoring capabilities have also helped us better understand how spoken performance varies across score levels. For example, ETS researchers use automated scoring capabilities to analyze student responses to test tasks, and to identify speech properties that are characteristic of the scores received. Identifying such critical differentiators across score levels can help students improve their oral language proficiency. Such use of automated speech scoring capabilities can also help researchers and test developers validate the content of the scoring rubrics by checking if they include properties found to be typical of spoken responses at specific score levels.
Keelan Evanini: When we began to research automated speech scoring, we focused on aspects of an English learner's delivery that make their spoken English easy to understand, including fluency, pronunciation and prosody. The initial focus on delivery was due to the fact that these aspects of the speaking construct are the most suitable for automated evaluation, and that we could build reliable systems based on them. More recently, we have expanded the construct coverage to include grammar, vocabulary, content appropriateness and discourse coherence, which has greatly improved our scoring system's validity. In the past few years, we have also begun to explore automated scoring of spoken English in interactive, dialogic conversations, whereas previous research was conducted solely on monologic responses in the form of isolated utterances. This research makes use of spoken dialog systems to deliver conversational items in which language learners interact with a computer-simulated conversational partner to complete a task, such as ordering food in a restaurant or interviewing for a job.
What should institutions, corporations and policymakers know about automated speech scoring systems?
Klaus Zechner: There are several questions that score users should ask companies that develop and sell automated speech scoring systems:
- Is the underlying automated scoring methodology accessible for review or hidden in a "black box"? We need transparency to understand how the scores are produced and whether they are meaningful in a substantive way.
- Have the automated scores been validated against external measures as we do with human scores? Such external criteria can, for example, include scores on other test sections and grades in relevant academic classes.
- What population was the system trained on and how closely does this population match the intended population (e.g., children vs. adults, one or more native language backgrounds, proficiency distribution, etc.)?
- What are the task types of interest; for example, is the focus on predictable speech such as read-aloud tasks or more open-ended, fairly unpredictable speech?
- What are the stakes of the assessment? What are the consequences for test takers if automated scoring systems provide suboptimal scores in some instances?
- Can the system be used together with human raters (e.g., in a contributory scoring approach, where both human and machine scores are combined to produce a final score)?
- How well does the system perform on the target population; for example, does it agree with scores by human raters?
- How well can the system flag responses that have suboptimal characteristics (such as a high noise level) and therefore should not be scored by the system?
- How does the automated speech scoring system gather evidence on the various speech proficiency components of interest (e.g., features computed on fluency, pronunciation, grammatical accuracy, etc.)?
- How resource-intensive is the system in terms of computer processing time, memory and disk space?
- What effort does it take to adapt the system to the specific needs of a particular deployment (e.g., different task types, different target population, etc.)?
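One of the questions above mentions a contributory scoring approach. A minimal sketch of how a final score might combine human and machine scores, with a fallback to the human rating alone when the engine flags the response (the weight and fallback policy are hypothetical):

```python
# Hypothetical contributory scoring: the reported score combines a human
# rating and a machine score, falling back to the human rating alone when
# the engine flagged the response as non-scorable. The weight is illustrative.

def contributory_score(human, machine, machine_ok=True, machine_weight=0.5):
    if not machine_ok or machine is None:
        return float(human)
    return (1 - machine_weight) * human + machine_weight * machine

final = contributory_score(3.0, 4.0)                        # equal-weight blend
fallback = contributory_score(3.0, None, machine_ok=False)  # human score alone
```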
Spiros Papageorgiou: These systems offer countless opportunities for improving language learning and assessment, but Klaus's list makes it clear that we should not treat them as "one size fits all" solutions. English-language proficiency tests, for example, aim to let international students demonstrate their ability to cope with instruction given in English, which is something admissions officers and academic advisors need to know. This means that test tasks should be chosen based on how well they represent the target language tasks, not whether they can be scored with automated scoring systems. A task where a student reads a text aloud may be well suited for an automated scoring system, but it does not reveal a student's ability to give a presentation in a less scripted, and therefore more realistic, situation. We at ETS think it is crucial that our automated scoring capabilities provide institutions, corporations and policymakers with test scores that are valid and reliable for their intended uses.
Keelan Evanini: These systems have improved tremendously, in terms of both reliability and validity, due to advances in automatic speech recognition and linguistic analysis of spoken language. They can now be considered for a wide range of cases spanning many different types of speaking tasks and contexts for language learning and assessment. That being said, users of this technology should understand that it still has limitations compared to human raters, especially with respect to assessing the appropriateness of the content and the coherence of the discourse. Institutions and policymakers should carefully evaluate the validity of automated speech scores if these aspects of speaking proficiency are an important component of the construct they want to assess. This is especially true when considering an automated speech scoring system for assessments that inform high-stakes decisions. The optimal solution may, in some cases, be a hybrid approach that combines the strengths of automated and human scoring. Also, since these systems are likely to improve over time, it will be important to reevaluate them on a recurring basis.
Keelan Evanini is Director of Research in the Natural Language Processing (NLP) and Speech group, and the Dialogic, Multimodal and Speech (DIAMONDS) group at ETS's R&D division. Spiros Papageorgiou is a Managing Senior Research Scientist in the English Language Learning and Assessment (ELLA) group at ETS's R&D division. Klaus Zechner is a Managing Senior Research Scientist in the NLP and Speech group, and the DIAMONDS group.
Recent ETS Research Reports on English-language assessment
- Evanini, K., Hauck, M. C., & Hakuta, K. (2017). Approaches to Automated Scoring of Speaking for K–12 English Language Proficiency Assessments (ETS Research Report No. RR-17-18).
- Suendermann-Oeft, D., Ramanarayanan, V., Yu, Z., Qian, Y., Evanini, K., Lange, P., Wang, X., & Zechner, K. (2017). A Multimodal Dialog System for Language Assessment: Current State and Future Directions (ETS Research Report No. RR-17-21).