How ETS Scores the TOEFL iBT® Test

People in this Video:

Susan Hines, Senior Director at ETS
Pablo Garcia Gomez, Assessment Specialist at ETS
Jonathan Schmidgall, Associate Research Scientist at ETS

Transcript Body

Susan Hines, Senior Director - At Educational Testing Service we're dedicated to advancing quality in education for all learners in every corner of the world, and to giving them the opportunity to realize their potential. We have an uncompromising commitment to rigorous research, and to ensuring that we design, develop, and administer assessments that are fair and valid. One of these assessments is the TOEFL iBT® test. The TOEFL® test measures the ability of non-native English speakers to communicate in English, as it's read, written, heard, and spoken in an academic setting.

A test taker's TOEFL score is based on their performance on the whole test. The Reading and Listening sections of the test contain multiple-choice tasks for which test takers must select the correct answer. The Speaking & Writing sections contain constructed-response tasks, to which test takers respond by speaking into a microphone, or by writing about a specific topic. In these tasks students demonstrate their ability to speak and write effectively in an academic setting. Students and teachers frequently have questions about the Speaking & Writing tasks, and how the responses are scored.

Animated character 1 - What are the tasks I need to complete in the TOEFL iBT test, and how can I improve my score on the Speaking & Writing tasks?

Animated character 2 - How are the responses scored and how does ETS ensure consistency?

Animated character 3 - What tools can I use in the classroom to help my students improve their speaking and writing skills?

Susan - In this video you'll learn about the speaking and writing tasks on the TOEFL iBT test, the TOEFL iBT approach to rating test taker responses to those tasks, the tools used to rate responses, and, finally, how scoring tools can be used as classroom tools to help students improve their speaking and writing skills.

Pablo Garcia Gomez, Assessment Specialist - The Constructed Response part of the TOEFL iBT test consists of two sections: the Speaking section and the Writing section. The Speaking section includes six tasks. Test takers are asked to speak on a variety of topics, and to draw on personal experience, campus-based situations, and academic content. The first two questions are called Independent Speaking Tasks. They require test takers to draw entirely on their own ideas, opinions, and experiences when they respond. The other four questions are called Integrated Speaking Tasks. They require test takers to integrate their English language skills by listening and speaking, or listening, reading, and speaking, just as they would during class and outside the classroom. The Speaking section is approximately 20 minutes long.

The Writing section consists of two tasks. In the first task, which is called the integrated task, test takers read a passage on a given academic topic and then they listen to a lecture on the same topic. After that, they write a summary that integrates what they read and heard. Test takers have 20 minutes to complete this task. The second task is the independent task. Test takers write an essay based on their own knowledge and experience. They are asked, for example, to state an opinion or choice and support it. Test takers have 30 minutes to complete this task.

Jonathan Schmidgall, Associate Research Scientist - To measure how well someone can do something, and then describe that performance, it's important to design tasks that closely resemble what people need to do in the real world. That's why when the TOEFL test was designed, ETS researchers studied what students actually do in the university setting. This is called Real World Task Analysis. The task analysis is followed by an analysis of the language skills needed to complete the real world tasks. We use this information to create tasks for the test that reflect what students do in an academic setting, to ensure that these test tasks elicit the same language and communication skills needed in the real academic world.

On-screen: [Split screen with two separate animated videos. Video 1 shows four students sitting around a table in a library and talking. Video 2 shows a test taker with headset and microphone working at a computer.]

Jonathan - This is important, because it helps strengthen inferences made about how well someone can perform in the real world based on the results of the test. And by preparing to do well on the test, students are actually preparing to do well in the real world, which has a big positive impact on teaching and learning. These processes for deciding what to include in a test and how to design test tasks are part of an overall framework called Evidence-Centered Design, or ECD, that was used for the TOEFL iBT test. ECD is a very rigorous design process that, among other things, ensures that claims based on the results of a test are based on the best evidence that can be gathered through the test taker's performance on the test.

On-screen: [Real World Task Analysis.
Analysis of language, and language skills needed to complete the real world tasks.
Evidence-Centered Design claims.
Design of assessment tasks.]

Pablo - TOEFL researchers also studied the features of a language needed to complete the tasks successfully. This informed the design and development of the TOEFL iBT scoring guides, also referred to here as scoring rubrics. This is a scoring guideline with descriptors. It describes performance at different levels of proficiency. This Speaking Scoring Rubric has a general description followed by three scoring dimensions: Delivery, Language Use, and Topic Development.

On-screen: [Sample scoring guideline titled "Integrated Speaking Rubrics" is a table with five headers: Score, General Description, Delivery, Language Use, and Topic Development.]

Pablo - All of the TOEFL iBT scoring guides are holistic rubrics, which means that the level of performance is determined by assessing performance across multiple criteria as a whole. The question we're trying to answer is: Across these dimensions, how well did the test taker complete the task? This is in contrast to an analytic rubric, which articulates levels of performance for each criterion.

The general description is the holistic description of speaking ability at the four score levels or bands. For the Delivery dimension, we consider how pronunciation, intonation, and pacing contribute to overall intelligibility. In the Language Use dimension, we consider both the range and the accuracy of the grammar and vocabulary used to complete the task. And lastly, for the Topic Development dimension, we focus on whether the test taker has demonstrated an understanding of the content from the listening and/or reading stimuli, and made appropriate connections to convey relevant information.

The descriptors on the integrated speaking scoring rubric refer to how well the test taker integrates the information that was heard and read into their responses, so we are evaluating the accuracy and relevance of the task-specific content of the responses. Another critical tool used for scoring is benchmark responses. A benchmark response is a real test taker response to a specific test task that serves as a point of reference for the Raters. The benchmark responses help the ETS Raters make rating decisions about speaking and writing responses, by providing examples of what a Level 4 looks like or sounds like, or what a Level 2 looks like or sounds like.

The scoring processes at ETS are intentionally transparent. We want all score users and test takers to know who scores the responses and understand how they are scored. All the processes in place are to ensure the highest scoring quality. Here's the process for becoming a TOEFL Rater. Raters with the required qualifications are recruited to score either the Speaking section or the Writing section. Every Rater has to complete an intensive ETS training, and pass either a Speaking or a Writing certification test to qualify him or her to score. A Rater cannot score both sections. Further, ETS requires an additional score quality measure by requiring every certified Rater to pass a daily calibration test before he or she begins each scoring session. This daily calibration test assesses a Rater's readiness to score for that specific day. Raters who do not pass the calibration test are not allowed to score on that day. Additionally, a Rater who has not scored in four months or more is required to re-certify to maintain certification for the TOEFL iBT test.
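
The eligibility rules described above can be illustrated with a minimal sketch. This is not ETS software. The names Rater, months_between, and may_score are hypothetical, and counting "four months" in whole calendar months is an assumption made only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Rater:
    # Hypothetical record; the fields mirror the rules stated in the transcript.
    certified_section: str          # "Speaking" or "Writing" (never both)
    last_scored: date               # most recent day this Rater scored
    passed_calibration_today: bool  # result of the daily calibration test


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed (assumed reading of 'four months')."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1
    return months


def may_score(rater: Rater, section: str, today: date) -> bool:
    """Return True if the Rater may score the given section today."""
    if rater.certified_section != section:
        return False  # certified for the other section only
    if months_between(rater.last_scored, today) >= 4:
        return False  # must re-certify before scoring again
    return rater.passed_calibration_today  # daily calibration gate
```

In this sketch, a Rater who fails any one of the three checks is simply not allowed to score that day, which matches the order of safeguards described in the transcript.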

All six questions in the Speaking section, and both questions in the Writing section, are scored individually by Raters who work independently and anonymously. This unique scoring design minimizes individual Rater bias and therefore supports the highest score reliability. For Speaking, a minimum of three Raters is required to contribute scores for one individual test taker. However, it may be possible for a different Rater to assign scores for every test question. This means that, for TOEFL Speaking, anywhere from three to six different Raters contribute to the Speaking section score for an individual test taker.
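
The three-to-six range for Speaking can be expressed as a simple check. The sketch below is only an illustration under stated assumptions: the function name valid_speaking_assignment and the use of string identifiers for Raters are hypothetical, not part of any ETS system.

```python
def valid_speaking_assignment(rater_ids):
    """Check one test taker's Speaking assignment.

    rater_ids holds one (hypothetical) Rater identifier per Speaking
    response, so it must contain exactly six entries. The assignment is
    valid when at least three distinct Raters contribute; six distinct
    Raters is the upper limit, because there are only six responses.
    """
    if len(rater_ids) != 6:
        raise ValueError("expected one Rater for each of the six responses")
    return len(set(rater_ids)) >= 3
```

For example, an assignment in which three Raters each score two responses passes the check, while an assignment in which one Rater scores all six does not.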

For Writing, each response receives one human rating and one automated rating, which is given by the e-rater automated scoring engine. This method combines the judgment of humans for content and meaning, and the consistency of automated scoring for linguistic features. For both Speaking and Writing, some more experienced Raters are asked to take on leadership roles to monitor Raters' scoring quality. One of their responsibilities is to make sure that Raters are strictly following the scoring guidelines and program protocols. Raters and scoring leaders communicate by phone regularly throughout the scoring session, which means Raters receive specific feedback about their scoring performance during every scoring session, in real time. The same people who write the questions are also involved in monitoring the scoring process. An ETS content expert is available to answer questions and provide direction when needed for every scoring session.

Susan - Knowing about the speaking and writing task types on the TOEFL iBT test, and criteria used to evaluate the responses, can help teachers focus on building their students' skills. Using the scoring guides as a foundation and modifying them can be helpful for monitoring student progress, and providing students with feedback about specific areas to improve. Teachers can create rubrics or checklists with criteria for success. Rubrics can help move learning forward. They can be used as a tool to clarify and understand learning goals, to answer the question, "Where am I going?"

They can also be used as a tool for descriptive feedback, to answer the questions, "Where am I now?" and "How can I close the gap?" Giving students feedback about the strengths and weaknesses of their performance is much more likely to help them improve than a single number score or grade. Finally, examples of good speaking and writing responses can help students get a clear picture of the expectations for each task. Thank you for joining us. We hope we were able to provide you with some helpful insights into the scoring of the TOEFL iBT test.

On-screen: [https://www.ets.org/toefl/teachers_advisors].

End of How ETS Scores the TOEFL iBT® Test video.

Video duration: 11:38.