
Using Language Proficiency Standards to Interpret Test Scores

Focus on R&D

Issue 13

May 2019


By: Hans Sandberg

Does a test score of, let's say, 70 mean that a nonnative English speaker is ready for a university where English is the primary language of instruction? How about participating in a business meeting conducted in English? Focus on ETS R&D talked to four ETS experts involved with research on how tests in the TOEFL® and TOEIC® families of English-language assessments align with curricula and educational standards, such as the Common European Framework of Reference (CEFR) and China's Standards of English (CSE).

Why align assessments, like the TOEFL and TOEIC tests, with external standards and curricula?

Richard Tannenbaum: Test scores do not always give you the kind of information you need to make an informed decision. For example, what does it mean that one learner scored 70 and another got 65? It is safe to say that the first scored higher than the second, but we would typically not think of either result as good if the score scale for the test ranged from 0 to 200. Both would, on the other hand, have done well if the scale went from 0 to 80 and the average was 50. This shows that you need context to interpret the meaning of test results. Then we have the question of what a score means outside of the test. What does it mean if a learner scores 20 on the TOEFL iBT® speaking test, whose scores can range from 0 to 30? Does that allow us to assume that she is able to engage in a conversation with a native English speaker, or make a short presentation on an unfamiliar topic? It's hard to know simply by looking at the test score.
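The point about context can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, and treating a score as a fraction of its scale is a deliberate simplification, not a description of how any testing program reports or interprets scores.

```python
def fraction_of_scale(score: float, scale_min: float, scale_max: float) -> float:
    """Return where a score falls on its scale, from 0.0 (bottom) to 1.0 (top)."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    if not scale_min <= score <= scale_max:
        raise ValueError("score lies outside the scale")
    return (score - scale_min) / (scale_max - scale_min)


# The same raw score of 70 on the two scales mentioned in the example above.
print(fraction_of_scale(70, 0, 200))  # 0.35
print(fraction_of_scale(70, 0, 80))   # 0.875
```

Even this normalized number says nothing about what the learner can actually do with the language, which is the gap that mapping scores to an external standard is meant to address.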

Veronika Laughlin: That's right! Score numbers do not by themselves tell test users what students can do, but there are many things we can do to explain and communicate the meaning of test scores. Besides, many test users already have a basic understanding of the language proficiency levels used by international and local standards. Mapping TOEFL or TOEIC test scores to international and local standards, or frameworks such as the CEFR, the Canadian Language Benchmarks (CLB) and the CSE, makes it easier to compare test scores to proficiency levels that test users are familiar with. Take the CEFR, for example: it has six major language proficiency levels, A1 and A2 (Basic), B1 and B2 (Independent), and C1 and C2 (Proficient).

To reach the B2 level, a student must be able to:

  • give clear, systematically developed descriptions and presentations, with the appropriate highlighting of significant points and relevant supporting details
  • give clear, detailed descriptions and presentations on a wide range of subjects related to his/her field of interest, expanding and supporting ideas with subsidiary points and relevant examples

If we want to map the TOEFL iBT speaking test scores to the CEFR levels, we need to know what level of proficiency a particular test score corresponds to. Does a score of 20 on the TOEFL iBT test mean that you have reached level B2 on the CEFR? Linking the test results to levels on an external standard gives students and teachers a better chance to understand what the test scores mean.

If you look at it from a broader perspective, you could say that score mapping supports ETS's mission, since it provides a more appropriate and fairer use of the test results than if we had to rely on numeric scores alone.

The illustration shows four colored arrows pointing to the right. They represent different TOEFL tests mapped on a grid that shows three proficiency levels according to the Common European Framework of Reference (CEFR). The three levels are indicated in the first vertical column on the left side: Basic, Independent and Proficient. Each level is split into two, which are displayed in a second column: A1, A2, B1, B2, C1, C2, with A1 at the bottom and C2 at the top. At the bottom of the illustration is a row indicating the age of test takers: 8+ years, 11+ years and 16+ years. The first arrow is blue and sits on top of a column marked 8+ years. Vertically, the arrow covers A1, A2 and B1. The next arrow is green and sits on top of a column marked 11+ years. The green arrow covers A2, B1, B2 and C1. The third arrow is yellow, and to its right is a brown arrow. Both the yellow and the brown arrows sit on top of a column marked 16+ years. They both cover A2, B2 and C1.

Figure 1: Chart showing how the TOEFL Family of Assessments relate to the Common European Framework of Reference (CEFR).

How do you make sure that this research is done accurately and efficiently?

Richard Tannenbaum: To map test scores to external frameworks, we need to agree on what score a student must reach to be associated with different levels. The process of defining the requirements for a score level, or a grade, is called standard setting. It seeks to identify the lowest acceptable test score, also called the cut score, for a level of proficiency. Let's say that we want to map test scores to two levels of an external framework, such as CEFR's levels B1 and B2 for reading. Each level must then be defined operationally, with an emphasis on the specific reading skills needed to match the requirements of first B1 and then B2. Both levels reflect a range of reading skill proficiencies, going from beginning to advanced within each level, but the standard setting we use to map different tests only needs to focus on the very beginning of each range. All we need to know is that a student made the cut for a level.
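Once cut scores exist, applying them is a simple lookup: a score belongs to the highest level whose cut score it meets or exceeds. The sketch below illustrates that idea only. The cut scores 10 and 20, the names `CUT_SCORES` and `level_for`, and the "below B1" label are all invented for this example and are not the results of any standard-setting study.

```python
from bisect import bisect_right

# Hypothetical cut scores for a reading test, in ascending order.
# These numbers are placeholders, not real score-mapping results.
CUT_SCORES = [(10, "B1"), (20, "B2")]


def level_for(score, cut_scores=CUT_SCORES, below="below B1"):
    """Return the highest level whose cut score the given score meets or exceeds."""
    cuts = [cut for cut, _ in cut_scores]
    # bisect_right counts how many cut scores are <= score.
    reached = bisect_right(cuts, score)
    if reached == 0:
        return below
    return cut_scores[reached - 1][1]


for s in (9, 10, 19, 20):
    print(s, level_for(s))
```

Note that a score exactly at a cut (10 or 20 here) counts as having reached the level, which mirrors the focus on the very beginning of each range: the lookup only asks whether the student made the cut, not how far into the level the student is.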

The standard-setting process involves panels of experts who help define the beginning of each level in terms of the skills required to have reached the level. A student who has made the cut is described as the "just qualified candidate" or the "borderline student." Different approaches to standard setting can be used depending on the type of test items we have — e.g., multiple choice, constructed response, or performance based — and on the amount of test data we have available. It typically takes multiple rounds of this process, where we provide the panel with feedback and help facilitate its discussions between the rounds. This interactive process is important since it helps us make sure that the panelists are making informed decisions about where it is reasonable to locate each cut score.

Are you collaborating with people outside ETS when doing research on the alignment of our international tests?

Veronika Laughlin: Yes, we have collaborated with external test users and stakeholders to conduct score mapping studies. For example, we have done a series of studies involving experienced teachers, as well as performance data collected from actual test takers, and then mapped TOEFL Junior® scores to different levels of the CEFR.

Ching-Ni Hsieh: We have also conducted similar research related to local language standards and frameworks. In late 2017, we began to work with the National Education Examinations Authority (NEEA) of the Ministry of Education in China. The goal is to explore how tests in the TOEFL® Family of Assessments relate to different levels of the CSE. Our partners are also interested in aligning the CSE with international tests. The collaboration also helps Chinese test takers understand ETS's English-language tests and promotes research on foreign language ability standards and assessments in China.

Spiros Papageorgiou: We are also engaging local educators in different countries for this kind of research. We should note that alignment in the United States typically means content alignment, while it can mean both content alignment and score mapping in Europe. In a recent project, we explored how the content of TOEFL Junior aligns with the English as a Foreign Language (EFL) curriculum in Berlin, Germany. Such alignment work is similar to score mapping in that it focuses on how useful a test is to teachers, students and decision makers. In other words, a language test is more likely to be beneficial if its content is relevant to what is taught in the classroom. We also make our technical reports freely available so that all stakeholders are informed about the results.

What challenges have you faced in doing this research and sharing the results?

Ching-Ni Hsieh: One of the main challenges for us researchers is to implement a systematic and carefully facilitated standard-setting workshop to ensure the quality and meaningfulness of the outcomes. There is no objectively "correct cut score," since the standard setting behind the score mapping is based on expert judgment. The lack of a correct cut score places a significant burden on the researcher who designs and implements the workshop. Even if two researchers use the same standard-setting method for their score-mapping studies, there may still be variations in how they implement it: how they define the "just qualified candidate," what data they use and how they use it, and what facilitation they provide.

Spiros Papageorgiou: When we map scores from different tests to the same external proficiency levels, there is a risk that some score users may assume that the tests are similar in terms of content or difficulty, or both, and that it doesn't matter which test you are using as long as they are mapped onto the same external proficiency levels. Most of the time, such assumptions couldn't be further from the truth. Take two EFL tests as an example, one designed for young learners in middle schools and one intended for adults in the workplace. The two tests could be covering different aspects of the language proficiency expected at the same level, for example CEFR Level B1. The contents of these tests are different on purpose, since each is designed for its own population. Therefore, the scores cannot be interpreted as if they reflect exactly the same aspects of the B1 level.

A second challenge is that score users may take the score-mapping studies as evidence of the validity of the test scores — but that is not always the case. If a language test is of questionable technical quality and it lacks enough evidence for the validity of its scores, then any attempt to conduct score mapping is a "wasted enterprise," according to the Council of Europe's score-mapping Manual. It is therefore critical that we explain to all stakeholders that although score mapping can facilitate score interpretation, it does not prove that the test scores are valid.

Find out more about the various tests offered through the TOEIC program and TOEFL Family of Assessments.

Ching-Ni Hsieh, Veronika Laughlin, Spiros Papageorgiou and Richard J. Tannenbaum work in ETS's Research & Development division. Hsieh and Laughlin are research scientists and Papageorgiou is a managing senior research scientist at the English Language Learning and Assessment (ELLA) Center. Tannenbaum is a general manager in R&D.
