By: Hans Sandberg
Does a test score of, let's say, 70 mean that a nonnative English speaker is ready for a university where English is the primary language of instruction? How about participating in a business meeting conducted in English? Focus on ETS R&D talked to four ETS experts involved with research on how tests in the TOEFL® and TOEIC® families of English-language assessments align with curricula and educational standards, such as the Common European Framework of Reference (CEFR) and China's Standards of English (CSE).
Why align assessments, like the TOEFL and TOEIC tests, with external standards and curricula?
Richard Tannenbaum: Test scores do not always give you the kind of information you need to make an informed decision. For example, what does it mean that one learner scored 70 and another got 65? It is safe to say that the first scored higher than the second, but we would typically not consider either result good if the score scale for the test ranged from 0 to 200. Both would, on the other hand, have done well if the scale went from 0 to 80 and the average was 50. This shows that you need context to interpret the meaning of test results. Then we have the question of what a score means outside of the test. What does it mean if a learner scores 20 on the TOEFL iBT® speaking test, whose scores can range from 0 to 30? Does that allow us to assume that she is able to engage in a conversation with a native English speaker, or make a short presentation on an unfamiliar topic? It's hard to know simply by looking at the test score.
Veronika Laughlin: That's right! Score numbers do not by themselves tell test users what students can do, but there are many things we can do to explain and communicate the meaning of test scores. Besides, many test users already have a basic understanding of different language proficiency levels used by international and local standards. Mapping TOEFL or TOEIC test scores to international and local standards, or frameworks such as the CEFR, the Canadian Language Benchmarks (CLB) and the CSE, makes it easier to compare test scores to proficiency levels that test users are familiar with. Take the CEFR, for example — it has six major language proficiency levels: A1 and A2 (Basic), B1 and B2 (Independent), C1 and C2 (Proficient).
To reach the B2 level, a student must be able to:
- give clear, systematically developed descriptions and presentations, with the appropriate highlighting of significant points and relevant supporting details
- give clear, detailed descriptions and presentations on a wide range of subjects related to his/her field of interest, expanding and supporting ideas with subsidiary points and relevant examples
If we want to map the TOEFL iBT speaking test scores to the CEFR levels, we need to know what level of proficiency a particular test score corresponds to. Does a score of 20 on the TOEFL iBT test mean that you have reached level B2 on the CEFR? Linking the test results to levels on an external standard gives students and teachers a better chance to understand what the test scores mean.
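To make the idea concrete, here is a minimal Python sketch of what a score-to-level lookup amounts to once cut scores have been established for a 0–30 speaking scale. The cut scores in the table are invented placeholders for illustration only, not ETS's published concordances.

```python
# Hypothetical cut scores for illustration only -- not ETS's published values.
# Each entry gives the lowest ("cut") score needed to reach that CEFR level,
# ordered from the highest level down.
CUT_SCORES = [
    ("C1", 25),
    ("B2", 20),
    ("B1", 16),
    ("A2", 10),
]

def cefr_level(score):
    """Return the highest CEFR level whose cut score the test score meets."""
    for level, cut in CUT_SCORES:
        if score >= cut:
            return level
    return "below A2"
```

Under these placeholder cut scores, a speaking score of 20 would map to B2, while 19 would map only to B1; the substance of a mapping study lies in justifying where each cut falls.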
If you look at it from a broader perspective, you could say that score mapping supports ETS's mission, since it provides a more appropriate and fairer use of the test results than if we had to rely on numeric scores alone.
Figure 1: Chart showing how the TOEFL Family of Assessments relate to the Common European Framework of Reference (CEFR).
How do you make sure that this research is done accurately and efficiently?
Richard Tannenbaum: To map test scores to external frameworks, we need to agree on what score a student must reach to be associated with different levels. The process of defining the requirements for a score level, or a grade, is called standard setting. It seeks to identify the lowest acceptable test score, also called the cut score, for a level of proficiency. Let's say that we want to map test scores to two levels of an external framework, such as CEFR's levels B1 and B2 for reading. Each level must then be defined operationally, with an emphasis on the specific reading skills required first at B1 and then at B2. Both levels reflect a range of reading skill proficiencies, going from beginning to advanced within each level, but the standard setting we use to map different tests only needs to focus on the very beginning of each range. All we need to know is that a student made the cut for a level.
The standard-setting process involves panels of experts who help define the beginning of each level in terms of the skills required to have reached the level. A student who has made the cut is described as the "just qualified candidate" or the "borderline student." Different approaches to standard setting can be used depending on the type of test items we have — e.g., multiple choice, constructed response, or performance based — and on the amount of test data we have available. It typically takes multiple rounds of this process, where we provide the panel with feedback and help facilitate its discussions between the rounds. This interactive process is important since it helps us make sure that the panelists are making informed decisions about where it is reasonable to locate each cut score.
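As a rough illustration of one widely used approach, the sketch below implements the core arithmetic of a modified Angoff round: each panelist judges, item by item, the probability that a "just qualified candidate" would answer correctly, and the panel's averaged sums yield a recommended cut score. The panelists and ratings here are invented; real studies involve multiple rounds, feedback on panelist agreement, and impact data.

```python
# A minimal sketch of one round of a modified Angoff standard setting,
# assuming a short test of dichotomously scored (right/wrong) items.
# All ratings are invented for illustration.

ratings = {
    "panelist_1": [0.9, 0.7, 0.5, 0.8, 0.6],
    "panelist_2": [0.8, 0.6, 0.4, 0.9, 0.7],
    "panelist_3": [0.85, 0.65, 0.45, 0.85, 0.65],
}

def recommended_cut_score(ratings):
    """Sum each panelist's item probabilities, then average across the panel.

    Each per-panelist sum is that panelist's expected raw score for a
    just qualified candidate; the panel mean is the recommended cut score.
    """
    per_panelist = [sum(item_probs) for item_probs in ratings.values()]
    return sum(per_panelist) / len(per_panelist)

cut = recommended_cut_score(ratings)
```

Between rounds, facilitators would show panelists how their individual sums compare to the panel mean (and to actual test-taker data, where available) before they revise their ratings.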
Are you collaborating with people outside ETS when doing research on the alignment of our international tests?
Veronika Laughlin: Yes, we have collaborated with external test users and stakeholders to conduct score mapping studies. For example, we have done a series of studies involving experienced teachers, as well as performance data collected from actual test takers, and then mapped TOEFL Junior® scores to different levels of the CEFR.
Ching-Ni Hsieh: We have also conducted similar research related to local language standards and frameworks. In late 2017, we began to work with the National Education Examinations Authority (NEEA) of the Ministry of Education in China. The goal is to explore how tests in the TOEFL® Family of Assessments relate to different levels of the CSE. Our partners are also interested in aligning the CSE with international tests. The collaboration also helps Chinese test takers understand ETS's English-language tests and promotes research on foreign language ability standards and assessments in China.
Spiros Papageorgiou: We are also engaging local educators in different countries for this kind of research. We should note that alignment in the United States typically means content alignment, while it can mean both content alignment and score mapping in Europe. In a recent project, we explored how the content of TOEFL Junior aligns with the English as a Foreign Language (EFL) curriculum in Berlin, Germany. Such alignment work is similar to score mapping in that it focuses on how useful a test is to teachers, students and decision makers. In other words, a language test is more likely to be beneficial if its content is relevant to what is taught in the classroom. We also make our technical reports freely available so that all stakeholders are informed about the results.
What challenges have you faced in doing this research and sharing the results?
Ching-Ni Hsieh: One of the main challenges for us researchers is to implement a systematic and carefully facilitated standard-setting workshop to ensure the quality and meaningfulness of the outcomes. There is no objectively "correct" cut score, since the standard setting behind the score mapping is based on expert judgment. The lack of a correct cut score places a significant burden on the researcher who designs and implements the workshop. Even if two researchers implement the same standard-setting method for their score-mapping study, there may still be variations in how they define the "just qualified candidate," what data they use and how, and what facilitation they provide.
Spiros Papageorgiou: When we map scores from different tests to the same external proficiency levels, there is a risk that some score users may assume that the tests are similar in terms of content or difficulty, or both, and that it doesn't matter which test you use as long as they are mapped onto the same external proficiency levels. Most of the time, such assumptions couldn't be further from the truth. Take two EFL tests as an example, one designed for young learners in middle schools and one intended for adults in the workplace. Both tests could cover different aspects of the language proficiency expected at the same level (for example, CEFR Level B1). Because they are intended for different populations, their contents differ by design, so the scores cannot be interpreted as if they reflect exactly the same aspects of the B1 level.
A second challenge is that score users may take the score-mapping studies as evidence of the validity of the test scores — but that is not always the case. If a language test is of questionable technical quality and it lacks enough evidence for the validity of its scores, then any attempt to conduct score mapping is a "wasted enterprise," according to the Council of Europe's score-mapping Manual. It is therefore critical that we explain to all stakeholders that although score mapping can facilitate score interpretation, it does not prove that the test scores are valid.
Ching-Ni Hsieh, Veronika Laughlin, Spiros Papageorgiou and Richard J. Tannenbaum work in ETS's Research & Development division. Hsieh and Laughlin are research scientists and Papageorgiou is a managing senior research scientist at the English Language Learning and Assessment (ELLA) Center. Tannenbaum is a general manager in R&D.