How to Evaluate Academic English-language Tests
1. Is scoring best done by human raters or an automated process?
A combination of rating methods is best for getting a complete and accurate picture of a test taker's ability. Automated scoring models can assess certain linguistic dimensions of language use, but they do not measure the effectiveness of the response or the appropriateness of its content.
Human raters are also needed to attend to a wider variety of features, such as response effectiveness, the quality of ideas and content, and form. Prompts designed for fully automated scoring have historically been found to be more vulnerable to prompt-specific preparation and memorized responses.
The TOEFL® test uses automated scoring to complement human scoring for the two tasks in the Writing section. Combining human judgment and automated scoring ensures consistent, high-quality measurement of a test taker's ability to communicate and succeed in academic settings.
2. Are human raters calibrated and monitored frequently for quality control?
Raters should be trained extensively, pass a certification test and be calibrated daily. Calibration should include task familiarization, guidance on scoring the task and practice on various responses. The TOEFL Speaking and Writing sections are scored by multiple rigorously trained raters. ETS scoring leaders continuously monitor the raters' work for accuracy and check it each time a rater scores a new test question.
3. Is rating kept separate to ensure secure, fair and objective scoring?
To ensure security and integrity, scoring should be separate from the test administration process and conducted through a centralized scoring network that implements and ensures consistent scoring standards.
The TOEFL test is scored by a network of raters, carefully controlled from a secure central location. ETS uses a highly diverse pool of raters rather than raters drawn exclusively from a test taker's country of origin. To maintain objectivity, ETS raters score responses anonymously. To minimize rater bias, multiple raters' judgments contribute to each test taker's speaking and writing scores.
4. Is the test based on extensive research to establish validity?
A test's validity is established by extensive research evidence that supports its intended use. This evidence is collected through studies of test content, scoring processes, relationships to other measures of proficiency and the test's impact on teaching and learning English.
For more than 50 years, ETS has conducted ongoing research to ensure test quality. We have published more than 240 peer-reviewed research reports, books, journal articles and book chapters to support the TOEFL test design and validity.
5. Do the test tasks simulate academic settings?
It is important to confirm that the test tasks reflect true academic contexts. If not, the test scores should not be used for admissions decisions. The TOEFL test contains purely academic content and tasks created by working with experts in higher education to simulate university life and coursework and to identify the English-language demands faced by non-native English speakers.
6. Are there enough international test facilities to provide a large, diverse applicant pool?
The best way to ensure educational institutions have access to a diverse pool of applicants is to have testing facilities available all over the world. More than 30 million people have taken the TOEFL test at ETS-approved test centers around the world, providing a highly diverse applicant pool. The quality of the TOEFL scoring process provides a common measure for comparing the qualifications of applicants from many different backgrounds and cultures.
ETS has long been at the forefront of addressing test security concerns. Our strategy is a three-pronged approach of prevention, detection and communication, designed to protect the integrity of test scores.