How to Evaluate Academic English-language Tests
1. Is scoring best done by human raters or an automated process?
A combination of rating methods is the best way to get a complete and accurate picture of a test taker's ability. Automated scoring can assess certain linguistic dimensions of language use, but it does not measure the effectiveness of the response or the appropriateness of the content.
Human raters are also needed to attend to a wider variety of features, such as response effectiveness, quality of ideas, and content as well as form. Tasks designed for fully automated scoring have historically been found to be more vulnerable to prompt-specific preparation and memorized responses.
The TOEFL iBT® test uses AI scoring to complement human scoring for the tasks in the Speaking and Writing sections. Combining human judgment and AI scoring ensures consistent, high-quality measurement of a test taker's ability to communicate and succeed in academic settings.
2. Are human raters calibrated and monitored frequently for quality control?
Raters should be trained extensively, pass a certification test and be calibrated daily. Calibration should include task familiarization, guidance on scoring the task and practice scoring a range of responses.
The TOEFL® Speaking and Writing sections are scored using a combination of AI scoring and multiple, rigorously trained raters. The raters' work is continuously monitored for accuracy by ETS scoring leaders and checked each time they score a new test question.
3. Is rating kept separate from test administration to ensure secure, fair and objective scoring?
To ensure security and integrity, scoring should be separate from the test administration process and conducted through a centralized scoring network that implements and ensures consistent scoring standards.
The TOEFL test is scored by a combination of AI scoring and a network of raters, carefully controlled from a secure central location. ETS uses a highly diverse pool of raters rather than raters drawn exclusively from a test taker's country of origin. To maintain objectivity, ETS raters score responses without knowing the identity of the test taker. To minimize rater bias, multiple raters' judgments contribute to each test taker's Speaking and Writing scores.
4. Is the test based on extensive research to establish validity?
A test's validity is established through extensive research evidence that supports its intended use. This evidence is collected through studies of test content, scoring processes, relationships to other measures of proficiency, and the test's impact on teaching and learning English.
For more than 50 years, ETS has conducted ongoing research to ensure the quality of the TOEFL test. We have published more than 240 peer-reviewed research reports, books, journal articles and book chapters to support test design and validity.
5. Do the test tasks simulate academic settings?
It is important to confirm that test tasks reflect true academic contexts. If they do not, test scores should not be used for admissions decisions.
The TOEFL test contains purely academic content. Its tasks are created in collaboration with experts in higher education to identify the English-language demands faced by non-native English speakers and to simulate university life and coursework.
6. Are there enough international test facilities to provide a large, diverse applicant pool?
The best way to ensure educational institutions have access to a diverse pool of applicants is to have testing facilities available globally.
More than 35 million test takers have taken the TOEFL test at ETS-approved test centers around the world, providing a highly diverse applicant pool. The quality of the TOEFL scoring process provides a common measure for comparing the qualifications of applicants from many different backgrounds and cultures.
ETS has long been at the forefront of addressing test security concerns. Our strategy is a three-pronged approach of prevention, detection and communication, designed to protect the integrity of test scores.