TOEIC® scores are consistent and reliable.
Evidence: The research in this section demonstrates how TOEIC Program Research helps ensure that scores are not improperly influenced by aspects of the testing procedure that are unrelated to language ability. When examining score consistency or reliability, multiple aspects of the testing procedure are considered, including:
- test items (internal consistency)
- test forms (equivalence)
- test occasions or administrations (stability)
- raters (inter- and intra-rater reliability)
Monitoring Score Change Patterns to Support TOEIC® Listening and Reading Test Quality
In large-scale, high-stakes testing programs, such as the TOEIC program, some test takers take a test more than once over time. The score change patterns of these so-called "repeaters" can be analyzed to support the overall quality of the test (e.g., its reliability, validity, intended uses). This study examined these score change patterns with the goal of evaluating the reliability and validity of TOEIC® Listening and Reading test scores.
Measuring English-Language Proficiency across Subgroups: Using Score Equity Assessment to Evaluate Test Fairness
English-language proficiency assessments are designed for a targeted test population and may include test takers from diverse demographic, sociocultural and educational backgrounds. A test is assumed to be fair when the scores earned by different subgroups of test takers have the same meaning. One way of evaluating test fairness is to produce a linked test for each subgroup and compare its score results with the scores of the original test those test takers took.
How ETS Scores the TOEIC® Speaking and Writing Test Responses
Typically, human raters are used to score Speaking and Writing tests because of their ability to evaluate a broader range of language performance than automated systems. This paper describes how ETS ensures the reliability and consistency of scores assigned by human raters for TOEIC® Speaking and Writing tests through training, certification, and systematic administrative and statistical monitoring procedures.
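The rater consistency that such monitoring targets is often summarized with simple agreement statistics, such as the proportion of responses on which two raters assign identical (exact) or near-identical (adjacent) scores. A minimal sketch of these two statistics; the score data below are invented for illustration and are not ETS rating data:

```python
# Hypothetical illustration: exact and adjacent agreement between two raters.
# The scores are invented for demonstration only.

def agreement_rates(rater_a, rater_b):
    """Return (exact, adjacent) agreement rates for two lists of integer scores."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n
    return exact, adjacent

# Two raters scoring the same ten responses on a 0-5 scale.
a = [3, 4, 2, 5, 3, 3, 4, 1, 2, 4]
b = [3, 4, 3, 5, 3, 2, 4, 1, 2, 5]
exact, adjacent = agreement_rates(a, b)
print(f"exact agreement:    {exact:.0%}")    # prints 70%  (identical scores)
print(f"adjacent agreement: {adjacent:.0%}")  # prints 100% (within one score point)
```

In operational monitoring, rates like these would be tracked per rater over time, which is one way discrepant raters can be flagged for retraining.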
Linking TOEIC® Speaking Scores Using TOEIC® Listening Scores
In testing programs, multiple forms of a test are used across different administrations to prevent the overexposure of test forms and to reduce the possibility of test takers gaining advance knowledge of test content. Because slight differences may occur in the statistical difficulty of the alternate forms, a statistical procedure, known as test score linking, has been commonly used to adjust for these differences in difficulty so that test forms are comparable.
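Test score linking can take several statistical forms; one of the simplest is linear linking, which rescales new-form scores so their mean and standard deviation match those of a reference form. A minimal sketch under that assumption; the scores are invented, and the TOEIC program's operational linking procedures are more sophisticated than this:

```python
import statistics

# Hypothetical illustration of linear linking: adjust new-form scores so their
# mean and standard deviation match those of a reference form. The numbers are
# invented; operational linking uses far larger samples and richer designs.

def linear_link(new_form_scores, ref_mean, ref_sd):
    """Map each new-form score x to ref_mean + ref_sd * (x - mean) / sd."""
    m = statistics.mean(new_form_scores)
    s = statistics.stdev(new_form_scores)
    return [ref_mean + ref_sd * (x - m) / s for x in new_form_scores]

new_form = [62, 70, 75, 81, 90]  # raw scores on a slightly harder new form
linked = linear_link(new_form, ref_mean=78, ref_sd=12)
print([round(x, 1) for x in linked])
```

After linking, the adjusted scores have exactly the reference form's mean and standard deviation, so a given reported score carries the same meaning regardless of which form was taken.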
Monitoring TOEIC® Listening and Reading Test Performance across Administrations Using Examinees' Background Information
The scoring process for the TOEIC Listening and Reading test includes monitoring procedures that help ensure that scores are consistent across different test forms and test administrations, and that skill interpretations are fair. This study explores the possibility of using information about test takers' backgrounds to enhance several types of monitoring procedures.
Evaluating the Stability of Test Score Means for the TOEIC® Speaking and Writing Tests
For educational tests, it is critical to maintain consistency of score scales and to understand the sources of variation in score means over time. This helps ensure that interpretations about test takers' abilities are comparable from one administration (or form) to another. Using statistical procedures, this study examined the consistency of reported scores for the TOEIC® Speaking and Writing tests.
Comparison of Content, Item Statistics, and Test Taker Performance on the Redesigned and Classic TOEIC® Listening and Reading Test
This paper compares the content, reliability and difficulty of the classic and 2006 redesigned TOEIC Listening and Reading tests. Although the redesigned tests included slightly different item types to better reflect current models of English-language proficiency, the tests were judged to be similar across versions.
Statistical Analyses for the Expanded Item Formats of the TOEIC® Speaking Test
Testing programs should periodically review their assessments to ensure that their test items or tasks are well-aligned with real-world activities. For this reason, to better support communicative language learning and to discourage the use of memorization and other test-taking strategies, ETS expanded the existing format of some items of the TOEIC® Speaking test in May 2015.
Statistical Analyses for the Updated TOEIC® Listening and Reading Test
To ensure that tests continue to meet the needs of test takers and score users, it is important that testing programs periodically revisit their assessments. For this reason, in order to keep up with the continuously changing use of English and the ways in which individuals commonly communicate in the global workplace and everyday life, an updated TOEIC® Listening and Reading test was designed and first launched in May 2016.
The Consistency of TOEIC® Speaking Scores Across Ratings and Tasks
This study examines the consistency of TOEIC Speaking scores. The analysis uses a methodology based on generalizability theory, which allows researchers to examine the degree to which aspects of the testing procedure (i.e., raters, tasks) influence scores. The results contribute evidence to support claims that TOEIC Speaking scores are consistent.
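Generalizability theory partitions observed score variance into components for persons (the variance of interest) and for facets of the procedure such as raters or tasks. A rough, hypothetical sketch of a one-facet (persons crossed with raters) analysis follows; the score matrix is invented and the design is far simpler than the study's actual methodology:

```python
# Hypothetical illustration of a one-facet generalizability analysis
# (persons x raters, fully crossed). The scores are invented.

def g_study(scores):
    """scores[p][r]: score for person p from rater r.
    Returns estimated variance components (persons, raters, residual)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = (ss_tot - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))
    # Negative variance estimates are conventionally truncated at zero.
    return tuple(max(0.0, v) for v in
                 ((ms_p - ms_res) / n_r, (ms_r - ms_res) / n_p, ms_res))

# Five test takers each scored by three raters on a 0-5 scale.
scores = [[2, 3, 2], [4, 4, 5], [3, 3, 3], [1, 2, 2], [5, 4, 5]]
var_p, var_r, var_res = g_study(scores)
n_raters = 3
g_coef = var_p / (var_p + var_res / n_raters)  # generalizability coefficient
print(round(g_coef, 2))  # prints 0.94
```

A coefficient near 1 indicates that most score variance reflects differences among test takers rather than raters, which is the kind of evidence this study marshals for the consistency claim.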
Monitoring Individual Rater Performance for the TOEIC® Speaking and Writing Tests
This paper describes procedures implemented on the TOEIC Speaking and Writing tests for monitoring individual rater performance and enhancing overall scoring quality. These multifaceted, carefully developed procedures help ensure that the potential for human error is kept to a minimum, thereby contributing to the TOEIC tests' scoring consistency.
Alternate Forms Test-Retest Reliability and Test Score Changes for the TOEIC® Speaking and Writing Tests
The reliability or consistency of scores can be examined in a variety of ways, including the degree to which scores for the same test taker are consistent across different test forms (so-called "equivalent forms reliability") and different occasions of testing ("test-retest reliability"). This study examined the consistency of TOEIC Speaking and Writing scores across different test forms at different time intervals (e.g., 1–30 days, 31–60 days) and found that test scores had reasonably high equivalent form test-retest reliability.
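The test-retest consistency described above is typically summarized by correlating the same test takers' scores across the two occasions. A minimal sketch of that computation; the score pairs below are invented, not TOEIC data:

```python
# Hypothetical illustration: alternate-forms test-retest reliability as the
# Pearson correlation between the same test takers' scores on two occasions.
# The score data are invented for demonstration only.

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

form_a = [120, 150, 135, 180, 160, 140]  # first occasion, form A
form_b = [125, 145, 140, 175, 165, 135]  # retest on form B, e.g., 1-30 days later
print(round(pearson(form_a, form_b), 2))  # prints 0.97
```

A high correlation of this kind is what "reasonably high equivalent form test-retest reliability" refers to: test takers keep roughly the same rank order across forms and occasions.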
Statistical Analyses for the TOEIC® Speaking and Writing Pilot Study
This paper reports the results of a pilot study that contributed to TOEIC Speaking and Writing test development. The analysis of the reliability of test scores found evidence of several types of score consistency, including inter-rater reliability (agreement of several raters on a score) and internal consistency (a measure based on correlation between items on the same test).
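Internal consistency of the kind mentioned here is commonly estimated with Cronbach's alpha, which compares the sum of individual item variances to the variance of total scores. A minimal sketch with invented right/wrong item responses (not TOEIC data):

```python
import statistics

# Hypothetical illustration of Cronbach's alpha, a standard internal-consistency
# estimate. Rows are test takers, columns are items; the 0/1 data are invented.

def cronbach_alpha(item_scores):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    k = len(item_scores[0])
    item_vars = [statistics.pvariance([row[i] for row in item_scores])
                 for i in range(k)]
    total_var = statistics.pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(round(cronbach_alpha(responses), 2))  # prints 0.67
```

Higher values (an operational test would use far more items and test takers than this toy example) indicate that the items hang together in measuring a common ability.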
Field Study Results for the Redesigned TOEIC® Listening and Reading Test
This paper describes the results of a field study for the 2006 redesigned TOEIC Listening and Reading tests, including analyses of item and test difficulty, reliability, and correlations of test sections with the classic TOEIC Listening and Reading tests.