TOEIC® Research

Advancing English-language assessment, teaching and learning

Score Consistency

TOEIC scores are consistent and reliable. Research from the TOEIC Research Program helps ensure that scores are influenced only by factors related to language ability. When examining score consistency or reliability, multiple aspects of the testing procedure are considered, including test items, test forms, test occasions or administrations and raters.

Field Study Statistical Analysis for the Redesigned TOEIC Bridge® Tests

This paper reports the results of a field study that contributed to the development of the redesigned TOEIC Bridge tests. The statistical analyses provide initial evidence to support claims that redesigned TOEIC Bridge test scores are consistent, and that test scores are meaningful indicators of English proficiency at basic to intermediate levels.

Read Field Study Statistical Analysis for the Redesigned TOEIC Bridge Tests

Making the Case for the Quality and Use of a New Language Proficiency Assessment: Validity Argument for the Redesigned TOEIC Bridge® Tests

This paper summarizes the "validity argument" for the redesigned TOEIC Bridge tests. The validity argument consists of four major claims about score consistency, validity and fairness, appropriate test use and positive impacts; together, these provide a coherent narrative about the measurement quality and intended uses of test scores. By considering the claims and supporting evidence presented in the validity argument, readers should be able to better evaluate whether the redesigned TOEIC Bridge tests are appropriate for their situation.

Read Making the Case for the Quality and Use of a New Language Proficiency Assessment: Validity Argument for the Redesigned TOEIC Bridge Tests

Monitoring Score Change Patterns to Support TOEIC® Listening and Reading Test Quality

In large-scale, high-stakes testing programs, such as the TOEIC program, some test takers take a test more than once over time. The score change patterns of these so-called "repeaters" can be analyzed to support the overall quality of the test (e.g., its reliability, validity, intended uses). This study examined the aforementioned score change patterns, with the goal of evaluating the reliability and validity of TOEIC® Listening and Reading test scores.

read more about monitoring score change patterns to support TOEIC listening and reading test quality

Measuring English-Language Proficiency across Subgroups: Using Score Equity Assessment to Evaluate Test Fairness

English-language proficiency assessments are designed for a targeted test population and may include test takers from diverse demographic, sociocultural and educational backgrounds. Such a test is assumed to be fair, and the scores earned by different subgroups of test takers are assumed to have the same meaning. One way of evaluating test fairness is to produce a linked test for each subgroup and compare test takers' results on the linked test with their scores on the original test they took.

read more about measuring English-language proficiency across subgroups: using score equity assessment to evaluate test fairness

How ETS Scores the TOEIC® Speaking and Writing Test Responses

Typically, human raters are used to score Speaking and Writing tests because of their ability to evaluate a broader range of language performance than automated systems. This paper describes how ETS ensures the reliability and consistency of scores by human raters for TOEIC Speaking and Writing tests through training, certification, and systematic administrative and statistical monitoring procedures.

read more about how ETS scores the TOEIC speaking and writing test responses

Linking TOEIC® Speaking Scores Using TOEIC® Listening Scores

In testing programs, multiple forms of a test are used across different administrations to prevent overexposure of test forms and to reduce the possibility of test takers gaining advance knowledge of test content. Because slight differences may occur in the statistical difficulty of the alternate forms, a statistical procedure known as test score linking has been commonly used to adjust for these differences in difficulty so that test forms are comparable.

read more about linking TOEIC speaking scores using TOEIC listening scores
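
As a rough illustration of what "test score linking" means in general, the sketch below applies linear (mean-sigma) linking to two made-up sets of scores. It is not the procedure used in the paper, which is not described here. It assumes the two forms were taken by equivalent groups, and the function name `linear_link` and all score values are hypothetical.

```python
from statistics import mean, pstdev

def linear_link(new_form_scores, ref_form_scores):
    """Return a function that maps new-form scores onto the reference-form scale.

    Linear linking matches the mean and standard deviation of the two
    score distributions (assumes the two groups are equivalent in ability).
    """
    mu_x, sd_x = mean(new_form_scores), pstdev(new_form_scores)
    mu_y, sd_y = mean(ref_form_scores), pstdev(ref_form_scores)
    if sd_x == 0:
        raise ValueError("new-form scores have no variance")
    slope = sd_y / sd_x
    return lambda x: mu_y + slope * (x - mu_x)

# Hypothetical raw scores on a new form and on a reference form.
form_x = [10, 12, 14, 16, 18]
form_y = [11, 14, 17, 20, 23]

to_ref_scale = linear_link(form_x, form_y)
linked_mid = to_ref_scale(14)   # the new-form mean maps to the reference-form mean
linked_top = to_ref_scale(18)
```

In this toy data the new form is slightly harder (lower mean, smaller spread), so a raw score on it is adjusted upward when placed on the reference scale.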

Monitoring TOEIC® Listening and Reading Test Performance Across Administrations Using Examinees' Background Information

The scoring process for the TOEIC Listening and Reading test includes monitoring procedures that help ensure that scores are consistent across different forms and test administrations, and that skill interpretations are fair. This study explores the possibility of using information about test takers' backgrounds in order to enhance several types of monitoring procedures. Results of the analyses suggested that some background variables may facilitate the monitoring of test performance across administrations, thereby strengthening quality control procedures for the TOEIC Listening and Reading test as well as strengthening evidence of score consistency.

read more about monitoring TOEIC listening and reading test performance across administrations using examinees' background information

Evaluating the Stability of Test Score Means for the TOEIC® Speaking and Writing Tests

For educational tests, it is critical to maintain consistency of score scales and to understand the sources of variation in score means over time. This helps ensure that interpretations about test takers' abilities are comparable from one administration (or form) to another. Using statistical procedures, this study examined the consistency of reported scores for the TOEIC Speaking and Writing tests.

read more about evaluating the stability of test score means for the TOEIC speaking and writing tests

Comparison of Content, Item Statistics, and Test Taker Performance on the Redesigned and Classic TOEIC® Listening and Reading Test

This paper compares the content, reliability and difficulty of the classic and 2006 redesigned TOEIC Listening and Reading tests. Although the redesigned tests included slightly different item types to better reflect current models of language proficiency, the tests were judged to be similar across versions.

read more about the comparison of content, item statistics and test taker performance on the redesigned and classic TOEIC listening and reading test

Statistical Analyses for the Expanded Item Formats of the TOEIC® Speaking Test

Testing programs should periodically review their assessments to ensure that their test items or tasks are well-aligned with real-world activities. For this reason, to better support communicative language learning and to discourage the use of memorization and other test-taking strategies, ETS expanded the existing format of some items of the TOEIC® Speaking test in May 2015.

read more about Statistical Analyses for the Expanded Item Formats of the TOEIC® Speaking Test

Statistical Analyses for the Updated TOEIC® Listening and Reading Test

To ensure that tests continue to meet the needs of test takers and score users, it is important that testing programs periodically revisit their assessments. For this reason, in order to keep up with the continuously changing use of English and the ways in which individuals commonly communicate in the global workplace and everyday life, an updated TOEIC Listening and Reading test was designed and first launched in May 2016.

read more about Statistical Analyses for the Updated TOEIC® Listening and Reading Test

The Consistency of TOEIC® Speaking Scores Across Ratings and Tasks

This study examines the consistency of TOEIC Speaking scores. The analysis uses a methodology based on generalizability theory, which allows researchers to examine the degree to which aspects of the testing procedure (i.e., raters, tasks) influence scores. The results contribute evidence to support claims that TOEIC Speaking scores are consistent.

read more about the consistency of TOEIC speaking scores across ratings and tasks
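
To make the idea of generalizability theory more concrete, here is a minimal sketch of a one-facet analysis in which every test taker is scored by every rater (a fully crossed persons-by-raters design). This is a simplification chosen for illustration: the study itself considers both raters and tasks, and its actual design is not reproduced here. The function name `g_study` and the ratings are hypothetical.

```python
def g_study(scores):
    """Estimate variance components for a crossed persons x raters design.

    `scores[p][r]` is the rating given to person p by rater r.
    Returns (var_person, var_rater, var_residual), using the usual
    ANOVA expected-mean-square estimators; negative estimates are set to 0.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)
    return var_p, var_r, var_res

def g_coefficient(var_p, var_res, n_raters):
    """Relative generalizability coefficient for a mean over n_raters ratings."""
    return var_p / (var_p + var_res / n_raters)

# Hypothetical ratings: 4 test takers, each scored by the same 2 raters.
ratings = [[3, 3], [2, 3], [4, 4], [1, 2]]
var_p, var_r, var_res = g_study(ratings)
g = g_coefficient(var_p, var_res, n_raters=2)
```

A large person variance relative to the rater and residual variances indicates that score differences mostly reflect differences among test takers rather than who did the rating.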

Monitoring Individual Rater Performance for the TOEIC® Speaking and Writing Tests

This paper describes procedures implemented on the TOEIC Speaking and Writing tests for monitoring individual rater performance and enhancing overall scoring quality. These multifaceted, carefully developed procedures help ensure that the potential for human error is kept to a minimum, thereby contributing to the TOEIC tests' scoring consistency and reliability.

read more about monitoring individual rater performance for the TOEIC speaking and writing tests

Alternate Forms Test-Retest Reliability and Test Score Changes for the TOEIC® Speaking and Writing Tests

The reliability or consistency of scores can be examined in a variety of ways, including the degree to which scores for the same test taker are consistent across different test forms (so-called "equivalent forms reliability") and different occasions of testing ("test-retest reliability"). This study examined the consistency of TOEIC Speaking and Writing scores across different test forms at different time intervals (e.g., 1–30 days, 31–60 days) and found that test scores had reasonably high equivalent form test-retest reliability.

read more about Alternate Forms Test-Retest Reliability and Test Score Changes for the TOEIC Speaking and Writing Tests
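
Alternate-forms test-retest reliability is commonly summarized as the correlation between the scores the same test takers earn on two occasions. The sketch below computes a Pearson correlation for made-up scores; it is an illustration of the general idea, not the analysis reported in the study, and the helper `pearson_r` and the score values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for five test takers on two forms taken weeks apart.
first_attempt = [100, 120, 140, 160, 180]
second_attempt = [110, 120, 150, 160, 190]

reliability = pearson_r(first_attempt, second_attempt)
```

Values close to 1 indicate that test takers keep nearly the same rank order across forms and occasions.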

Statistical Analyses for the TOEIC® Speaking and Writing Pilot Study

This paper reports the results of a pilot study that contributed to TOEIC Speaking and Writing test development. The analysis of the reliability of test scores found evidence of several types of score consistency, including inter-rater reliability (agreement of several raters on a score) and internal consistency (a measure based on correlation between items on the same test).

read more about Statistical Analyses for the TOEIC Speaking and Writing Pilot Study
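
The two kinds of consistency mentioned above can be sketched with simple statistics. The example below uses the exact-agreement rate as one basic indicator of inter-rater reliability and Cronbach's alpha as a common measure of internal consistency. Whether the pilot study used these particular statistics is an assumption; the function names and data are hypothetical.

```python
from statistics import variance

def exact_agreement(rater_a, rater_b):
    """Proportion of responses on which two raters gave the same score."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)

def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores[i][j] is test taker i's score on item j."""
    k = len(item_scores[0])
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: two raters scoring four responses,
# and four test takers answering three items.
agreement = exact_agreement([3, 2, 4, 1], [3, 3, 4, 2])
alpha = cronbach_alpha([[2, 3, 3], [1, 1, 2], [3, 3, 3], [0, 1, 1]])
```

Exact agreement ignores how far apart disagreeing ratings are, so operational programs often report it alongside other statistics.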

Field Study Results for the Redesigned TOEIC® Listening and Reading Test

This paper describes the results of a field study for the 2006 redesigned TOEIC Listening and Reading tests, which includes analyses of item and test difficulty, reliability and correlations between test sections with classic TOEIC Listening and Reading tests. Results are consistent with another comparability study (Liao, Hatrak and Yu, 2010), which found evidence of the reliability of the redesigned tests and suggested that scores on the redesigned test could be interpreted and used in similar ways to classic TOEIC Listening and Reading test scores.

read more about field study results for the redesigned TOEIC listening and reading test