The TOEIC® program is used by organizations around the world to prepare graduates for the global work environment and to hire, place, train and promote employees.
In order to provide valid, fair and reliable test scores to these organizations, the TOEIC® Research Program aims to support the quality of the TOEIC® tests through ongoing research.
In addition, ETS and the TOEIC Program are committed to advancing English-language teaching and learning by providing:
- consistent and reliable test scores
- score interpretations that are meaningful, fair and relevant to the real world
- scores that can be used to make fair and equitable decisions reflecting the needs and priorities of score users
- beneficial outcomes to test takers and score users with a positive impact on the teaching and learning of English worldwide
For a comprehensive overview, see the list of all TOEIC research studies and reports in one place, and learn why organizations trust TOEIC scores in their decision making.
Monitoring Score Change Patterns to Support TOEIC® Listening and Reading Test Quality
In large-scale, high-stakes testing programs, such as the TOEIC program, some test takers take a test more than once over time. The score change patterns of these so-called "repeaters" can be analyzed to support the overall quality of the test (e.g., its reliability, validity, intended uses). This study examined these score change patterns to evaluate the reliability and validity of TOEIC® Listening and Reading test scores.
Validity: What Does It Mean for the TOEIC® Tests?
This paper provides a nontechnical overview of test development and research projects undertaken to ensure that TOEIC test scores serve as valid indicators of test takers' skills to communicate in English in global workplace environments.
Setting Standards on the TOEIC® Listening and Reading Test and the TOEIC® Speaking and Writing Tests: A Recommended Procedure
Employers often use TOEIC test scores as one source of information to make a number of decisions. These include:
- recruitment of new employees
- movement of current employees into jobs that require English-language skills
- placement of employees into English-language training programs
Assessing English-Language Proficiency In All Four Language Domains: Is It Really Necessary?
This article examines and argues in favor of assessing English-language proficiency using a comprehensive four-skill assessment (i.e., listening, speaking, reading and writing) rather than just a select subset of those skills. Different use cases for TOEIC tests lead to the conclusion that in most cases, English-language proficiency is best evaluated using a comprehensive four-skill assessment.
How ETS Scores the TOEIC® Speaking and Writing Test Responses
Typically, human raters are used to score Speaking and Writing tests because of their ability to evaluate a broader range of language performance than automated systems. This paper describes how ETS ensures the reliability and consistency of scores by human raters for TOEIC® Speaking and Writing tests through training, certification, and systematic administrative and statistical monitoring procedures.
The Relationship Among TOEIC® Listening, Reading, Speaking and Writing Skills
Through examination of test scores, this research found that the TOEIC tests measure distinct but related skills, and that, taken together, they provide a reasonably complete picture of English-language proficiency. This finding provides additional evidence that a four-skill approach to language proficiency assessment is crucial.
Mapping TOEIC® Test Scores to the STANAG 6001 Language Proficiency Levels
STANAG 6001 is a NATO Standardization Agreement which describes explicit listening, speaking, reading and writing proficiency levels necessary for military personnel. This study aimed to identify which minimum scores for each of the TOEIC tests' four skill areas correspond to the different STANAG proficiency levels. Thus, this study provides guidance to score users who need to make decisions about language proficiency based on achievement of STANAG proficiency levels.
The Case for a Comprehensive, Four-Skill Assessment of English-Language Proficiency
This paper explains how four-skill language testing is the best way to evaluate whether someone can communicate in English, and how this approach can:
- result in a fairer way of assessment for test takers
- improve the quality of test users' decisions
- create more positive impact for decision makers, teachers and learners
Monitoring TOEIC® Listening and Reading Test Performance Across Administrations Using Examinees' Background Information
The scoring process for the TOEIC Listening and Reading test includes monitoring procedures that help ensure that scores are consistent across different forms and test administrations, and that skill interpretations are fair. This study explores the possibility of using information about test takers' backgrounds in order to enhance several types of monitoring procedures. Results of the analyses suggested that some background variables may facilitate the monitoring of test performance across administrations, thereby strengthening quality control procedures for the TOEIC Listening and Reading test as well as strengthening evidence of score consistency.
Statistical Analyses for the Updated TOEIC® Listening and Reading Test
To ensure that tests continue to meet the needs of test takers and score users, it is important that testing programs periodically revisit their assessments. For this reason, in order to keep up with the continuously changing use of English and the ways in which individuals commonly communicate in the global workplace and everyday life, an updated TOEIC Listening and Reading test was designed and first launched in May 2016.
The Case of Taiwan: Perceptions of College Students about the Use of TOEIC® Tests to Graduate
This study examines test taker perceptions about the use of the TOEIC test as one of the college English-language exit tests of Taiwan's higher education institutions. The results suggest that the use of TOEIC test scores as a requirement for graduation has a positive impact on language learning. Such test use has also proven to be in line with the intended use of the TOEIC test: to prepare test takers to gain a competitive edge in the job market.
Evaluating the Stability of Test Score Means for the TOEIC® Speaking and Writing Tests
For educational tests, it is critical to maintain consistency of score scales and to understand the sources of variation in score means over time. This helps ensure that interpretations about test takers' abilities are comparable from one administration (or form) to another. Using statistical procedures, this study examined the consistency of reported scores for the TOEIC Speaking and Writing tests.
Measuring English-language Workplace Proficiency across Subgroups: Using CFA Models to Validate Test Score Interpretation
This study used a statistical technique called "factor analysis" to determine which statistical model best explained performance on the TOEIC Listening and Reading test. Researchers found that a two-factor model, in which reading and listening skills were represented as distinct abilities, best accounted for performance, consistent with how scores are supposed to be interpreted.
Insights into Using TOEIC® Scores to Inform Human Resource Management Decisions
This study provided preliminary insights into how TOEIC scores are used to inform personnel decisions related to the hiring, promotion and training of employees. The ultimate objective was to support appropriate test score use and meaningful score-based interpretations in order to facilitate human resource management decisions.
Comparison of Content, Item Statistics, and Test Taker Performance on the Redesigned and Classic TOEIC® Listening and Reading Test
This paper compares the content, reliability and difficulty of the classic and 2006 redesigned TOEIC Listening and Reading tests. Although the redesigned tests included slightly different item types to better reflect current models of language proficiency, the tests were judged to be similar across versions.
Background and Goals of the TOEIC® Listening and Reading Test Update Project
This report describes the goals and outcomes of a project to update the TOEIC Listening and Reading test in 2016. The use of English for communication, particularly in international workplace contexts, is continually evolving. Therefore, the TOEIC Listening and Reading test is reexamined periodically to ensure that the test content reflects current communication in the workplace and in daily life, thereby supporting meaningful interpretations about English-language skills and promoting a positive impact on English teaching and learning.
Linking TOEIC® Speaking Scores Using TOEIC® Listening Scores
In testing programs, multiple forms of a test are used across different administrations to prevent overexposure of test forms and to reduce the possibility of test takers gaining advance knowledge of test content. Because slight differences may occur in the statistical difficulty of the alternate forms, a statistical procedure known as test score linking has been commonly used to adjust for these differences in difficulty so that test forms are comparable.
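As an illustration of the general idea of score linking, the following is a minimal sketch of linear linking, which maps scores from one form onto the scale of another by matching the means and standard deviations of the two score distributions. It is not the procedure used in the study: the function name `linear_link`, the random-groups assumption, and the score data are all hypothetical.

```python
from statistics import mean, pstdev


def linear_link(x_scores, y_scores):
    """Return a function mapping form-X scores onto the form-Y scale.

    Linear linking matches the mean and standard deviation of the two
    score distributions: y = mu_y + (sd_y / sd_x) * (x - mu_x).
    Assumes the two groups of test takers are randomly equivalent.
    """
    mu_x, mu_y = mean(x_scores), mean(y_scores)
    sd_x, sd_y = pstdev(x_scores), pstdev(y_scores)
    if sd_x == 0:
        raise ValueError("form X scores have no variance")
    slope = sd_y / sd_x
    return lambda x: mu_y + slope * (x - mu_x)


# Hypothetical raw scores; form X is slightly harder than form Y.
form_x = [10, 12, 14, 16, 18]
form_y = [12, 14, 16, 18, 20]

to_y_scale = linear_link(form_x, form_y)
print(to_y_scale(14))  # a form-X score of 14 maps to 16.0 on the form-Y scale
```

Operational linking designs are usually more involved (for example, they may rely on anchor scores from another test), so this sketch only shows the adjustment step in its simplest form.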
The Consistency of TOEIC® Speaking Scores Across Ratings and Tasks
This study examines the consistency of TOEIC Speaking scores. The analysis uses a methodology based on generalizability theory, which allows researchers to examine the degree to which aspects of the testing procedure (i.e., raters, tasks) influence scores. The results contribute evidence to support claims that TOEIC Speaking scores are consistent.
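To make the generalizability-theory idea more concrete, here is a minimal sketch of a one-facet G study in which every test taker is scored by every rater. It estimates variance components from the usual two-way ANOVA mean squares and reports a relative generalizability coefficient. The design, the function name `g_study`, and the ratings are assumptions for illustration; the study itself considers both raters and tasks.

```python
def g_study(scores):
    """Estimate variance components for a fully crossed persons x raters design.

    scores[p][r] is the score given to test taker p by rater r.
    """
    n_p, n_r = len(scores), len(scores[0])
    if n_p < 2 or n_r < 2:
        raise ValueError("need at least two test takers and two raters")

    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    # Mean squares for persons, raters, and the residual (interaction + error).
    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in r_means) / (n_r - 1)
    ss_res = sum(
        (scores[p][r] - p_means[p] - r_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Negative estimates are set to zero, a common convention.
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)
    var_res = ms_res

    # Relative G coefficient for the mean over n_r raters.
    denom = var_p + var_res / n_r
    g_coef = var_p / denom if denom > 0 else float("nan")
    return {"person": var_p, "rater": var_r, "residual": var_res, "g": g_coef}


# Hypothetical ratings: three test takers, two raters.
ratings = [[2, 3], [4, 3], [4, 5]]
result = g_study(ratings)
print(result)
```

A coefficient close to 1 would indicate that differences in scores mostly reflect differences among test takers rather than rater inconsistency.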
Expanding the Question Formats of the TOEIC® Speaking Test
Traditionally, researchers have used the term "authenticity" to refer to the degree to which tasks on a language test correspond to those used in the real world, with authenticity being a desired characteristic of tasks and tests. This white paper explains how the format of several questions in the TOEIC Speaking test was expanded to include a greater variety of real-world situations.
Monitoring Individual Rater Performance for the TOEIC® Speaking and Writing Tests
This paper describes procedures implemented on the TOEIC Speaking and Writing tests for monitoring individual rater performance and enhancing overall scoring quality. These multifaceted, carefully developed procedures help ensure that the potential for human error is kept to a minimum, thereby contributing to the TOEIC tests' scoring consistency and reliability.
Analyzing Item Generation with Natural Language Processing Tools for the TOEIC® Listening Test
The TOEIC Listening test includes items or tasks related to the global workplace and set in a variety of authentic contexts. As the need for an ever larger number of test forms has increased, an important goal for the TOEIC Listening test has been to increase the efficiency of item generation by maintaining a large pool of items across a wide range of contexts.
Statistical Analyses for the Expanded Item Formats of the TOEIC® Speaking Test
Testing programs should periodically review their assessments to ensure that their test items or tasks are well-aligned with real-world activities. For this reason, to better support communicative language learning and to discourage the use of memorization and other test-taking strategies, ETS expanded the existing format of some items of the TOEIC® Speaking test in May 2015.
The Incremental Contribution of TOEIC® Listening, Reading, Speaking and Writing Tests to Predicting Performance on Real-Life English-Language Tasks
This study investigated whether proficiency in a particular language skill (e.g., speaking) could be better estimated by considering not only the TOEIC test scores corresponding to that skill, but also TOEIC test scores for other skills. The results supported this assertion, suggesting that scores on the four-skill TOEIC tests together provide a more valid measurement of English-language proficiency than any skill in isolation.
Alternate Forms Test-Retest Reliability and Test Score Changes for the TOEIC® Speaking and Writing Tests
The reliability or consistency of scores can be examined in a variety of ways, including the degree to which scores for the same test taker are consistent across different test forms (so-called "equivalent forms reliability") and different occasions of testing ("test-retest reliability"). This study examined the consistency of TOEIC Speaking and Writing scores across different test forms at different time intervals (e.g., 1–30 days, 31–60 days) and found that test scores had reasonably high equivalent-forms test-retest reliability.
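One common way to quantify this kind of consistency is the Pearson correlation between the scores the same test takers obtain on two occasions or forms. The sketch below assumes that approach; the helper `pearson_r` and the scores are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary on both occasions")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five repeaters on two different forms.
first_attempt = [120, 140, 150, 160, 180]
second_attempt = [130, 140, 160, 160, 190]

reliability = pearson_r(first_attempt, second_attempt)
mean_change = sum(b - a for a, b in zip(first_attempt, second_attempt)) / len(first_attempt)
print(f"test-retest correlation: {reliability:.3f}, mean score change: {mean_change:.1f}")
```

Grouping repeaters by the interval between attempts (for example, 1–30 days versus 31–60 days) and computing the correlation within each group would mirror the interval-based comparison described above.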
The TOEIC® Listening, Reading, Speaking and Writing Tests: Evaluating Their Unique Contribution to Assessing English-Language Proficiency
This study investigates:
- The extent to which TOEIC test scores of one ability correlate with test takers' self-assessments of their English abilities across all four skills
- Whether one English skill (e.g., reading) can be more accurately estimated or predicted using multiple other TOEIC test scores, i.e., listening, speaking and writing
Statistical Analyses for the TOEIC® Speaking and Writing Pilot Study
This paper reports the results of a pilot study that contributed to TOEIC Speaking and Writing test development. The analysis of the reliability of test scores found evidence of several types of score consistency, including inter-rater reliability (agreement of several raters on a score) and internal consistency (a measure based on correlation between items on the same test).
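The two kinds of consistency mentioned above can be illustrated with a short sketch. It assumes Cronbach's alpha as the internal-consistency index and the exact agreement rate as a simple inter-rater index; the paper may use other statistics, and the function names and data here are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per test taker."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    item_var = sum(pvariance(col) for col in zip(*rows))
    total_var = pvariance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have no variance")
    return k / (k - 1) * (1 - item_var / total_var)


def exact_agreement(rater_a, rater_b):
    """Share of responses on which two raters assign the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty lists of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


# Hypothetical item scores: four test takers, three items scored 0 or 1.
item_scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(cronbach_alpha(item_scores))  # 0.75

# Hypothetical ratings of the same five responses by two raters.
print(exact_agreement([3, 4, 2, 5, 3], [3, 4, 3, 5, 3]))  # 0.8
```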
Evidence-Centered Design: The TOEIC® Speaking and Writing Tests
Evidence-Centered Design (ECD) is an assessment development methodology which explicitly clarifies what an assessment measures and supports skills interpretations based on test scores. This paper describes the ECD processes used to develop the TOEIC Speaking and Writing tests. Evidence collected through the test design process produced foundational support for the validity of TOEIC Speaking and Writing test score interpretations.
Constructed-Response (CR) Differential Item Functioning (DIF) Evaluations for TOEIC® Speaking and Writing Tests
Differential item functioning (DIF) analysis is a statistical procedure used to identify items or tasks that are unexpectedly biased in some way, inappropriately favoring one group of test takers over another. One of the challenges for speaking and writing tests is the lack of proven, practical DIF techniques that can be used to analyze performance-based or "constructed-response" tests. This paper investigates several such techniques and illustrates how research is being conducted to ensure the fairness of score interpretations.
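As a rough illustration of what a DIF index for a constructed-response task can look like, the sketch below computes a standardized mean difference: test takers are matched on an overall score, the mean task score of a focal group is compared with that of a reference group at each matched level, and the differences are averaged using the focal group's distribution as weights. This index is an assumption chosen for illustration and is not necessarily among the techniques the paper investigates; the function name and the records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def standardized_mean_difference(records):
    """Standardized mean difference for one task.

    records: iterable of (group, matching_score, task_score), where group
    is "focal" or "reference" and matching_score is the overall ability
    proxy used to match test takers.
    """
    by_level = defaultdict(lambda: {"focal": [], "reference": []})
    for group, level, task_score in records:
        by_level[level][group].append(task_score)

    # Only levels observed in both groups can be compared.
    comparable = [g for g in by_level.values() if g["focal"] and g["reference"]]
    if not comparable:
        raise ValueError("no matching-score level contains both groups")

    n_focal = sum(len(g["focal"]) for g in comparable)
    return sum(
        len(g["focal"]) / n_focal * (mean(g["focal"]) - mean(g["reference"]))
        for g in comparable
    )


# Hypothetical records: (group, matching score level, task score).
data = [
    ("focal", 1, 2), ("focal", 1, 3), ("reference", 1, 3), ("reference", 1, 3),
    ("focal", 2, 4), ("focal", 2, 4), ("reference", 2, 4), ("reference", 2, 5),
]
print(standardized_mean_difference(data))  # -0.5: focal group scores lower when matched
```

A value near zero suggests that matched test takers from both groups perform similarly on the task; values far from zero would flag the task for review.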
Field Study Results for the Redesigned TOEIC® Listening and Reading Test
This paper describes the results of a field study for the 2006 redesigned TOEIC Listening and Reading tests, which includes analyses of item and test difficulty, reliability and correlations between test sections with classic TOEIC Listening and Reading tests. Results are consistent with another comparability study (Liao, Hatrak and Yu, 2010), which found evidence of the reliability of the redesigned tests, and suggested that scores on the redesigned test could be interpreted and used in similar ways to classic TOEIC Listening and Reading test scores.
Validating TOEIC Bridge™ Scores Against Teacher and Student Ratings: A Small-Scale Study
This study sought to assess the degree to which TOEIC Bridge™ scores correspond to student self-assessments and teacher assessments of students, two measurements of English-language proficiency. TOEIC Bridge scores were found to be moderately correlated with these measurements, a finding which provides validity evidence that TOEIC Bridge scores can be meaningfully interpreted as indicators of English-language proficiency.
TOEIC Bridge™ Scores: Validity Evidence from Korea and Japan
This study sought to compare TOEIC Bridge scores to test takers' self-evaluations of their own abilities to perform everyday language tasks in English. The results suggest that the test scores correlated well with test takers' self-evaluations, providing further evidence in support of TOEIC Bridge scores as valid and fair indicators of English-language proficiency.
Background and Goals of the TOEIC® Listening and Reading Test Redesign Project
As time progresses, it becomes important to revisit the design of a test to ensure that its conceptualization of language proficiency aligns with current theory and test tasks continue to be indicative of real-world tasks. This report outlines the goals, theoretical alignment, procedures and outcomes of a redesign effort for the TOEIC Listening and Reading test.
The Redesigned TOEIC® Listening and Reading Test: Relations to Test Taker Perceptions of Proficiency in English
After any test redesign project, such as the redesign of the TOEIC Listening and Reading test in 2006, it is important to provide evidence that test scores can still be meaningfully interpreted. This study examined the relationship between scores on the redesigned TOEIC Listening and Reading test and test takers' perceptions of their own English proficiency. Researchers found moderate correlations between the test scores and test takers' perceptions, providing evidence that scores on the redesigned TOEIC Listening and Reading tests are meaningful indicators of English ability.
The Relationships of Test Scores Measured by the TOEIC® Listening and Reading Test and TOEIC® Speaking and Writing Tests
This study examines the relationship between TOEIC Listening and Reading scores and TOEIC Speaking and Writing scores in order to determine whether or not Listening and Reading scores should be used as predictors of Speaking and Writing scores and vice versa. Findings support the validity of test scores for the measured skills (e.g., Listening and Reading test scores provide meaningful interpretations of Listening and Reading skills).
TOEIC® Speaking and Writing Tests: Relations to Test Taker Perceptions of Proficiency in English
This study sought to compare scores on the TOEIC Speaking and Writing tests to students' self-evaluations of their abilities to perform everyday English-language tasks. The researchers reported relatively strong correlations between test scores and the self-evaluations. This finding contributes further evidence in support of TOEIC Speaking and Writing test scores as indicators of English-language proficiency. This study was also published as Powers, Kim, Weng, and Van Winkle (2009).
TOEIC® Listening and Reading Test Scale Anchoring Study
Scale anchoring is a process that groups test scores into score ranges or proficiency levels. It uses a combination of statistical methods and expert judgment to produce descriptions of the skills and knowledge typically exhibited by test takers at each proficiency level. This research report describes the scale anchoring process for TOEIC Listening and Reading tests, which facilitates meaningful score interpretations.
Relating Scores on the TOEIC Bridge™ Test to Student Perceptions of Proficiency in English
This study investigated the relationship between TOEIC Bridge scores and students' evaluations of their own English-language proficiency. The TOEIC Bridge test scores were found to be correlated with self-reported reading and listening skills, providing evidence that TOEIC Bridge test scores are valid or meaningful indicators of English-language reading and listening proficiency.
Validating TOEIC Bridge™ Scores Against Teacher Ratings for Vocational Students in China
This study compared TOEIC Bridge scores with teachers' assessments of test takers' abilities to perform everyday language tasks in English. The authors reported moderate correlations between these assessments and test scores, which provide supporting evidence of the validity of TOEIC Bridge test scores as indicators of English-language proficiency.