The Theory Behind the TOEIC Program

The Argument-based Approach

How can you determine whether a test is suitable for the purpose for which it was designed? This fundamental question of validity is a concern for test developers, researchers and score users. Professional standards have come to embrace the view that test developers must convince stakeholders (i.e., anyone affected by the test) that the intended use of a test is appropriately supported or justified. This view is formalized in the argument-based approach to justifying test use.

The paper "Articulating and Evaluating Validity Arguments for the TOEIC® Tests" provides an accessible introduction to the argument-based approach, its implementation for TOEIC tests and its perceived benefits for stakeholders.

The paper begins with a brief overview of the assessment use argument, a prominent argument-based approach to validation. Next, it describes the process used to build validation arguments for TOEIC tests.

This process incorporated evidence from a variety of sources, including test documentation, monitoring activities and research. Finally, the paper provides an overview of the two primary ways in which the TOEIC validation arguments are used: to prioritize research, and to communicate with stakeholders.

Overall, this process demonstrates how TOEIC research takes a broad, critical and rigorous approach to supporting appropriate uses of the TOEIC tests. The work is also intended to improve the assessment literacy of stakeholders by focusing on the critical claims that all test developers should support.

The Purpose, Structure and Utility of an Assessment Use Argument (AUA)

Purpose

The argument-based approach to justifying test use presumes that test developers must convince stakeholders (i.e., anyone affected by the test) that the intended use of the test is justified. To this end, the test developer makes explicit claims regarding how test scores should be interpreted and used to make decisions. These claims are supported or undermined by evidence, which may include documentation from the test development process, ongoing research or both. By examining the test developer's claims and the evidence offered to support them, stakeholders may arrive at a global evaluation of whether the intended use of the test is justified. This approach is used to:

guide test development
provide direction for ongoing research
serve as an accountability tool for different stakeholder groups

Structure

An Assessment Use Argument (AUA) is "a conceptual framework for guiding the development and use of a particular language assessment, including the interpretations and uses we make on the basis of the assessment" (Bachman & Palmer, 2010, p. 99). The framework is structured as a hierarchical set of claims made by the test developer regarding how test scores should be interpreted and used to make decisions. It takes the following general form:

[Figure: test performance → score → score interpretation → decision → consequences]

Each component in the figure above represents a claim. At the highest level, the test developer may claim that the consequences of decisions based on the test are beneficial for all stakeholder groups (e.g., decision errors have been minimized). This presumes a claim regarding the decisions that follow from score interpretations: specifically, that decisions are equitable and sensitive to the values of relevant institutions (educational, societal, organizational, legal). To justify interpretations about test-taker abilities based on scores, the test developer makes claims about the meaningfulness, impartiality, generalizability, relevance and sufficiency of those interpretations. Finally, all of these claims rest upon the foundational claim that scores based on test-taker performances are consistent across test forms, administrations and raters. Thus, each claim in an AUA consists of:

an outcome of test use (e.g., the decisions that follow from interpretations about test-taker abilities)
qualities of that outcome (e.g., decisions are values-sensitive and equitable)
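
For readers who find a concrete representation helpful, the claim structure described above can be sketched as a small data structure. This is only an illustrative sketch, not part of the AUA framework or of any TOEIC tooling: the class name Claim, the backing field, the helper unsupported and the placeholder evidence label are all assumptions introduced here. Each claim pairs an outcome with its qualities, and the claims are listed from the foundational claim upward.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One claim: an outcome of test use plus the qualities claimed for it."""
    outcome: str
    qualities: list[str]
    # Evidence cited in support of the claim (hypothetical field for illustration).
    backing: list[str] = field(default_factory=list)


# Claims ordered from the foundation upward, following the description in the text.
aua = [
    Claim("scores", ["consistent"]),
    Claim("score interpretations",
          ["meaningful", "impartial", "generalizable", "relevant", "sufficient"]),
    Claim("decisions", ["values-sensitive", "equitable"]),
    Claim("consequences", ["beneficial"]),
]


def unsupported(claims):
    """Return the outcomes of claims that have no backing recorded yet."""
    return [c.outcome for c in claims if not c.backing]


# Placeholder label only; it does not refer to a real study.
aua[0].backing.append("rater agreement study")
print(unsupported(aua))
```

Listing the claims that still lack backing is one simple way to picture how such a structure might help identify where further evidence is needed.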

Both decision makers and test developers share responsibility for justifying assessment use. Test developers are expected to provide evidence to support the claim that test scores are consistent, and that scores may be used to make interpretations about test-taker abilities. Decision makers need to demonstrate that decisions are values-sensitive and equitable, and that consequences of decisions are beneficial. Unfortunately, decision makers may lack the expertise needed to provide adequate backing for these claims (e.g., documentation from standard setting, estimates of decision errors). Consequently, an AUA may be enhanced through collaboration between decision makers and test developers. At the very least, feedback from decision makers should be sought by test developers in order to determine whether claims about the decisions and consequences based on test use may be justified.

Utility

As a whole, the structure of an AUA provides a basis for a comprehensive justification of test use that links real-world concerns about decisions and their consequences with the traditional concerns of test developers — reliability and validity. As a comprehensive list of claims, warrants, backing and rebuttals, it can be used to identify weaknesses in the overall argument for test use and prioritize research or test development projects.

Finally, as a simple hierarchical set of claims (as shown in the figure above), an AUA can be used as a communication tool that illustrates the key issues that determine important qualities of the usefulness of a test, including fairness, impact, reliability and validity. The concerns of individuals and stakeholder groups vary, and one of the challenges for research is addressing these concerns in a coherent manner, while enhancing the assessment literacy of stakeholders. Concerns can include:

Score consistency: "How can you make sure that all raters follow the scoring guides?"
The interpretation of scores: "When we calculate criterion validity, who or what is the criterion?"
The decisions based on these interpretations: "What are the cutscores in other institutions?"
Consequences of test use: "How have the TOEIC tests been helpful for job seekers?"
Test use that relates to a number of these issues: "How can recruiters know that TOEIC scores meet the needs of the market?"

By delivering versions of an AUA oriented toward specific stakeholder groups, a test developer with a strong research program may be able to help stakeholders find answers to their questions and become more sophisticated consumers of assessment products.

Implementations of this Approach for TOEIC Tests

We provide a description of how this approach has been implemented for the redesigned TOEIC Bridge® tests in the paper, "Making the case for the quality and use of a new language proficiency assessment: Validity argument for the redesigned TOEIC Bridge tests." In this paper, researchers describe the evidence supporting specific claims about score consistency, the interpretation of test scores, decisions based on test scores and the consequences of test use. This synthesis encourages stakeholders to critically engage with the actual claims (and evidence) about what a test measures and how it is intended to be used. This level of engagement can help stakeholders better understand whether the tests are well-suited to meet their needs, as well as their role in facilitating effective use of the tests.

References

Bachman, L. F., & Palmer, A. (2010). Language assessment in practice. Oxford: Oxford University Press.

Schmidgall, J. (2017). Articulating and evaluating validity arguments for the TOEIC® tests (Research Memorandum No. RM-13-09). ETS.

Schmidgall, J., Cid, J., Carter Grissom, E., & Li, L. (2021). Making the case for the quality and use of a new language proficiency assessment: Validity argument for the redesigned TOEIC Bridge® Tests (Research Report No. RR-21-20). ETS.
