Test-Retest Analyses of the TOEFL® Test

Author(s):
Henning, Grant
Publication Year:
1993
Report Number:
RR-93-31
TOEFL-RR-45
Source:
Document Type:
Subject/Key Words:
Item analysis, repeaters, subtest analysis, test reliability

Abstract

The present study provides two kinds of information that have not previously been available in a single research report on the TOEFL® test (Test of English as a Foreign Language™) with regard to its total and component scores. First, the study provides comparative global and component estimates of test-retest, alternate-form, and internal-consistency reliability. This complies with the joint standards of the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education by controlling for sources of measurement error that may be inherent both among the examinees and within the test administration context, and not merely within the examination itself. Second, the study provides information about differential change in subtest difficulty on repeated administration over a short interval (eight days). This second concern is related to the phenomenon of "item bounce" and reflects the comparative stability of difficulty estimates within item type over repeated test administrations. This stability information may provide useful insights into the functioning of particular TOEFL subtest item types and the suitability of those item types for anchoring in test equating.

Test-length-adjusted reliability estimates were found to be adequately high across reported component and total test scores. Raw score test-retest coefficients ranged from .87 to .98 (with a mean of .93 over 22 total coefficients), raw score internal-consistency coefficients ranged from .79 to .98 (with a mean of .94 over 88 total coefficients), and raw score alternate-form coefficients ranged from .78 to .97 (with a mean of .90 over 22 total coefficients). Nevertheless, the study had several inherent limitations. Chief among these was the comparatively small sample involved. Only 329 total subjects participated, and, due to attrition and design features, test-retest reliability estimates were based on separate repeating subgroups of only 101 and 91 persons. Alternate-form reliability estimates were based on separate repeating subgroups of only 52 and 25 persons. Although estimates were replicated across two TOEFL forms and at least two distinct samples, it was not possible within the existing project constraints to identify repeating samples that were perfectly representative of the current TOEFL examinee population in regard to language background and mean language proficiency.
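
As an illustration only, the two quantities named in the abstract can be sketched in a few lines of Python. The abstract does not state which formula was used for the test-length adjustment; the sketch below assumes the Spearman-Brown formula, a common choice for this purpose. The function names and the score lists are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores.

    Applied to scores from a first and a second administration of the
    same form, this gives a raw-score test-retest coefficient.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r, length_factor):
    """Project reliability r to a test length_factor times as long.

    ASSUMPTION: Spearman-Brown is used here as one plausible
    test-length adjustment; the abstract does not name the method.
    """
    return length_factor * r / (1 + (length_factor - 1) * r)


# Hypothetical scores for five repeaters (not data from the study).
first_administration = [41, 35, 48, 29, 44]
second_administration = [43, 33, 47, 31, 45]

retest_r = pearson_r(first_administration, second_administration)
# Project a subtest coefficient to a test twice as long.
adjusted_r = spearman_brown(retest_r, 2)
print(round(retest_r, 3), round(adjusted_r, 3))
```

Adjusting all component coefficients to a common length makes short and long subtests comparable, since a shorter subtest will otherwise tend to show a lower coefficient simply because it has fewer items.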