Monitoring Sources of Variability Within the Test of Spoken English™ Assessment System

Author(s):
Myford, Carol M.; Wolfe, Edward W.
Publication Year:
2000
Report Number:
RR-00-06
TOEFL-RR-65
Source:
Document Type:
Subject/Key Words:
Oral Assessment, Second Language, Performance Assessment, Item Response Theory (IRT), Rater Performance, Rasch Measurement, Facets

Abstract

The purposes of this study were to examine four sources of variability within the Test of Spoken English™ (TSE®) assessment system, to quantify the range of variability for each source, to determine the extent to which these sources affect examinee performance, and to highlight aspects of the assessment system that might suggest a need for change. Data obtained from the February and April 1997 TSE scoring sessions were analyzed using Facets. The analysis showed that, for each of the two TSE administrations, the test usefully separated examinees into eight statistically distinct proficiency levels. The examinee proficiency measures were found to be trustworthy in terms of their precision and stability. It is important to note, though, that the standard error of measurement varies across the score distribution, particularly in the tails of the distribution. The items on the TSE appear to work together; ratings on one item correspond well to ratings on the other items. Yet none of the items seems to function redundantly. Ratings on individual items within the test can be meaningfully combined; there is little evidence of psychometric multidimensionality in the two data sets. Consequently, it is appropriate to generate a single summary measure to capture the essence of examinee performance across the 12 items. However, the items differ little in difficulty, limiting the instrument's ability to discriminate among levels of proficiency. The TSE rating scale functions as a five-point scale, and the scale categories are clearly distinguishable. The scale maintains a similar, though not identical, category structure across all 12 items. Raters differ somewhat in the levels of severity they exercise when rating examinee performances, though the vast majority used the scale in a consistent fashion.
If examinees' scores were adjusted for differences in rater severity, the scores of two-thirds of the examinees in these administrations would have differed from their raw score averages by 0.5 to 3.6 raw score points. Such differences can have important consequences for examinees whose scores lie in critical decision-making regions of the score distribution.
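The Facets analyses summarized above are based on the many-facet Rasch model. As a point of reference, a common rating-scale formulation of that model (the notation here is illustrative and not taken from the report) expresses the log-odds of an examinee receiving category k rather than category k-1 on an item from a given rater as:

```latex
% Many-facet Rasch model (rating-scale form), illustrative notation:
%   B_n : proficiency of examinee n
%   D_i : difficulty of item i
%   C_j : severity of rater j
%   F_k : threshold between rating categories k-1 and k
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
```

Under this model, separate parameters are estimated for examinees, items, raters, and the category structure, which is what allows examinee scores to be adjusted for differences in rater severity as described below.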
