
Scalar Analysis of the Test of Written English (TWE)

Henning, Grant
Publication Year:
Report Number:
ETS Research Report
Document Type:
Page Count:
Subject/Key Words:
English (Second Language), English Tests, Essay Tests, Rasch Model, Rating Scales, Scaling, Test of Written English (TWE)


The present research was conducted to explore the psychometric characteristics of the TWE® (Test of Written English™) rating scale. Rasch model scalar analysis methodology was employed with more than 4,000 scored essays across two elicitation prompts to gather the following information about the TWE rating scale and rating process: 1) the position and size of the interval on the overall latent trait that could be attributed to the behavioral descriptors accompanying each possible integer scoring step on the TWE scale; 2) the standard error of estimate associated with each possible transformed integer rating; 3) the fit of rating scale steps and individual rated essays to a unidimensional model of writing ability and, concurrently, the adequacy of such a model, including the proportion of misfitting essays among all essays analyzed; 4) the fit of individual readers to a unidimensional model of writing ability and to the expectations of a chi-square contingency test of independence of readers and ratings assigned, along with information on some characteristics of misfitting readers; and 5) comparative scalar information for two distinct TWE elicitation prompts, including nonparametric tests of the independence of readers and scale steps assigned and the feasibility of equating the scales. Results suggested that the intervals between TWE scale steps were surprisingly uniform and that the size of the intervals was appropriately larger than the error associated with the assignment of individual ratings. The proportion of positively misfitting essays was small (approximately 1% of all essays analyzed) and was approximately equal to the proportion of essays that required adjudication by a third reader. This latter finding, along with the low proportion of misfitting readers detected, provided preliminary evidence of the feasibility of employing Rasch rating scale analysis methodology for the equating of TWE essays prepared across prompts. Some information on the characteristics of misfitting readers was also presented that could be useful in the reader training process.
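
The chi-square contingency test of independence of readers and ratings mentioned in the abstract can be illustrated with a minimal sketch. This is not the report's own procedure or data: the function name `chi_square_independence`, the three readers, the six rating categories, and all counts below are hypothetical and chosen only to show how the statistic is formed from a reader-by-rating table.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts.

    Rows are readers and columns are rating categories. Under independence,
    the expected count in a cell is row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts (assumption, not data from the report):
# each row is one reader, each column one rating category.
counts = [
    [2, 10, 30, 40, 15, 3],
    [1, 12, 28, 42, 14, 3],
    [8, 25, 35, 22, 8, 2],  # a reader who assigns low ratings more often
]

stat, dof = chi_square_independence(counts)
print(f"chi-square = {stat:.2f}, df = {dof}")
```

A large statistic relative to the chi-square distribution with the given degrees of freedom would suggest that the ratings assigned depend on the reader. Cells with very small expected counts weaken the approximation, so sparse categories are often pooled before testing.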
