IRT equating methods have been used successfully with the TOEFL® test (Test of English as a Foreign Language™) for many years, and for the most part the observed properties of items have been consistent with model predictions. However, some items do not appear to hold their IRT pretest estimates. If relationships can be found between features of TOEFL items in pretest calibrations and a subsequent lack of model-data fit when these items are used in final forms, steps can be taken to eliminate the use of such items in TOEFL final forms. The purpose of this exploratory study was to investigate item features that may contribute to a lack of invariance of TOEFL item parameters. The results of the study indicated the following: (a) subjective and quantitative measures developed for the study provided consistent information related to the model-data fit of TOEFL test items, (b) for Sections 1 and 2, items that were pretested before 1986 exhibited poorer model-data fit than items that were pretested after 1986, and (c) for Section 3 reading comprehension, model-data fit appeared to be related to changes in the relative position of items within the sections from the pretest to the final form administrations. Based on these results, it was recommended that (a) the TOEFL program investigate the feasibility of not using pretest IRT statistics for Section 1 and 2 items pretested before 1986 and (b) guidelines be developed for test developers to use with reading comprehension items to limit the change in relative positions of items in the test from pretest to final form administrations.