Describing and Categorizing DIF in Polytomous Items
- Zwick, Rebecca J.; Thayer, Dorothy T.; Mazzeo, John
- Subject/Key Words: Differential item functioning (DIF), Mantel-Haenszel technique, polytomous items, SIBTEST, standardized mean difference, statistical analysis
The purpose of this project was to evaluate statistical procedures for assessing differential item functioning (DIF) in polytomous items (items with more than two score categories). Three descriptive statistics were considered: the Standardized Mean Difference, or SMD (Dorans & Schmitt, 1991), and two procedures based on SIBTEST (Shealy & Stout, 1993). Five inferential procedures were also considered: two based on SMD, two based on SIBTEST, and the Mantel (1963) method. The DIF procedures were evaluated through applications to simulated data, as well as to data from ETS tests. The simulation included conditions in which the two groups of examinees had the same ability distribution and conditions in which the group means differed by one standard deviation. When the two groups had the same distribution, the descriptive index that performed best was the SMD. When the two groups had different distributions, a modified form of the SIBTEST DIF effect size measure tended to perform best. The five inferential procedures performed almost indistinguishably when the two groups had identical distributions. When the two groups had different distributions and the studied item was highly discriminating, the SIBTEST procedures showed much better Type I error control than did the SMD and Mantel methods, particularly in short tests. The power ranking of the five procedures was inconsistent; it depended on the direction of DIF and other factors. Routine application of these polytomous DIF methods at ETS seems feasible in cases where a reliable test is available for matching examinees. For the Mantel and SMD methods, Type I error control may be a concern under certain conditions. In the case of SIBTEST, the current version cannot easily accommodate matching tests that do not use number-right scoring. Additional research in these areas is likely to be useful.
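As a rough illustration of two of the procedures named above, the following sketch computes the SMD and the Mantel chi-square statistic from examinee-level records. It is not taken from the report: the function names (`smd`, `mantel_chi2`), the record layout, and the choice to skip matching-score levels where one group is absent are all assumptions made for this example. It follows the usual textbook definitions, in which the SMD is the focal-group-weighted average of the focal-minus-reference difference in mean item score at each matching-score level, and the Mantel statistic compares the focal group's observed item-score sum with its expectation under no DIF.

```python
from collections import defaultdict
from statistics import mean


def _usable_strata(records):
    """Group item scores by matching score; keep levels where both groups appear.

    records: iterable of (group, matching_score, item_score) with group "R"
    (reference) or "F" (focal). This layout is hypothetical.
    """
    strata = defaultdict(lambda: {"R": [], "F": []})
    for group, match, score in records:
        strata[match][group].append(score)
    # Assumption: levels with an empty group carry no comparison and are skipped.
    return [s for s in strata.values() if s["R"] and s["F"]]


def smd(records):
    """Standardized Mean Difference: focal-weighted mean of (focal - reference)."""
    strata = _usable_strata(records)
    n_focal = sum(len(s["F"]) for s in strata)
    return sum(
        (len(s["F"]) / n_focal) * (mean(s["F"]) - mean(s["R"])) for s in strata
    )


def mantel_chi2(records):
    """Mantel chi-square (1 degree of freedom) for ordered item scores."""
    observed = expected = variance = 0.0
    for s in _usable_strata(records):
        n_r, n_f = len(s["R"]), len(s["F"])
        n = n_r + n_f
        pooled = s["R"] + s["F"]
        s1 = sum(pooled)                 # sum of item scores at this level
        s2 = sum(y * y for y in pooled)  # sum of squared item scores
        observed += sum(s["F"])
        expected += n_f * s1 / n
        variance += n_r * n_f * (n * s2 - s1 ** 2) / (n ** 2 * (n - 1))
    if variance == 0:
        raise ValueError("no item-score variation within any matching level")
    return (observed - expected) ** 2 / variance
```

With this sign convention, a negative SMD indicates that focal-group examinees score lower on the studied item than reference-group examinees with the same matching score. The Mantel statistic would ordinarily be referred to a chi-square distribution with one degree of freedom.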