We derive formulas for the differential item functioning (DIF) measures that two routinely used DIF statistics are designed to estimate. The DIF measures that match on observed scores are compared to DIF measures based on an unobserved ability (theta or true score) for items described by either the one-parameter logistic (1PL) or two-parameter logistic (2PL) item response theory (IRT) model. We use two different weighting schemes (uniform weights and item discrimination weights) to construct the observed score matching variable. Our results show that (a) under the 1PL model, the observed score-based DIF measures always approximate the true score-based DIF measures very closely; (b) under the 2PL model, when the observed score is the simple sum score and the groups differ in ability, the observed score-based DIF measures underestimate or overestimate the true score-based DIF measures even under the null hypothesis of no DIF, and this bias is related to the degree to which the average discrimination parameter underestimates or overestimates the studied item's discrimination parameter; and (c) under the 2PL model, when item discrimination weights are used to define the observed score, the observed score-based DIF measures always approximate the true score-based DIF measures very closely. These results hold for any set of item responses described by either the 1PL or 2PL IRT model.
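The two model components referred to above can be sketched concretely. The following is a minimal illustrative sketch, not the paper's derivation: it shows the 2PL item response function (the 1PL model is the special case of equal discriminations) and the two ways of forming the observed score matching variable, using uniform weights (the simple sum score) versus item discrimination weights. All parameter values are hypothetical.

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function: P(correct response | theta).

    a is the item discrimination, b the item difficulty.
    The 1PL model is the special case where every item has the same a.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def observed_score(responses, weights):
    """Weighted observed score used as the DIF matching variable."""
    return sum(w * u for w, u in zip(weights, responses))

# Hypothetical 3-item test: discrimination (a) and difficulty (b) parameters.
a_params = [0.8, 1.0, 1.5]
b_params = [-0.5, 0.0, 0.5]

def true_score(theta):
    """True score at theta: the sum of item response probabilities."""
    return sum(irf_2pl(theta, a, b) for a, b in zip(a_params, b_params))

# One examinee's 0/1 response pattern.
responses = [1, 0, 1]

# Uniform weights give the simple sum score ...
sum_score = observed_score(responses, [1.0] * len(responses))
# ... while discrimination weights give more credit to more discriminating items.
weighted_score = observed_score(responses, a_params)
```

Under this setup, matching examinees on `sum_score` versus `weighted_score` can place the same response pattern at different points of the matching distribution whenever the discriminations are unequal, which is the mechanism behind the contrast between results (b) and (c).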