Systems and methods are provided for selecting a proposed test item for inclusion in an examination where a non-multiple choice response to the proposed test item will be automatically scored. A proposed test item is analyzed to generate a proposed test item metric, where the proposed test item is a non-multiple choice test item. The proposed test item metric is provided to a proposed test item scoring model, where the proposed test item scoring model outputs a likelihood score indicative of a likelihood that automated scoring of a response to the proposed test item would be at or above a quality level. The proposed test item is selected for inclusion in the examination based on the likelihood score.
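The selection flow described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not something stated in the source: the two item metrics (prompt word count and a count of rubric key concepts), the logistic form of the scoring model, its coefficients, and the 0.5 selection threshold. The names `ProposedItem`, `item_metrics`, `likelihood_score`, and `select_item` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ProposedItem:
    prompt: str        # text of the non-multiple choice test item
    key_concepts: int  # hypothetical: number of concepts in the scoring rubric


def item_metrics(item: ProposedItem) -> list[float]:
    """Analyze the item to produce illustrative metrics (assumed, not from the source)."""
    words = item.prompt.split()
    return [float(len(words)), float(item.key_concepts)]


# Hypothetical coefficients. In practice they would presumably be fit on
# items whose automated-scoring quality is already known.
BIAS = 2.0
WEIGHTS = [-0.05, -0.3]


def likelihood_score(metrics: list[float]) -> float:
    """Logistic model: likelihood that automated scoring meets the quality level."""
    z = BIAS + sum(w * m for w, m in zip(WEIGHTS, metrics))
    return 1.0 / (1.0 + math.exp(-z))


def select_item(item: ProposedItem, threshold: float = 0.5) -> bool:
    """Select the item for the examination when its likelihood score reaches the threshold."""
    return likelihood_score(item_metrics(item)) >= threshold


if __name__ == "__main__":
    short_item = ProposedItem("Explain why ice floats on water.", key_concepts=2)
    long_item = ProposedItem(" ".join(["word"] * 60), key_concepts=5)
    for item in (short_item, long_item):
        score = likelihood_score(item_metrics(item))
        print(f"score={score:.3f} selected={select_item(item)}")
```

Under these assumed coefficients, a short, narrowly scoped item receives a higher likelihood score than a long item with many rubric concepts, so only the former is selected. A different choice of metrics or model would change which items pass.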