This study evaluated expert system diagnoses of examinees' solutions to complex constructed-response algebra word problems. Problems were presented to three samples, each of which had taken the GRE General Test. One sample took the problems in paper-and-pencil form and the other two on computer. Responses were then diagnostically analyzed by an expert system, GIDE, and by four ETS mathematics test developers using a fine-grained categorization of error types. Results were highly consistent across the samples. Human judges agreed among themselves almost perfectly in describing responses as right or wrong but concurred at much lower levels (37% to 64% agreement) in categorizing the specific bugs they detected in incorrect solutions. The expert system agreed highly with the judges' right/wrong decisions (95% to 97% concurrence) and somewhat less closely (71% to 74%) with the bug categorizations that the judges, themselves, agreed on. Seven principal causes of machine-rater disagreement were detected, most of which could be remedied by making adjustments to GIDE, modifying the test presentation interface to constrain the form of examinee solutions, and working with test developers to specify rules for automatically dealing with special cases. These results suggest that highly accurate diagnostic analysis through knowledge-based understanding of complex responses may be difficult to achieve at the fine-grained level used by GIDE. The accuracy of qualitative judgments might be increased by using a smaller set of more general diagnostic categories and by integrating information from other sources, including performance on diverse item types.