A quasi-experimental study examined the link between raters' performance on an accuracy-based calibration test and their subsequent accuracy during operational scoring. Forty-five raters scored a 75-response calibration set and then a 100-response operational set, both drawn from a retired Graduate Record Examinations (GRE) writing prompt whose responses had been vetted to establish correct scores before the study. The study found a positive relationship between calibration accuracy and scoring accuracy. Results suggest that a longer calibration test is more reliable and correlates more strongly with exact agreement during operational scoring; however, even a test as short as 10 responses correlated reasonably well with operational accuracy, and the shorter test classified raters the same way as the 75-response criterion 87% of the time. Adding a requirement of no discrepant scores during calibration improved the correlation with exact agreement but markedly reduced classification accuracy for the shorter tests. These results suggest that calibration in its current form is useful for screening raters, but that passing standards and less controlled operational scoring conditions warrant further investigation.