Using a Trainable Pattern-Directed Computer Program to Score Natural Language Item Responses
Kaplan, Randy M.
- Publication Year: 1991
- Report Number: RR-91-31, GREB-89-19P
- Document Type: ETS Research Report
- Page Count:
- Subject/Key Words: Graduate Record Examinations Board, Automated Scoring and Natural Language Processing
This study investigated a computer scoring procedure for constructed-response items whose responses consist of natural language phrases or sentences. Scoring such items, whether manually or with a traditional natural language processing approach, is typically expensive. The purpose of this study was to investigate the feasibility of an alternative to the traditional natural language processing approach. The study had two main goals: to create a prototype scoring program based on an appropriate methodology, and to carry out a preliminary comparison of the program's scores with those of human raters. The scoring approach rests on the assumption that a set of responses can be described as a small language. A grammar can then be constructed that describes this language and serves as a tool for recognizing its sentences. For scoring, this means building a grammar from a sample of responses that have already been scored correct or incorrect; a program then uses this grammar to classify responses that were not used to build it, automatically labeling each response as correct or incorrect. The program was used to score sets of responses from three items. Two data sets consisted of short responses (3 to 5 words); the third consisted of longer responses (12 to 15 words). The program's ratings agreed perfectly with those of human raters for the first two data sets and reached a maximum agreement of 90% for the third. It is concluded from these preliminary results that a trainable pattern-directed computer-scoring program may be a viable, cost-effective approach to scoring short natural language responses. Longer responses present a number of problems that require further investigation.
The author suggests a number of ways the prototype program could be extended to handle these more complex responses.
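The core idea of the report, inducing a small grammar of patterns from responses already scored correct and then matching unseen responses against it, can be illustrated with a minimal sketch. This is not Kaplan's actual program; the tokenizer, the single-slot wildcard generalization rule, and all function names here are assumptions made for the example.

```python
import re


def tokenize(response):
    """Lowercase a response and split it into word tokens."""
    return re.findall(r"[a-z']+", response.lower())


def generalize(patterns):
    """Repeatedly merge pairs of equal-length patterns that differ in
    exactly one position into one pattern with a '*' wildcard there."""
    pats = [list(p) for p in patterns]
    merged = True
    while merged:
        merged = False
        for i in range(len(pats)):
            for j in range(i + 1, len(pats)):
                a, b = pats[i], pats[j]
                if len(a) == len(b):
                    diff = [k for k in range(len(a)) if a[k] != b[k]]
                    if len(diff) == 1:
                        a[diff[0]] = "*"   # generalize the differing slot
                        del pats[j]
                        merged = True
                        break
            if merged:
                break
    return [tuple(p) for p in pats]


def matches(pattern, tokens):
    """A token sequence matches a pattern of the same length whose
    non-wildcard slots agree exactly."""
    return len(pattern) == len(tokens) and all(
        p == "*" or p == t for p, t in zip(pattern, tokens))


def train(scored_responses):
    """Build wildcard patterns from the responses scored correct."""
    correct = [tuple(tokenize(r)) for r, ok in scored_responses if ok]
    return generalize(correct)


def score(grammar, response):
    """Classify an unseen response: correct iff some pattern matches."""
    tokens = tokenize(response)
    return any(matches(p, tokens) for p in grammar)
```

For instance, training on the scored pair "the cat sat" and "the dog sat" (both correct) yields the single pattern `("the", "*", "sat")`, which then accepts the unseen response "the bird sat" while rejecting "a bird flew". Real responses of 12 to 15 words would need a richer generalization scheme, which is exactly the difficulty the report notes for longer responses.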