System and Method for Handling the Confounding Effect of Document Length on Vector-based Similarity Scores
- Author(s): Higgins, Derrick
- Patent Issued: Apr 12, 2016
- Patent Number: 9,311,390
- Source: ETS Patent
- Document Type: Patent
- Family ID: 40899302
- Subject/Key Words: Patent, Active Patent, Automated Essay Scoring (AES), Text Coherence, Similarity Measures, Vectors (Mathematics), Random Indexing
Abstract
A computer-implemented method, system, and computer program product generate vector-based similarity scores for text document comparisons while accounting for the confounding effect of document length. Vector-based methods for comparing the semantic similarity of texts (such as Content Vector Analysis and Random Indexing) have a characteristic that may reduce their usefulness for some applications: the similarity estimates they produce are strongly correlated with the lengths of the texts compared. The statistical basis for this confound is described; it suggests applying a pivoted normalization method from information retrieval to correct for the effect of document length. In two text categorization experiments, Random Indexing similarity scores using pivoted normalization are shown to perform significantly better than standard vector-based similarity estimation methods.
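The abstract does not give the patent's exact formula, so the following is only a minimal sketch of the general idea, using the standard pivoted normalization form from information retrieval: the usual vector norm is replaced by `(1 - slope) * pivot + slope * norm`. The function names, the choice to pivot both vectors' norms, and the `pivot` and `slope` values are assumptions made for illustration, not details taken from the patent.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Standard cosine similarity: dot product divided by both Euclidean norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def pivoted_norm(norm, pivot, slope):
    """Pivoted normalization factor (standard IR form, assumed here).

    The factor equals `norm` when norm == pivot; elsewhere it is pulled
    toward `pivot`, more strongly as `slope` decreases from 1 toward 0.
    """
    return (1.0 - slope) * pivot + slope * norm


def pivoted_similarity(u, v, pivot, slope):
    """Similarity using pivoted norms in place of the raw Euclidean norms.

    Assumption: both vectors' norms are pivoted with the same parameters.
    Unlike cosine similarity, the result is not bounded by 1.
    """
    norm_u = pivoted_norm(math.sqrt(dot(u, u)), pivot, slope)
    norm_v = pivoted_norm(math.sqrt(dot(v, v)), pivot, slope)
    return dot(u, v) / (norm_u * norm_v)


# Hypothetical document vectors, chosen only to make the arithmetic easy.
doc_a = [3.0, 4.0]
doc_b = [4.0, 3.0]

plain = cosine_similarity(doc_a, doc_b)
same_as_plain = pivoted_similarity(doc_a, doc_b, pivot=5.0, slope=1.0)
pivoted = pivoted_similarity(doc_a, doc_b, pivot=3.0, slope=0.5)
```

With `slope=1.0` the pivoted score reduces to ordinary cosine similarity. In practice `pivot` is often set to an average norm over a collection and `slope` is tuned on held-out data, though whether the patent does so is not stated in the abstract.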