Systems and methods are provided for scoring video clips using visual feature extraction. A signal including a video clip of a subject is received. For each frame of the video clip, physiological features of the subject visually rendered in the video clip are extracted. A plurality of visual words associated with the extracted physiological features are determined. A document including the plurality of visual words is generated. A plurality of feature vectors associated with the document are determined. The plurality of feature vectors are provided to a regression model for scoring.
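
The steps above can be illustrated with a minimal sketch. It is not the claimed implementation: it assumes the per-frame physiological features have already been extracted as small numeric tuples, that a codebook of visual-word centroids is available, that the feature vector is a normalized histogram of visual-word counts, and that the regression model is linear. All names (`nearest_word`, `clip_to_document`, `document_to_vector`, `score`), the codebook, the weights, and the sample data are hypothetical.

```python
from collections import Counter
from math import dist


def nearest_word(feature, codebook):
    """Return the index of the codebook centroid closest to `feature`."""
    return min(range(len(codebook)), key=lambda k: dist(feature, codebook[k]))


def clip_to_document(frame_features, codebook):
    """Map each frame's feature tuple to a visual word; the list is the clip's 'document'."""
    return [f"w{nearest_word(f, codebook)}" for f in frame_features]


def document_to_vector(document, vocab_size):
    """Build a normalized histogram of visual-word counts (assumed feature vector)."""
    counts = Counter(document)
    total = len(document) or 1  # avoid division by zero for an empty clip
    return [counts[f"w{k}"] / total for k in range(vocab_size)]


def score(vector, weights, bias):
    """Linear regression prediction (assumed model form): bias + weights . vector."""
    return bias + sum(w * x for w, x in zip(weights, vector))


# Hypothetical codebook of three visual words in a 2-D feature space.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Hypothetical per-frame features for a four-frame clip.
frame_features = [(0.1, 0.0), (0.9, 1.1), (0.1, 0.9), (1.0, 0.8)]

document = clip_to_document(frame_features, codebook)   # ['w0', 'w1', 'w2', 'w1']
vector = document_to_vector(document, len(codebook))    # [0.25, 0.5, 0.25]
clip_score = score(vector, weights=[1.0, 2.0, 3.0], bias=0.5)  # 2.5
```

In practice the codebook would typically be learned by clustering features from training clips, and the regression weights would be fit to human-assigned scores; both are fixed by hand here only to keep the sketch self-contained.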