A system for end-to-end automated scoring is disclosed. The system includes a word embedding layer for converting a plurality of ASR outputs into input tensors; a neural network lexical model encoder receiving the input tensors; a neural network acoustic model encoder implementing AM posterior probability, word duration, mean value of pitch and mean value of intensity based on a plurality of cues; and a linear regression module, for receiving concatenated encoded features from the neural network lexical model encoder and the neural network acoustic model encoder.