Data is received that encapsulates a spoken response to a prompt text comprising a string of words. Thereafter, the received data is transcribed into a string of words. The string of words is then compared with a prompt so that a similarity grid representation of the comparison can be generated that characterizes a level of similarity between the string of words in the spoken response and the string of words in the prompt text. The grid representation is then scored using at least one machine learning model. The score indicates a likelihood of the spoken response having been off-topic. Data providing the encapsulated score can then be provided. Related apparatus, systems, techniques and articles are also described.