Systems and methods are provided for conducting a simulated conversation with a language learner include determining a first dialog state of the simulated conversation. First audio data corresponding to simulated speech based on the dialog state is transmitted. Second audio data corresponding to a variable length utterance spoken in response to the simulated speech is received. A fixed dimension vector is generated based on the variable length utterance. A semantic label is predicted for the variable-length utterance based on the fixed dimension vector. A second dialog state of the simulated conversation is determined based on the semantic label, and third audio data corresponding to simulated speech is transmitted based on the second dialog state.