Utterance classification is a critical pre-processing step for many speech understanding and dialog systems. In multi-user settings, one needs to first identify if an utterance is even directed at the system, followed by another level of classification to determine the intent of the user's input. In this work, we propose RNN and LSTM models for both these tasks. We show how both models outperform baselines based on ngram-based language models (LMs), feedforward neural network LMs, and boosting classifiers. To deal with the high rate of singleton and out-of-vocabulary words in the data, we also investigate a word input encoding based on character ngrams, and show how this representation beats the standard one-hot vector word encoding. Overall, these proposed approaches achieve over 30% relative reduction in equal error rate compared to boosting classifier baseline on an ATIS utterance intent classification task, and over 3.9% absolute reduction in equal error rate compared to a the maximum entropy LM baseline of 27.0% on an addressee detection task. We find that RNNs work best when utterances are short, while LSTMs are best when utterances are longer.
Bibliographic reference. Ravuri, Suman / Stolcke, Andreas (2015): "Recurrent neural network and LSTM models for lexical utterance classification", In INTERSPEECH-2015, 135-139.