In spoken language understanding, training classifiers usually requires manually labeled data, such as domain, intent, and slot labels. Starting with some manually labeled data, we propose a data generation approach that augments the training set with synthetic data sampled from a joint distribution over the input query and the output label. We use a recurrent neural network to model this joint distribution and to sample synthetic data for classifier training. Evaluated on ATIS and on live logs of Cortana, a Microsoft voice personal assistant, the approach shows consistent performance improvements on domain classification, intent classification, and slot tagging across multiple languages.
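The sampling idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each example is serialized as one token sequence (a label token followed by the query words), so that a single sequence model covers the joint distribution of label and query. A bigram count model stands in for the recurrent neural network to keep the example short and dependency-free; unlike an RNN, it conditions each word only on the previous token. Only intent labels are shown, slot labels are omitted, and the seed queries and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"

def serialize(label, query):
    # One joint sequence per example: the label token first, then the query words.
    return [BOS, f"<{label}>"] + query.split() + [EOS]

def train_bigram(examples):
    # Count next-token frequencies. This is a stand-in for the next-token
    # distribution an RNN would learn (assumption for illustration only).
    counts = defaultdict(Counter)
    for label, query in examples:
        seq = serialize(label, query)
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_example(counts, rng, max_len=20):
    # Ancestral sampling: draw the label token, then words until EOS
    # or until max_len query words have been generated.
    seq = [BOS]
    while len(seq) < max_len + 2:
        options = counts[seq[-1]]
        tokens = list(options)
        weights = [options[t] for t in tokens]
        nxt = rng.choices(tokens, weights=weights, k=1)[0]
        if nxt == EOS:
            break
        seq.append(nxt)
    label = seq[1][1:-1]  # strip the angle brackets from the label token
    return label, " ".join(seq[2:])

# Hypothetical seed data in the style of a flight-booking domain.
seed = [
    ("flight", "show flights from boston to denver"),
    ("flight", "list flights from dallas to boston"),
    ("airfare", "how much is a ticket to denver"),
    ("airfare", "what is the fare from dallas to boston"),
]

model = train_bigram(seed)
rng = random.Random(0)
synthetic = [sample_example(model, rng) for _ in range(5)]
augmented = seed + synthetic  # training set for a downstream classifier
```

Each sampled pair carries its own label, so the synthetic examples can be appended directly to the manually labeled set before classifier training.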
Bibliographic reference. Tam, Yik-Cheung / Shi, Yangyang / Chen, Hunk / Hwang, Mei-Yuh (2015): "RNN-based labeled data generation for spoken language understanding", In INTERSPEECH-2015, 125-129.