Minimum Semantic Error Cost Training of Deep Long Short-Term Memory Networks for Topic Spotting on Conversational Speech

Zhong Meng, Biing-Hwang Juang


Topic spotting performance on spontaneous conversational speech can be significantly improved by operating a support vector machine with a latent semantic rational kernel (LSRK) on the decoded word lattices (i.e., weighted finite-state transducers) of the speech [1]. In this work, we propose minimum semantic error cost (MSEC) training of a deep bidirectional long short-term memory (BLSTM)-hidden Markov model acoustic model for generating lattices that are semantically accurate and better suited for topic spotting with LSRK. With MSEC training, the expected semantic error cost of all possible word sequences on the lattices is minimized given the reference. The word-word semantic error cost is first computed from either latent semantic analysis or distributed vector-space word representations learned from recurrent neural networks, and is then accumulated to form the expected semantic error cost of the hypothesized word sequences. The proposed method achieves 3.5%–4.5% absolute topic classification accuracy improvement over the baseline BLSTM trained with cross-entropy on the Switchboard-1 Release 2 dataset.
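The core quantity in the abstract can be illustrated with a minimal sketch: a word-word semantic cost derived from word embeddings, accumulated over a hypothesized word sequence, and averaged under hypothesis posteriors. This is not the paper's implementation; the cosine-distance cost, the equal-length alignment, and the N-best list (in place of a full lattice) are simplifying assumptions for illustration.

```python
import numpy as np

def word_cost(e_ref, e_hyp):
    # Word-word semantic error cost as cosine distance between embeddings.
    # (Illustrative choice; the paper derives costs from LSA or RNN-learned
    # word vectors.)
    cos = np.dot(e_ref, e_hyp) / (np.linalg.norm(e_ref) * np.linalg.norm(e_hyp))
    return 1.0 - cos

def sequence_cost(ref, hyp, emb):
    # Accumulate word-word costs over aligned word pairs.
    # (A real system aligns lattice arcs to the reference; here we assume
    # equal-length sequences for simplicity.)
    return sum(word_cost(emb[r], emb[h]) for r, h in zip(ref, hyp))

def expected_semantic_error_cost(hyps, posteriors, ref, emb):
    # Expected semantic error cost over hypothesized word sequences,
    # weighted by their posterior probabilities -- the quantity MSEC
    # training minimizes (here over an N-best list, not a full lattice).
    return sum(p * sequence_cost(ref, h, emb)
               for p, h in zip(posteriors, hyps))
```

A hypothesis identical to the reference contributes zero cost, so the gradient of this objective pushes posterior mass toward semantically close word sequences rather than merely toward exact word matches, which is the stated motivation for pairing MSEC-trained lattices with LSRK topic spotting.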


DOI: 10.21437/Interspeech.2017-590

Cite as: Meng, Z., Juang, B.-H. (2017) Minimum Semantic Error Cost Training of Deep Long Short-Term Memory Networks for Topic Spotting on Conversational Speech. Proc. Interspeech 2017, 2496-2500, DOI: 10.21437/Interspeech.2017-590.


@inproceedings{Meng2017,
  author={Zhong Meng and Biing-Hwang Juang},
  title={Minimum Semantic Error Cost Training of Deep Long Short-Term Memory Networks for Topic Spotting on Conversational Speech},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={2496--2500},
  doi={10.21437/Interspeech.2017-590},
  url={http://dx.doi.org/10.21437/Interspeech.2017-590}
}