Maximum a posteriori Based Decoding for CTC Acoustic Models

Naoyuki Kanda, Xugang Lu, Hisashi Kawai


This paper presents a novel decoding framework for connectionist temporal classification (CTC)-based acoustic models (AMs). Although a CTC-based AM inherently has the properties of a language model (LM), an external LM trained on a large text corpus is still essential to obtain the best results. Previous work has used a naive interpolation of the CTC-based AM score and the external LM score, although there is no theoretical justification for it. In this paper, we propose a theoretically sounder decoding framework derived from maximizing the posterior probability of a word sequence given an observation. Our decoding framework introduces a subword LM (SLM) to coordinate the CTC-based AM score and the word-level LM score. In experiments with the Wall Street Journal (WSJ) corpus and the Corpus of Spontaneous Japanese (CSJ), our proposed framework consistently achieved improvements of 7.4–15.3% over the conventional interpolation-based framework. In the CSJ experiment, given 586 hours of training data, the CTC-based AM achieved a word error rate 6.7% better than that of the baseline method with deep neural networks and hidden Markov models.
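The abstract does not state the scoring formula, so the following is only a minimal sketch of how the two decoding scores might differ. It assumes the MAP-based score takes the Bayes-rule form log P(s|X) - log P(s) + log P(W) for a subword sequence s and word sequence W, with the SLM supplying P(s). The function names, the weights `lm_weight` and `slm_weight`, and this exact form are hypothetical and are not taken from the paper.

```python
def interpolation_score(log_p_ctc, log_p_word_lm, lm_weight=1.0):
    """Conventional score: CTC AM log-probability plus weighted word LM log-probability."""
    return log_p_ctc + lm_weight * log_p_word_lm


def map_style_score(log_p_ctc, log_p_subword_lm, log_p_word_lm,
                    slm_weight=1.0, lm_weight=1.0):
    """Assumed MAP-style score (hypothetical form, not confirmed by the abstract).

    The subword LM log-probability is subtracted, which divides out the
    subword-level prior already implicit in the CTC AM before the
    word-level LM score is added.
    """
    return log_p_ctc - slm_weight * log_p_subword_lm + lm_weight * log_p_word_lm


# Illustrative log-probabilities for a single hypothesis (made-up numbers).
baseline = interpolation_score(-10.0, -5.0)
proposed = map_style_score(-10.0, -8.0, -5.0)
```

Under this assumed form, the two scores differ only in the subtracted SLM term, so hypotheses whose subword sequences the SLM already considers likely receive a smaller boost than under plain interpolation.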


DOI: 10.21437/Interspeech.2016-71

Cite as

Kanda, N., Lu, X., Kawai, H. (2016) Maximum a posteriori Based Decoding for CTC Acoustic Models. Proc. Interspeech 2016, 1868-1872.

BibTeX
@inproceedings{Kanda+2016,
author={Naoyuki Kanda and Xugang Lu and Hisashi Kawai},
title={Maximum a posteriori Based Decoding for CTC Acoustic Models},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-71},
url={http://dx.doi.org/10.21437/Interspeech.2016-71},
pages={1868--1872}
}