CTC Training of Multi-Phone Acoustic Models for Speech Recognition

Olivier Siohan


Phone-sized acoustic units such as triphones cannot properly capture the long-term co-articulation effects that occur in spontaneous speech. For that reason, it is appealing to construct acoustic units covering a longer time span, such as syllables or words. Unfortunately, the frequency distribution of those units is such that a few high-frequency units account for most of the tokens, while many units rarely occur. As a result, those units suffer from data sparsity and can be difficult to train. In this paper we propose a scalable data-driven approach to construct a set of salient units made of sequences of phones, called M-phones. We illustrate that since the decomposition of a word sequence into a sequence of M-phones is ambiguous, those units are well suited for use with a connectionist temporal classification (CTC) approach, which does not rely on an explicit frame-level segmentation of the word sequence into a sequence of acoustic units. Experiments are presented on a Voice Search task using 12,500 hours of training data.
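The key property the abstract relies on is that the CTC loss sums over all frame-level alignments of a label sequence, so no explicit segmentation is needed. As an illustration only (not the paper's implementation), the sketch below computes the CTC log-likelihood with the standard forward (alpha) recursion over the blank-interleaved label sequence, in pure Python:

```python
import math

NEG_INF = float("-inf")

def logadd(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_log_likelihood(log_probs, target, blank=0):
    """CTC forward algorithm.

    log_probs: T x V table of per-frame log posteriors over units.
    target: label sequence (unit ids, no blanks), e.g. M-phone ids.
    Returns log P(target | log_probs), summed over all alignments.
    """
    # Interleave blanks around the labels: b, l1, b, l2, ..., b
    ext = [blank]
    for u in target:
        ext += [u, blank]
    S, T = len(ext), len(log_probs)

    # alpha[s]: log prob of all partial alignments ending in state s
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                      # stay in the same state
            if s > 0:
                a = logadd(a, alpha[s - 1])   # advance by one state
            # Skip over a blank, allowed only between distinct labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or the final blank
    return logadd(alpha[S - 1], alpha[S - 2]) if S > 1 else alpha[S - 1]
```

For a quick sanity check: with 2 frames, a vocabulary of {blank, 1}, uniform per-frame probability 0.5, and target [1], the three alignments "blank 1", "1 blank", and "1 1" each have probability 0.25, so the total likelihood is 0.75.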


DOI: 10.21437/Interspeech.2017-505

Cite as: Siohan, O. (2017) CTC Training of Multi-Phone Acoustic Models for Speech Recognition. Proc. Interspeech 2017, 709-713, DOI: 10.21437/Interspeech.2017-505.


@inproceedings{Siohan2017,
  author={Olivier Siohan},
  title={CTC Training of Multi-Phone Acoustic Models for Speech Recognition},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={709--713},
  doi={10.21437/Interspeech.2017-505},
  url={http://dx.doi.org/10.21437/Interspeech.2017-505}
}