Different training and adaptation techniques for multilingual Automatic
Speech Recognition (ASR) are explored in the context of hybrid systems
exploiting Deep Neural Networks (DNNs) and Hidden Markov Models (HMMs).
In multilingual DNN training, the hidden layers (possibly extracting
bottleneck features) are usually shared across languages, and the output
layer can either model multiple sets of language-specific senones or
one single universal IPA-based multilingual senone set. Both architectures
are investigated, exploiting and comparing different language adaptive
training (LAT) techniques that originate from successful DNN-based speaker adaptation.
More specifically, speaker adaptive training methods such as Cluster
Adaptive Training (CAT) and Learning Hidden Unit Contributions (LHUC)
are considered. In addition, a language adaptive output architecture
for the IPA-based universal DNN is also studied and tested.
Experiments show that
LAT improves performance, and that adapting the top layer further
improves accuracy. By combining state-level minimum Bayes risk
(sMBR) sequence training with LAT, we show that a language adaptively
trained IPA-based universal DNN outperforms a monolingually
sequence-trained model.
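As a rough illustration of the LHUC idea carried over from speaker adaptation to language adaptation, the sketch below rescales the activations of a shared hidden layer with a per-language amplitude vector. This is a minimal NumPy sketch under assumed shapes and names (`LHUCLayer`, the language codes, and the initialization are all hypothetical), not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical LHUC-style layer: the weights and bias are shared across
# languages; each language owns one small parameter vector r whose
# amplitudes a = 2*sigmoid(r) in (0, 2) rescale the hidden units.
class LHUCLayer:
    def __init__(self, dim_in, dim_out, languages, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((dim_in, dim_out)) * 0.1  # shared weights
        self.b = np.zeros(dim_out)                             # shared bias
        # r = 0 gives amplitude 2*sigmoid(0) = 1, i.e. the unadapted network.
        self.r = {lang: np.zeros(dim_out) for lang in languages}

    def forward(self, x, lang):
        h = np.maximum(0.0, x @ self.W + self.b)   # shared ReLU activations
        a = 2.0 * sigmoid(self.r[lang])            # language-specific amplitudes
        return h * a

layer = LHUCLayer(dim_in=40, dim_out=64, languages=["en", "de", "fr"])
x = np.ones((1, 40))
out_en = layer.forward(x, "en")   # r = 0, so this equals the shared network
layer.r["de"] += 1.0              # pretend the "de" LHUC parameters were trained
out_de = layer.forward(x, "de")   # same shared units, rescaled for "de"
```

In LAT, only the per-language `r` vectors (plus, optionally, the top layer) would be updated on each language's data, while the shared parameters stay fixed or are trained jointly.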
Cite as: Tong, S., Garner, P.N., Bourlard, H. (2017) An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation. Proc. Interspeech 2017, 714-718, doi: 10.21437/Interspeech.2017-1242
@inproceedings{tong17_interspeech,
  author={Sibo Tong and Philip N. Garner and Hervé Bourlard},
  title={{An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={714--718},
  doi={10.21437/Interspeech.2017-1242}
}