Transfer Learning with Bottleneck Feature Networks for Whispered Speech Recognition

Boon Pang Lim, Faith Wong, Yuyao Li, Jia Wei Bay


Previous work on whispered speech recognition has shown that acoustic models (AMs) trained on whispered speech can classify unwhispered (neutral) speech sounds to some extent, but not vice versa. In fact, AMs trained purely on neutral speech completely fail to recognize whispered speech. Meanwhile, recipes used to train neutral AMs work just as well for whispered speech, but such methods require a large volume of transcribed whispered speech, which is expensive to gather. In this work, we propose and investigate the use of bottleneck feature networks to normalize differences between whispered and neutral speech modes. Our extensive experiments show that this type of speech variability can be effectively normalized. We also show that it is possible to transfer this knowledge from two source languages with whispered speech (Mandarin and English) to a new target language (Malay) without whispered speech. Furthermore, we report a substantial reduction in word error rate for cross-mode speech recognition, demonstrating that it is possible to train acoustic models capable of classifying both types of speech without needing any additional whispered speech.


DOI: 10.21437/Interspeech.2016-250

Cite as

Lim, B.P., Wong, F., Li, Y., Bay, J.W. (2016) Transfer Learning with Bottleneck Feature Networks for Whispered Speech Recognition. Proc. Interspeech 2016, 1578-1582.

Bibtex
@inproceedings{Lim+2016,
author={Boon Pang Lim and Faith Wong and Yuyao Li and Jia Wei Bay},
title={Transfer Learning with Bottleneck Feature Networks for Whispered Speech Recognition},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-250},
url={http://dx.doi.org/10.21437/Interspeech.2016-250},
pages={1578--1582}
}