This study proposes model- and feature-based strategies for automatic whispered speech recognition. Our goal is to compensate for the mismatch between neutral-trained recognizer models and the parameters of whispered speech. We propose generating pseudo-whisper samples from neutral speech samples for efficient acoustic model adaptation. The scheme is based on the popular Vector Taylor Series (VTS) algorithm. First, a "background" model is trained on a small amount of whispered data to capture a rough estimate of the target whispered speech characteristics. Second, this target background model is used in the VTS strategy to establish transformations for broad phone classes (consonants and vowels) for individual neutral utterances and to transform those utterances towards whisper. Finally, the resulting pseudo-whisper samples are used to adapt the neutral recognizer models towards whisper. This approach is evaluated together with Vocal Tract Length Normalization (VTLN) and Shift frequency transforms and is shown to greatly benefit recognition performance compared to a traditional whisper-adaptation approach. The absolute WER is reduced from 17.3% to 8.4% in the closed-speakers whisper scenario and from 27.7% to 17.5% in the open-speakers scenario.
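The reported error rates can also be read as relative reductions. The short Python sketch below does that arithmetic using only the four WER figures quoted in the abstract; the helper name `wer_reduction` and the scenario labels are illustrative assumptions, not names from the paper.

```python
def wer_reduction(baseline: float, proposed: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction in percent)."""
    absolute = baseline - proposed
    relative = 100.0 * absolute / baseline
    return absolute, relative


# WER figures (in %) as quoted in the abstract: (before, after).
scenarios = {
    "closed speakers": (17.3, 8.4),
    "open speakers": (27.7, 17.5),
}

for name, (baseline, proposed) in scenarios.items():
    absolute, relative = wer_reduction(baseline, proposed)
    print(f"{name}: {absolute:.1f} points absolute, {relative:.1f}% relative")
```

Under these figures, the closed-speakers reduction is 8.9 percentage points (roughly 51% relative) and the open-speakers reduction is 10.2 percentage points (roughly 37% relative).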
Bibliographic reference. Ghaffarzadegan, Shabnam / Bořil, Hynek / Hansen, John H. L. (2014): "Model and feature based compensation for whispered speech recognition", In INTERSPEECH-2014, 2420-2424.