Individuals with an impaired larynx (vocal folds) have difficulty controlling their glottal vibration and produce whispered speech with extreme hoarseness. Standard automatic speech recognition using only acoustic cues is typically ineffective for whispered speech because its spectral characteristics are distorted. Articulatory cues such as tongue and lip motion may help in recognizing whispered speech, since articulatory motion patterns are generally not affected. In this paper, we investigated whispered speech recognition for patients with a reconstructed larynx using articulatory movement data. A data set containing both acoustic and articulatory motion data was collected from a patient with a surgically reconstructed larynx using an electromagnetic articulograph. Two speech recognition systems, Gaussian mixture model-hidden Markov model (GMM-HMM) and deep neural network-HMM (DNN-HMM), were used in the experiments. Experimental results showed that adding either tongue or lip motion data to acoustic features such as mel-frequency cepstral coefficients (MFCCs) significantly reduced the phone error rates of both speech recognition systems. Adding both tongue and lip data achieved the best performance.
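As an illustration of the feature-fusion step the abstract describes, the sketch below concatenates per-frame MFCCs with time-aligned tongue and lip trajectories before they would be passed to a GMM-HMM or DNN-HMM recognizer. This is a minimal sketch under stated assumptions: the frame shift, EMA sampling rate, sensor channel layout, and helper names are illustrative and are not taken from the paper.

import numpy as np
import librosa
from scipy.interpolate import interp1d

FRAME_SHIFT_S = 0.010  # 10 ms analysis shift, a common ASR front-end setting (assumption)

def mfcc_frames(wav_path, n_mfcc=13):
    # Return MFCCs with shape (num_frames, n_mfcc).
    y, sr = librosa.load(wav_path, sr=None)
    hop = int(sr * FRAME_SHIFT_S)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T

def resample_articulatory(ema, ema_rate_hz, num_frames):
    # Linearly interpolate EMA trajectories (num_samples, num_channels),
    # e.g. tongue tip, tongue body, upper lip, lower lip x/y positions,
    # onto the acoustic frame times.
    src_t = np.arange(ema.shape[0]) / ema_rate_hz
    tgt_t = np.arange(num_frames) * FRAME_SHIFT_S
    tgt_t = np.clip(tgt_t, src_t[0], src_t[-1])  # stay inside the recorded range
    return interp1d(src_t, ema, axis=0)(tgt_t)

def fuse_features(wav_path, ema, ema_rate_hz=100.0):
    # Concatenate acoustic and articulatory features frame by frame.
    acoustic = mfcc_frames(wav_path)
    artic = resample_articulatory(ema, ema_rate_hz, acoustic.shape[0])
    return np.hstack([acoustic, artic])  # shape: (num_frames, n_mfcc + num_channels)

The fused feature matrix can then be used in place of acoustic-only features when training either recognizer; dropping the articulatory columns recovers the MFCC-only baseline for comparison.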
Cite as: Cao, B., Kim, M., Mau, T., Wang, J. (2016) Recognizing Whispered Speech Produced by an Individual with Surgically Reconstructed Larynx Using Articulatory Movement Data. Proc. 7th Workshop on Speech and Language Processing for Assistive Technologies (SLPAT 2016), 80-86, doi: 10.21437/SLPAT.2016-14
@inproceedings{cao16_slpat,
  author    = {Beiming Cao and Myungjong Kim and Ted Mau and Jun Wang},
  title     = {{Recognizing Whispered Speech Produced by an Individual with Surgically Reconstructed Larynx Using Articulatory Movement Data}},
  booktitle = {Proc. 7th Workshop on Speech and Language Processing for Assistive Technologies (SLPAT 2016)},
  year      = {2016},
  pages     = {80--86},
  doi       = {10.21437/SLPAT.2016-14}
}