ISCA Archive Interspeech 2009

Multimodal HMM-based NAM-to-speech conversion

Viet-Anh Tran, Gérard Bailly, Hélène Lœvenbruck, Tomoki Toda

Although the segmental intelligibility of speech converted from silent speech using the direct signal-to-signal mapping proposed by Toda et al. [1] is quite acceptable, listeners sometimes have difficulty chunking the speech continuum into meaningful words because the output signals provide incomplete phonetic cues. This paper studies another approach that combines HMM-based statistical speech recognition and synthesis techniques, trained on aligned corpora, to convert silent speech into audible voice. By introducing phonological constraints, such systems are expected to improve the phonetic consistency of the output signals. Facial movements are used to improve the performance of both the recognition and synthesis procedures. The results show that including these movements improves the recognition rate by 6.2% and yields a final improvement in spectral distortion of 2.7%. The paper concludes with a comparison between direct signal-to-signal and phonetic-based mappings.
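The recognition stage of such a phonetic-based pipeline decodes a phone sequence from the combined NAM and facial feature streams, typically by Viterbi search over HMM states, before a separate synthesis stage produces audible speech. The following toy sketch illustrates only that decoding idea; it is not the authors' system, and every state, mean, and transition probability below is invented for the demo (a two-state model with Gaussian emissions over a 2-D multimodal frame):

```python
# Illustrative sketch only: a toy multimodal Viterbi decoder, NOT the
# system described in the paper. All parameters below are invented.
import math

# Two phone-like states with diagonal-Gaussian emissions over a 2-D
# feature vector: (nam_energy, lip_aperture) -- NAM + facial streams.
STATES = ["a", "s"]
MEANS = {"a": (1.0, 0.8), "s": (-1.0, 0.1)}
VAR = 0.25  # shared variance for both dimensions (hypothetical)
TRANS = {("a", "a"): 0.8, ("a", "s"): 0.2,
         ("s", "s"): 0.8, ("s", "a"): 0.2}
INIT = {"a": 0.5, "s": 0.5}

def log_emit(state, frame):
    """Log-likelihood of a multimodal frame under the state's Gaussian."""
    return sum(-0.5 * (x - m) ** 2 / VAR
               for x, m in zip(frame, MEANS[state]))

def viterbi(frames):
    """Return the most likely state (phone) sequence for the frames."""
    delta = {s: math.log(INIT[s]) + log_emit(s, frames[0]) for s in STATES}
    back = []
    for frame in frames[1:]:
        new_delta, ptr = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: delta[p] + math.log(TRANS[(p, s)]))
            new_delta[s] = (delta[prev] + math.log(TRANS[(prev, s)])
                            + log_emit(s, frame))
            ptr[s] = prev
        back.append(ptr)
        delta = new_delta
    # Backtrace from the best final state.
    path = [max(STATES, key=delta.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

frames = [(1.1, 0.7), (0.9, 0.9), (-1.2, 0.0), (-0.8, 0.2)]
print(viterbi(frames))  # -> ['a', 'a', 's', 's']
```

In the full pipeline, the decoded phone sequence would then drive an HMM-based synthesizer; that stage is omitted here for brevity.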

doi: 10.21437/Interspeech.2009-230

Cite as: Tran, V.-A., Bailly, G., Lœvenbruck, H., Toda, T. (2009) Multimodal HMM-based NAM-to-speech conversion. Proc. Interspeech 2009, 656-659, doi: 10.21437/Interspeech.2009-230

@inproceedings{tran09_interspeech,
  author={Viet-Anh Tran and Gérard Bailly and Hélène Lœvenbruck and Tomoki Toda},
  title={{Multimodal HMM-based NAM-to-speech conversion}},
  booktitle={Proc. Interspeech 2009},
  year={2009},
  pages={656--659},
  doi={10.21437/Interspeech.2009-230}
}