10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Multimodal HMM-Based NAM-to-Speech Conversion

Viet-Anh Tran (1), Gérard Bailly (1), Hélène Lœvenbruck (1), Tomoki Toda (2)

(1) GIPSA, France
(2) NAIST, Japan

Although the segmental intelligibility of speech converted from silent speech using the direct signal-to-signal mapping proposed by Toda et al. [1] is quite acceptable, listeners sometimes have difficulty chunking the speech continuum into meaningful words because the output signals provide incomplete phonetic cues. This paper studies another approach, which combines HMM-based statistical speech recognition and synthesis techniques, together with training on aligned corpora, to convert silent speech to audible voice. By introducing phonological constraints, such systems are expected to improve the phonetic consistency of the output signals. Facial movements are used to improve the performance of both the recognition and synthesis procedures. The results show that including these movements improves the recognition rate by 6.2%, and a final improvement of 2.7% in spectral distortion is observed. Finally, the paper discusses the comparison between direct signal-to-signal and phonetic-based mappings.

Bibliographic reference. Tran, Viet-Anh / Bailly, Gérard / Lœvenbruck, Hélène / Toda, Tomoki (2009): "Multimodal HMM-based NAM-to-speech conversion", In INTERSPEECH-2009, 656-659.