Although the segmental intelligibility of speech converted from silent speech using the direct signal-to-signal mapping proposed by Toda et al. is quite acceptable, listeners sometimes have difficulty chunking the speech continuum into meaningful words because the output signals provide incomplete phonetic cues. This paper studies an alternative approach that combines HMM-based statistical speech recognition and synthesis techniques, together with training on aligned corpora, to convert silent speech to audible voice. By introducing phonological constraints, such systems are expected to improve the phonetic consistency of the output signals. Facial movements are used to improve the performance of both the recognition and synthesis procedures. The results show that including these movements improves the recognition rate by 6.2% and yields a final 2.7% improvement in spectral distortion. Finally, the direct signal-to-signal and phonetic-based mappings are compared and discussed.
Bibliographic reference. Tran, Viet-Anh / Bailly, Gérard / Lœvenbruck, Hélène / Toda, Tomoki (2009): "Multimodal HMM-based NAM-to-speech conversion", In INTERSPEECH-2009, 656-659.