This article presents a framework for a phonetic vocoder driven by ultrasound and optical images of the tongue and lips for a "silent speech interface" application. The system is built around an HMM-based visual phone recognition step, which provides target phonetic sequences from a continuous visual observation stream. This phonetic target sequence constrains the search for the optimal sequence of diphones that maximizes similarity to the input test data in visual space, subject to a unit concatenation cost in the acoustic domain. The final speech waveform is generated using "Harmonic plus Noise Model" synthesis techniques. Experimental results are based on a one-hour audiovisual database of continuous speech comprising ultrasound images of the tongue and both frontal and lateral views of the speaker's lips.
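The diphone search described above can be sketched as a Viterbi-style unit selection: each phonetic target position has several candidate units, a target cost measures visual-space distance to the input observations, and a concatenation cost penalizes acoustic mismatch at unit joins. The sketch below uses NumPy with synthetic data; all array names, feature dimensions, and the Euclidean cost functions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 5 phonetic targets, 4 candidate diphone units each.
input_visual = rng.normal(size=(5, 8))        # visual features of the input stream
cand_visual = rng.normal(size=(5, 4, 8))      # candidates' visual features
cand_ac_start = rng.normal(size=(5, 4, 6))    # acoustic features at unit start
cand_ac_end = rng.normal(size=(5, 4, 6))      # acoustic features at unit end

def unit_selection(input_visual, cand_visual, cand_ac_start, cand_ac_end, w=1.0):
    """Viterbi search minimizing visual target cost plus weighted
    acoustic concatenation cost (a generic unit-selection sketch)."""
    T, K, _ = cand_visual.shape
    # Target cost: Euclidean distance to the input in visual space.
    target = np.linalg.norm(cand_visual - input_visual[:, None, :], axis=-1)
    cost = target[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # Concatenation cost: acoustic mismatch at the join between
        # the end of the previous unit and the start of the current one.
        join = np.linalg.norm(
            cand_ac_end[t - 1][:, None, :] - cand_ac_start[t][None, :, :], axis=-1
        )
        total = cost[:, None] + w * join          # shape (K_prev, K_cur)
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(K)] + target[t]
    # Backtrack the lowest-cost unit sequence.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

seq = unit_selection(input_visual, cand_visual, cand_ac_start, cand_ac_end)
print(seq)  # one candidate index per phonetic target
```

In the actual system the selected diphones would then feed the HNM synthesizer; here the output is just the chosen candidate indices.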
Bibliographic reference. Hueber, Thomas / Chollet, Gérard / Denby, Bruce / Dreyfus, Gérard / Stone, Maureen (2008): "Towards a segmental vocoder driven by ultrasound and optical images of the tongue and lips", In INTERSPEECH-2008, 2028-2031.