INTERSPEECH 2008
9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Towards a Segmental Vocoder Driven by Ultrasound and Optical Images of the Tongue and Lips

Thomas Hueber (1), Gérard Chollet (2), Bruce Denby (3), Gérard Dreyfus (1), Maureen Stone (4)

(1) LE-ESPCI, France; (2) LTCI, France; (3) Université Pierre et Marie Curie, France; (4) University of Maryland, USA

This article presents a framework for a phonetic vocoder driven by ultrasound and optical images of the tongue and lips for a "silent speech interface" application. The system is built around an HMM-based visual phone recognition step, which provides target phonetic sequences from a continuous stream of visual observations. The phonetic target constrains the search for the optimal sequence of diphones: the search maximizes similarity to the input test data in the visual space, subject to a unit concatenation cost in the acoustic domain. The final speech waveform is generated using "Harmonic plus Noise Model" synthesis techniques. Experimental results are based on a one-hour continuous-speech audiovisual database comprising ultrasound images of the tongue and both frontal and lateral views of the speaker's lips.
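The unit-selection step described above amounts to a Viterbi-style dynamic programming search: for each target diphone, candidate units are scored by a visual-space target cost, and transitions between consecutive units are penalized by an acoustic concatenation cost. The following is a minimal sketch of that search under stated assumptions; the function names and cost definitions are illustrative, not taken from the paper:

```python
def select_diphones(candidates, target_cost, concat_cost):
    """Viterbi search over candidate diphone units.

    candidates[t]      : list of unit ids available for the t-th target diphone
    target_cost(t, u)  : visual-space distance between input segment t and unit u
    concat_cost(u, v)  : acoustic join cost between consecutive units u and v
    Returns the unit sequence minimizing total target + concatenation cost.
    """
    T = len(candidates)
    # best[t][j] = (cumulative cost, backpointer into candidates[t-1])
    best = [[(target_cost(0, u), -1) for u in candidates[0]]]
    for t in range(1, T):
        row = []
        for u in candidates[t]:
            # cheapest way to reach unit u from any unit of the previous slot
            cost, back = min(
                (best[t - 1][j][0] + concat_cost(v, u), j)
                for j, v in enumerate(candidates[t - 1])
            )
            row.append((target_cost(t, u) + cost, back))
        best.append(row)
    # backtrack from the cheapest final unit
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for t in range(T - 1, -1, -1):
        path.append(candidates[t][j])
        j = best[t][j][1]
    return path[::-1]
```

In the paper's setting the target cost would be computed from ultrasound/optical features and the concatenation cost from acoustic features of the recorded diphones; here both are simply caller-supplied functions.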


Bibliographic reference. Hueber, Thomas / Chollet, Gérard / Denby, Bruce / Dreyfus, Gérard / Stone, Maureen (2008): "Towards a segmental vocoder driven by ultrasound and optical images of the tongue and lips", in Proc. INTERSPEECH 2008, pp. 2028-2031.