10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Visuo-Phonetic Decoding Using Multi-Stream and Context-Dependent Models for an Ultrasound-Based Silent Speech Interface

Thomas Hueber (1), Elie-Laurent Benaroya (1), Gérard Chollet (2), Bruce Denby (3), Gérard Dreyfus (1), Maureen Stone (4)

(1) LE-ESPCI, France
(2) LTCI, France
(3) Université Pierre et Marie Curie, France
(4) University of Maryland at Baltimore, USA

Recent improvements are presented for phonetic decoding of continuous speech from ultrasound and optical observations of the tongue and lips in a silent speech interface application. In a new approach to this critical step, the visual streams are modeled by context-dependent multi-stream Hidden Markov Models (CD-MSHMM). Results are compared to a baseline system using context-independent modeling and a visual feature fusion strategy, with both systems evaluated on a one-hour, phonetically balanced English speech database. Tongue and lip images are coded using PCA-based feature extraction techniques. The uttered speech signal, also recorded, is used to initialize the training of the visual HMMs. Visual phonetic decoding performance is evaluated successively with and without the help of linguistic constraints, introduced via a 2.5k-word decoding dictionary.
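The PCA-based coding of tongue and lip images mentioned above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: it assumes each ultrasound or lip frame has been flattened into a grayscale pixel vector, and the function name, frame size, and number of retained components (30) are arbitrary choices for the sketch.

```python
import numpy as np

def pca_features(frames, n_components=30):
    """Project flattened image frames onto their top principal components.

    frames: (n_frames, n_pixels) array of flattened grayscale images.
    Returns an (n_frames, n_components) array of visual feature vectors,
    one low-dimensional descriptor per frame.
    """
    # Center the data: PCA directions are computed on mean-removed frames.
    mean = frames.mean(axis=0)
    centered = frames - mean
    # SVD of the centered data matrix; rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Project each frame onto the leading principal axes.
    return centered @ vt[:n_components].T

# Illustration on synthetic data: 100 random 32x32 "frames" -> 30-dim features.
rng = np.random.default_rng(0)
frames = rng.random((100, 32 * 32))
feats = pca_features(frames, n_components=30)
print(feats.shape)  # (100, 30)
```

In a silent-speech pipeline of this kind, such per-frame feature vectors (one sequence per visual stream) would then serve as the observation streams for the multi-stream HMMs.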


Bibliographic reference: Hueber, Thomas / Benaroya, Elie-Laurent / Chollet, Gérard / Denby, Bruce / Dreyfus, Gérard / Stone, Maureen (2009): "Visuo-phonetic decoding using multi-stream and context-dependent models for an ultrasound-based silent speech interface", in Proc. INTERSPEECH 2009, pp. 640-643.