Recent improvements are presented for phonetic decoding of continuous speech from ultrasound and optical observations of the tongue and lips in a silent speech interface application. In a new approach to this critical step, the visual streams are modeled by context-dependent multi-stream Hidden Markov Models (CD-MSHMMs). Results are compared to a baseline system that uses context-independent modeling and a visual feature fusion strategy, with both systems evaluated on a one-hour, phonetically balanced English speech database. Tongue and lip images are coded using PCA-based feature extraction techniques. The uttered speech signal, also recorded, is used to initialize the training of the visual HMMs. Visual phonetic decoding performance is evaluated both with and without linguistic constraints introduced via a 2.5k-word decoding dictionary.
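As a rough illustration of the PCA-based image coding step mentioned in the abstract, the short Python sketch below (not the authors' code; the frame size, frame count, and 30 retained components are assumed for illustration) projects vectorized tongue frames onto their principal components to obtain per-frame visual feature vectors of the kind fed to the visual HMMs.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical data: 500 vectorized 64x64 ultrasound tongue frames.
    frames = rng.random((500, 64 * 64))

    # Center the data and compute the principal axes via SVD.
    mean_frame = frames.mean(axis=0)
    centered = frames - mean_frame
    _, _, vt = np.linalg.svd(centered, full_matrices=False)

    # Assumed number of retained PCA components (illustrative choice).
    n_components = 30
    basis = vt[:n_components]

    # Project each frame onto the retained components; each row is a
    # per-frame visual feature vector.
    features = centered @ basis.T
    print(features.shape)  # (500, 30)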
Cite as: Hueber, T., Benaroya, E.-L., Chollet, G., Denby, B., Dreyfus, G., Stone, M. (2009) Visuo-phonetic decoding using multi-stream and context-dependent models for an ultrasound-based silent speech interface. Proc. Interspeech 2009, 640-643, doi: 10.21437/Interspeech.2009-226
@inproceedings{hueber09_interspeech,
  author={Thomas Hueber and Elie-Laurent Benaroya and Gérard Chollet and Bruce Denby and Gérard Dreyfus and Maureen Stone},
  title={{Visuo-phonetic decoding using multi-stream and context-dependent models for an ultrasound-based silent speech interface}},
  year=2009,
  booktitle={Proc. Interspeech 2009},
  pages={640--643},
  doi={10.21437/Interspeech.2009-226}
}