Fast and accurate synthesisers of audio-visual speech have a number of potential applications, including the improvement of oral language skills. An audio-visual speech synthesiser has been built which can generate high-resolution, colour animations of the oral region for any English sentence from text. The synthesiser uses a data-driven approach in which statistical models of visible oral gestures were trained on video recordings of a real speaker. The advantages of this approach over synthesisers based on 3-D facial models are that i) the displayed mouth contains all the visible articulators, including the teeth, tongue and skin shading, ii) the audio and visual components can be generated in synchrony, and iii) the animations can be generated in close to real time. This audio-visual speech synthesiser is an extension of earlier work on a prototype video speech synthesiser capable of generating low-resolution, greyscale displays of number-word strings.
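A data-driven pipeline of this general shape can be sketched as follows. This is only an illustrative sketch, not the paper's actual method: the table of per-phoneme mouth-shape parameters, the function names, and the linear interpolation between keyframes are all hypothetical stand-ins for the statistical models trained on the recorded speaker.

```python
# Illustrative sketch of a data-driven visual speech pipeline.
# All names and values here are hypothetical, not from the paper:
# in practice the parameters would come from statistical models of
# mouth images trained on video recordings of a real speaker.

# Hypothetical lookup: phoneme -> mouth-shape parameter vector
# (e.g. weights that could drive reconstruction of a mouth image).
VISEME_PARAMS = {
    "m": [0.0, 0.1],
    "ah": [0.9, 0.4],
    "th": [0.3, 0.8],
}

def interpolate(a, b, t):
    """Linearly interpolate between two parameter vectors."""
    return [x + t * (y - x) for x, y in zip(a, b)]

def synthesise_frames(phonemes, frames_per_transition=4):
    """Generate a smooth parameter trajectory for a phoneme sequence.

    Each frame's parameter vector would, in a full system, be used to
    reconstruct one video frame of the oral region, in synchrony with
    the corresponding audio.
    """
    frames = []
    for p0, p1 in zip(phonemes, phonemes[1:]):
        a, b = VISEME_PARAMS[p0], VISEME_PARAMS[p1]
        for i in range(frames_per_transition):
            frames.append(interpolate(a, b, i / frames_per_transition))
    frames.append(VISEME_PARAMS[phonemes[-1]])
    return frames

frames = synthesise_frames(["m", "ah", "th"])
print(len(frames))  # 9: two 4-frame transitions plus the final keyframe
```

Because the trajectory is a cheap table lookup plus interpolation rather than a 3-D model render, a scheme like this is consistent with the near-real-time generation the abstract describes.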
Cite as: Brooke, N.M., Scott, S.D. (1998) An audio-visual speech synthesiser. Proc. ETRW on Speech Technology in Language Learning (STiLL), 147-150
@inproceedings{brooke98_still,
  author    = {N. M. Brooke and S. D. Scott},
  title     = {{An audio-visual speech synthesiser}},
  year      = {1998},
  booktitle = {Proc. ETRW on Speech Technology in Language Learning (STiLL)},
  pages     = {147--150}
}