Interspeech'2005 - Eurospeech
In this paper, we introduce a German viseme inventory for transcribing text into visemes on the basis of its phonetic transcription. Such a viseme set is essential for speech-driven audio-visual synthesis because the selection of appropriate video segments is based on the visemically transcribed input text.
For text-to-speech synthesis, the input text is transcribed into a phonemic representation in order to avoid ambiguities, to obtain the correct pronunciation, and to provide labels for unit-selection-based synthesis systems. Likewise, visual synthesis requires a transcription that represents, analogous to the phonemes, their visual counterpart. This counterpart is called a viseme in the related literature, and it also serves as a unit label in our data-driven, video-realistic audio-visual synthesis system.
We developed an inventory of German viseme classes with SAMPA-like labelling and trained a model for the automatic visemic transcription of input text.
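To illustrate what a visemic transcription derived from a phonetic one might look like, the following is a minimal sketch that maps SAMPA-style phoneme symbols to viseme class labels with a many-to-one lookup table. The class labels (`P`, `F`, `T`, `A`, `O`), the table contents, and the function name `to_visemes` are hypothetical and chosen only for illustration; they are not the inventory proposed in the paper, and a fixed table is a simplification that stands in for the trained transcription model.

```python
# Hypothetical phoneme-to-viseme table (illustrative only, NOT the paper's inventory).
# Several phonemes that look alike on the lips share one viseme class.
PHONEME_TO_VISEME = {
    "p": "P", "b": "P", "m": "P",   # bilabial closure
    "f": "F", "v": "F",             # labiodental
    "t": "T", "d": "T", "n": "T",   # alveolar
    "a": "A", "a:": "A",            # open, unrounded vowels
    "o:": "O", "O": "O",            # rounded vowels
    "u:": "O", "U": "O",
}


def to_visemes(phonemes, table=PHONEME_TO_VISEME, unknown="?"):
    """Map a sequence of SAMPA-style phoneme symbols to viseme labels.

    Symbols missing from the table are mapped to the `unknown` label
    instead of raising an error.
    """
    return [table.get(p, unknown) for p in phonemes]


if __name__ == "__main__":
    # "Bahn" in a SAMPA-style transcription: b a: n
    print(to_visemes(["b", "a:", "n"]))
```

A context-independent table like this cannot capture coarticulation effects, which is one reason a trained model may be preferable to a plain lookup.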
Bibliographic reference. Weiss, Christian / Aschenberner, Bianca (2005): "A German viseme-set for automatic transcription of input text used for audio-visual speech synthesis", In INTERSPEECH-2005, 2945-2948.