FSM and k-nearest-neighbor for corpus based video-realistic audio-visual synthesis

Christian Weiss

In this paper we introduce a corpus-based 2D video-realistic audio-visual synthesis system. The system combines a concatenative Text-to-Speech (TTS) system with a concatenative Text-to-Visual (TTV) system to form a Text-to-Audio-Visual-Speech (TTAVS) system in which audio and lip movement are synchronized. For the concatenative TTS we use a Finite State Machine approach to select non-uniform, variable-size audio segments. Analogously to the TTS, a k-Nearest-Neighbor algorithm is applied to select the visual segments. Prior to selection, we apply image filtering to extract features; these features are used in a Euclidean distance measure to minimize distortions when concatenating the visual segments. For the Euclidean metric we consider only the start frame and end frame of each potential video-frame sequence. The visual counterparts of the selected segments are chosen on the basis of a visemic transcription derived from the phonemic transcription of the given input text. Because the speech and video come from independent source databases, we synchronize the generated signals linearly. The result is an audio-visual utterance in which lip movement is synchronized with the audio. The system is easily adapted to new speakers, whether by using a different speech source or a different video source.
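
The visual selection and linear synchronization steps described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the representation of frames as plain feature vectors, the choice to compare the end frame of the previously selected segment with the start frame of each candidate, and the reading of linear synchronization as evenly spaced frame times are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest_segments(prev_end_features, candidates, k=3):
    """Return the k candidates whose start frame is closest to the
    end frame of the previously selected segment.

    candidates: list of (segment_id, start_frame_features) pairs
                (hypothetical layout, assumed for this sketch).
    Returns a list of (distance, segment_id), nearest first.
    """
    scored = [
        (euclidean(prev_end_features, feats), seg_id)
        for seg_id, feats in candidates
    ]
    # Sort on distance only, so ties never compare segment ids.
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


def linear_frame_times(n_frames, audio_duration):
    """Spread n_frames evenly over audio_duration seconds.

    This is one simple reading of synchronizing "in a linear way":
    every frame receives the same display time.
    """
    step = audio_duration / n_frames
    return [i * step for i in range(n_frames)]


if __name__ == "__main__":
    prev_end = [0.0, 0.0]
    candidates = [("a", [3.0, 4.0]), ("b", [1.0, 0.0]), ("c", [0.0, 2.0])]
    print(k_nearest_segments(prev_end, candidates, k=2))
    print(linear_frame_times(4, 2.0))
```

Restricting the distance to boundary frames keeps the cost of each comparison independent of segment length, which matches the abstract's statement that only start and end frames enter the Euclidean metric.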

doi: 10.21437/Interspeech.2005-789

Cite as: Weiss, C. (2005) FSM and k-nearest-neighbor for corpus based video-realistic audio-visual synthesis. Proc. Interspeech 2005, 2537-2540, doi: 10.21437/Interspeech.2005-789

