ISCA Archive AQS 2003

From audio-only to audio and video Text-to-Speech

Juergen Schroeter, Eric Cosatto, Hans Peter Graf, Joern Ostermann

Assessing the quality of Text-to-Speech systems is a complex problem. Adding face synthesis - the animation of a talking head and its rendering to video - to a TTS system makes evaluation even more difficult.

This paper reports on progress made with the AT&T sample-based Visual TTS (VTTS) system. Our system incorporates unit-selection synthesis (now well known from Audio TTS) and a moderate-size recorded database of video segments that are modified and concatenated to render the desired output.

The higher the quality of a VTTS system, the more important it is to evaluate every algorithmic choice carefully. Subjective testing, although time-consuming and expensive, has to be the ultimate measure. During the development phase of our system, however, we used objective measures for quality assessment. For example, we found that the accuracy and timeliness of lip closures and protrusions, of turning points (where a speaker’s mouth changes direction from opening to closing), and the overall smoothness of the articulation are critical for achieving high quality.
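As an illustration only, objective measures of this kind could be sketched as follows. The abstract does not describe the measures actually used, so everything here is an assumption: the mouth opening is taken to be one number per video frame, a turning point is taken to be a strict local maximum of that trajectory, timing error is counted in frames against a recorded reference, and smoothness is approximated by the mean absolute second difference. The function names are hypothetical.

```python
def turning_points(opening):
    """Indices where the mouth switches from opening to closing.

    Assumption: a turning point is a strict local maximum of the
    per-frame mouth-opening value. Flat plateaus are not detected.
    """
    return [
        i
        for i in range(1, len(opening) - 1)
        if opening[i] > opening[i - 1] and opening[i] > opening[i + 1]
    ]


def timing_errors(ref_points, syn_points):
    """Distance in frames from each reference turning point to the
    nearest synthesized one. Returns an empty list if the synthesized
    trajectory has no turning points."""
    if not syn_points:
        return []
    return [min(abs(r - s) for s in syn_points) for r in ref_points]


def roughness(opening):
    """Mean absolute second difference of the trajectory.

    Assumption: lower values stand in for smoother articulation.
    """
    second = [
        opening[i + 1] - 2 * opening[i] + opening[i - 1]
        for i in range(1, len(opening) - 1)
    ]
    return sum(abs(d) for d in second) / len(second) if second else 0.0


# Made-up trajectories, purely for demonstration (not data from the paper).
reference = [0, 2, 4, 3, 1, 2, 5, 2]
synthesized = [0, 1, 3, 4, 2, 1, 4, 3]

ref_tp = turning_points(reference)
syn_tp = turning_points(synthesized)
errors = timing_errors(ref_tp, syn_tp)
```

A measure like this can only flag candidate problems during development; as the text states, subjective testing remains the ultimate measure.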

At the workshop, we will give an overview of the architecture and the evaluation of the AT&T VTTS system. This system passes the Turing test of being "as good as recorded" for a significant fraction of all test sentences.

Cite as: Schroeter, J., Cosatto, E., Graf, H.P., Ostermann, J. (2003) From audio-only to audio and video Text-to-Speech. Proc. First ISCA Workshop on Auditory Quality of Systems, 117

  author={Juergen Schroeter and Eric Cosatto and Hans Peter Graf and Joern Ostermann},
  title={{From audio-only to audio and video Text-to-Speech}},
  booktitle={Proc. First ISCA Workshop on Auditory Quality of Systems},