First ISCA ITRW on Auditory Quality of Systems

April 23-25, 2003
Akademie Mont-Cenis, Germany

From Audio-Only to Audio And Video Text-to-Speech

Juergen Schroeter, Eric Cosatto, Hans Peter Graf, Joern Ostermann

AT&T Labs - Research, Florham Park, NJ, USA

Assessing the quality of Text-to-Speech systems is a complex problem. Adding face synthesis - the animation of a talking head and its rendering to video - to a TTS system makes evaluation even more difficult.

This paper reports on progress made with the AT&T sample-based Visual TTS (VTTS) system. Our system incorporates unit-selection synthesis (now well known from Audio TTS) and a moderate-size recorded database of video segments that are modified and concatenated to render the desired output.
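To illustrate the general idea of unit selection applied to video, the sketch below shows a dynamic-programming (Viterbi) search over candidate recorded segments that minimizes a combined target and concatenation cost. The data structure, feature names, and cost weights are illustrative assumptions for this sketch, not the actual AT&T VTTS implementation.

```python
# Minimal sketch of unit-selection concatenation. Names (VideoUnit,
# target_cost, join_cost) and cost weights are assumptions for illustration.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VideoUnit:
    phoneme: str          # phonetic label of the recorded mouth segment
    duration_ms: float    # length of the segment
    mouth_open: float     # mouth-opening feature at the segment boundary

def target_cost(unit: VideoUnit, want_phoneme: str, want_duration: float) -> float:
    """How well a candidate segment matches the requested target."""
    mismatch = 0.0 if unit.phoneme == want_phoneme else 1.0
    return mismatch + abs(unit.duration_ms - want_duration) / 100.0

def join_cost(prev: VideoUnit, cur: VideoUnit) -> float:
    """How smoothly two segments concatenate (boundary discontinuity)."""
    return abs(prev.mouth_open - cur.mouth_open)

def select_units(targets: List[Tuple[str, float]],
                 candidates: List[List[VideoUnit]]) -> List[VideoUnit]:
    """Viterbi search over candidate units, minimizing target + join cost."""
    # best[i][j] = (cumulative cost, backpointer) for j-th candidate of target i
    best = [[(target_cost(u, *targets[0]), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(u, *targets[i])
            cost, back = min(
                (best[i - 1][k][0] + join_cost(candidates[i - 1][k], u) + tc, k)
                for k in range(len(candidates[i - 1]))
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```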

The higher the quality of a VTTS system, the more important it is to evaluate all algorithmic choices carefully. Naturally, subjective testing, although time-consuming and expensive, has to be the ultimate measure. However, we used objective measures for quality assessment during the development phase of our system. For example, we found that the accuracy and timeliness of lip closures and protrusions, turning points (where a speaker's mouth changes direction from opening to closing), and the overall smoothness of the articulation are critical for achieving high quality.
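As a rough sketch of such objective checks, the code below measures the timing error of lip closures against a phonetic plan and a smoothness score for a mouth-opening trajectory. The thresholds, feature names, and function signatures are assumptions made for illustration, not values or code from the AT&T system.

```python
# Sketch of objective quality checks on a mouth-opening trajectory:
# closure timing error and articulation smoothness. Thresholds and
# names are illustrative assumptions.
import numpy as np

def closure_timing_error(mouth_open: np.ndarray, frame_rate: float,
                         planned_closures_ms: list,
                         closed_thresh: float = 0.1) -> list:
    """For each planned bilabial closure, report the offset (ms) to the
    nearest frame where the mouth is actually closed."""
    closed_frames = np.where(mouth_open < closed_thresh)[0]
    errors = []
    for t_ms in planned_closures_ms:
        if len(closed_frames) == 0:
            errors.append(float("inf"))  # closure never realized
            continue
        closed_times = closed_frames * 1000.0 / frame_rate
        errors.append(float(np.min(np.abs(closed_times - t_ms))))
    return errors

def articulation_smoothness(mouth_open: np.ndarray) -> float:
    """Mean squared second difference of the trajectory; lower is smoother,
    penalizing jerky concatenation boundaries."""
    accel = np.diff(mouth_open, n=2)
    return float(np.mean(accel ** 2))
```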

At the workshop, we will give an overview of the architecture and the evaluation of the AT&T VTTS system. This system passes the Turing test of being "as good as recorded" for a significant fraction of all test sentences.

Bibliographic reference. Schroeter, Juergen / Cosatto, Eric / Graf, Hans Peter / Ostermann, Joern (2003): "From audio-only to audio and video Text-to-Speech", In AQS-2003, 117 (abstract).