Fifth ISCA ITRW on Speech Synthesis

June 14-16, 2004
Pittsburgh, PA, USA

A Corpus-Based Approach to Expressive Speech Synthesis

E. Eide, A. Aaron, R. Bakis, W. Hamza, Michael Picheny, J. Pitrelli

IBM T.J. Watson Research Center, Yorktown Heights, NY, USA

Human speech communication can be thought of as comprising two channels: the words themselves and the style in which they are spoken. Each of these channels carries information. Today's most advanced text-to-speech (TTS) systems, such as [1], [2], [3], [4], fall far short of human speech because they offer only a single, fixed style of delivery, independent of the message. In this paper, we describe the IBM Expressive TTS Engine, which adds the style channel by offering five speaking styles: neutral declarative, conveying good news, conveying bad news, asking a question, and showing contrastive emphasis. In addition to generating speech in these five styles, our TTS system can generate paralinguistic events such as sighs, breaths, and filled pauses, which further enrich the style channel. We describe our methods for generating and evaluating expressive synthetic speech and paralinguistic effects. We show significant perceptual differences between expressive and neutral synthetic speech for each of our speaking styles. In addition, we describe how users can easily communicate the desired expression to the TTS engine through our extensions [5] of the Speech Synthesis Markup Language (SSML) [6].


  1. Eide, E., et al. Recent Improvements to the IBM Trainable Speech Synthesis System. Proc. ICASSP 2003, Hong Kong. Volume 1, pages 708-711.
  2. Black, A.W. and K. Lenzo. Building Voices in the Festival Speech Synthesis System.
  5. Eide, E., et al. Multilayered Extensions to the Speech Synthesis Markup Language for Describing Expressiveness. Proc. Eurospeech 2003, Geneva, Switzerland.
  6. Speech Synthesis Markup Language Version 1.0. W3C Working Draft. December 2002.

Bibliographic reference. Eide, E. / Aaron, A. / Bakis, R. / Hamza, W. / Picheny, Michael / Pitrelli, J. (2004): "A corpus-based approach to expressive speech synthesis", in SSW5-2004, 79-84.