7th International Conference on Spoken Language Processing

September 16-20, 2002
Denver, Colorado, USA

Expressive Speech Synthesis Using a Concatenative Synthesizer

Murtaza Bulut (1), Shrikanth S. Narayanan (1), Ann K. Syrdal (2)

(1) University of Southern California, USA; (2) AT&T Labs - Research, USA

This paper describes an experiment in synthesizing four emotional states (anger, happiness, sadness, and neutral) using a concatenative speech synthesizer. To achieve this, five emotionally (i.e., semantically) unbiased target sentences were prepared. Separate speech inventories, each comprising the target diphones, were then recorded for each of the four emotions. Combining each emotion's prosody with each emotion's inventory gave 16 prosody-inventory combinations, which, applied to the five sentences, yielded 80 synthetic test sentences. The results were evaluated in listening tests with 33 naïve listeners. Synthesized anger was recognized with 86.1% accuracy, sadness with 89.1%, happiness with 44.2%, and neutral emotion with 81.8% accuracy. According to our results, anger was classified as inventory dominant, and sadness and neutral as prosody dominant. The results were not sufficient to draw a similar conclusion about happiness. The highest recognition accuracies were achieved for sentences synthesized using prosody and a diphone inventory belonging to the same emotion.
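The stimulus count in the abstract follows from a full crossing of prosody source, inventory source, and target sentence. The sketch below enumerates that design as a check on the arithmetic. It is illustrative only: the names `EMOTIONS`, `NUM_SENTENCES`, and `build_test_set` are hypothetical and do not come from the paper, and the sentences are represented by index because the abstract does not give their text.

```python
from itertools import product

# The four emotional states named in the abstract.
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]
# Number of semantically unbiased target sentences.
NUM_SENTENCES = 5


def build_test_set(emotions=EMOTIONS, num_sentences=NUM_SENTENCES):
    """Enumerate every (sentence, prosody, inventory) stimulus.

    Hypothetical helper: each stimulus pairs the prosody of one emotion
    with the diphone inventory of one emotion for one target sentence.
    """
    stimuli = []
    for sentence_id in range(1, num_sentences + 1):
        for prosody, inventory in product(emotions, repeat=2):
            stimuli.append(
                {
                    "sentence": sentence_id,
                    "prosody": prosody,
                    "inventory": inventory,
                    # True when prosody and inventory share an emotion.
                    "matched": prosody == inventory,
                }
            )
    return stimuli


stimuli = build_test_set()
conditions = {(s["prosody"], s["inventory"]) for s in stimuli}
matched = [s for s in stimuli if s["matched"]]

print(len(conditions))  # 16 prosody-inventory combinations
print(len(stimuli))     # 80 synthetic test sentences
print(len(matched))     # 20 stimuli with matching prosody and inventory
```

Of the 80 stimuli, 20 use prosody and inventory from the same emotion. The remaining 60 mismatched stimuli are the ones that allow an emotion to be judged inventory dominant or prosody dominant.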


Bibliographic reference.  Bulut, Murtaza / Narayanan, Shrikanth S. / Syrdal, Ann K. (2002): "Expressive speech synthesis using a concatenative synthesizer", In ICSLP-2002, 1265-1268.