The Seventh ISCA Tutorial and Research Workshop on Speech Synthesis

Kyoto, Japan
September 22-24, 2010

Evaluating Prosody in Synthetic Speech with Online (Eye-Tracking) and Offline (Rating) Methods

Rajakrishnan Rajkumar, Michael White, Shari R. Speer, Kiwako Ito

Department of Linguistics, The Ohio State University, USA

This study examines the relationship between online processing effects observed in earlier eye-tracking experiments [1, 2] and offline quality ratings gathered for the synthetic and natural speech stimuli used in those experiments, along with the stimuli's acoustic-prosodic properties. White et al. [2] reported that even high-quality synthetic speech failed to replicate the facilitative effect of contextually appropriate accent patterns found with human speech, while it produced a more robust intonational garden-path effect with contextually inappropriate patterns. They conjectured that both effects could be due to processing delays observed with the synthetic speech. In this paper, we present an acoustic analysis of the stimuli used in the eye-tracking experiments and an offline stimulus rating task designed to investigate whether a context-independent measure of utterance quality could predict processing-based effects. The analysis reveals that for synthetic speech, longer adjectives, which provide more processing time, do facilitate anticipatory looks to the target. Larger values of F0 drop (the difference between the F0 values of the adjective and the following noun) negatively influenced looks to the target and were negatively correlated with offline ratings, suggesting that F0 drop may be a specific acoustic factor that merits attention in future work on improving synthesis quality. Finally, the study shows that online measures of unconscious processing and offline measures of conscious judgments, taken together, can provide a more comprehensive evaluation of synthetic speech than either method alone.

References

  1. K. Ito and S. R. Speer, “Semantically-independent but contextually-dependent interpretation of contrastive accent,” in Prosodic categories: production, perception and comprehension, P. Prieto, S. Frota, and G. Elordieta, Eds. Springer, to appear.
  2. M. White, R. Rajkumar, K. Ito, and S. R. Speer, “Eye tracking for the online evaluation of prosody in speech synthesis: Not so fast!” in Proc. of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH-09), 2009.

Index Terms: speech synthesis, evaluation, prosody, eye tracking, unit selection

Bibliographic reference. Rajkumar, Rajakrishnan / White, Michael / Speer, Shari R. / Ito, Kiwako (2010): "Evaluating prosody in synthetic speech with online (eye-tracking) and offline (rating) methods", in SSW7-2010, 276-281.