We address the problem of identifying (from text) and generating pitch accents in HMM-based English TTS synthesis. We show, through a large-scale perceptual test, that a large improvement in the binary discrimination between pitch-accented and non-accented words has no effect on the quality of the speech generated by the system. On the other hand, adding a third accent type that emphatically marks words conveying "contrastive" focus (automatically identified from text) has a beneficial effect on the synthesized speech. These results support accounts of prosodic prominence that treat the prosodic patterns of utterances as hierarchically structured, and they point to the limits of flattening that structure into a simple accent/non-accent distinction.
Index Terms: speech synthesis, HMM, pitch accents, focus detection
Bibliographic reference. Badino, Leonardo / Clark, Robert A. J. / Wester, Mirjam (2012): "Towards hierarchical prosodic prominence generation in TTS synthesis", In INTERSPEECH-2012, 2398-2401.