We propose an unsupervised prominence prediction method for expressive speech synthesis. Prominence patterns are learned by statistical analysis of prosodic features extracted from speech data. The advantages of our unsupervised data-driven prominence prediction include: easy adaptation to new speakers, speech styles, and even languages without requiring expert knowledge or complicated linguistic rules. In this approach, first, prominence predictive prosodic features are extracted at the foot level. Next, the extracted prosodic features are clustered, each cluster representing a prominence level. Based on just-noticeable-differences of prosodic features, the optimal number of perceptually distinct prominence levels is determined. Finally, the proposed prominence prediction is applied to prosody prediction for unit selection speech synthesis. Perceptual evaluation results show a preference for a 4-level unsupervised prominence prediction over a rule-based baseline in terms of naturalness and expressiveness of synthesized speech.
Bibliographic reference. Mehrabani, Mahnoosh / Mishra, Taniya / Conkie, Alistair (2013): "Unsupervised prominence prediction for speech synthesis", In INTERSPEECH-2013, 1559-1563.