This paper presents a corpus-based approach to communicative speech synthesis. As an initial attempt to synthesize speech with the expressiveness appropriate to human-human or human-machine dialog, we chose a "good news" style and a "bad news" style. We used a 10-hour "neutral" style speech corpus together with smaller corpora in the good news and bad news styles, each consisting of two to three hours of speech from the same speaker. We trained target HMMs for each style and synthesized speech with unit databases containing speech in the relevant style as well as neutral speech. Listening tests showed that listeners perceived the intended communicative styles and that a considerably high mean opinion score for naturalness was achieved with rather small, style-specific corpora.
Cite as: Sakai, S., Ni, J., Maia, R., Tokuda, K., Tsuzaki, M., Toda, T., Kawai, H., Nakamura, S. (2007) Communicative speech synthesis with XIMERA: a first step. Proc. 6th ISCA Workshop on Speech Synthesis (SSW 6), 28-33
@inproceedings{sakai07_ssw,
  author={Shinsuke Sakai and Jinfu Ni and Ranniery Maia and Keiichi Tokuda and Minoru Tsuzaki and Tomoki Toda and Hisashi Kawai and Satoshi Nakamura},
  title={{Communicative speech synthesis with XIMERA: a first step}},
  year=2007,
  booktitle={Proc. 6th ISCA Workshop on Speech Synthesis (SSW 6)},
  pages={28--33}
}