EUROSPEECH 2003 - INTERSPEECH 2003
Data sparsity is a major problem for data-driven prosodic models. Sharing prosodic data across speakers is a potential solution. This paper explores that solution by addressing two questions: 1) Does a larger, less sparse model from a different speaker produce more natural speech than a small, sparse model built from the original speaker? 2) Does a different speaker's larger model generate more unit selection errors than a small, sparse model built from the original speaker? A unit selection approach is used to produce a lazy learning model of three English RP speakers' f0 and durational parameters. Speaker 1 (the target speaker) had a much smaller database than the other two (approximately one quarter to one fifth the size). Speaker 2 was a female speaker with frequent mid-phrase rises. Speaker 3 was a male speaker with a similar f0 range to speaker 1 and a measured prosodic style suitable for news and financial text.
We apply the models created for speaker 2 (an inappropriate model) and speaker 3 (an appropriate model) to speaker 1 and compare the results. Three passages (each three to four sentences long) from challenging prosodic genres (news report, poetry and personal email) were synthesised using the target speaker and each of the three models. The synthesised utterances were played to 15 native English subjects and rated on a 5-point MOS scale. In addition, 7 experienced speech engineers rated each word for errors on a three-point scale: 1. Acceptable, 2. Poor, 3. Unacceptable.
The results suggest that a large model from an appropriate speaker does not sound more natural or produce fewer errors than a smaller model generated from the individual speaker's own data. In addition, they show that an inappropriate model produces speech that is both less natural and contains more errors. High variance in both the subject and materials analyses suggests that both tests are far from ideal and that evaluation techniques for both error rate and naturalness need to improve.
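As an illustration of the kind of analysis described above, the following minimal sketch groups 5-point MOS ratings by model, by subject, or by passage and reports the mean and sample variance of each group. It is not the authors' analysis code: the ratings, the subject and model labels, and the helper name `summarise` are all hypothetical and chosen only to show the arithmetic.

```python
from collections import namedtuple
from statistics import mean, variance

# One MOS judgement: who rated which passage under which prosody model.
Rating = namedtuple("Rating", ["subject", "passage", "model", "score"])

# Hypothetical ratings for illustration only; these are NOT data from the paper.
ratings = [
    Rating("subj01", "news", "own_small", 4),
    Rating("subj01", "news", "other_appropriate", 4),
    Rating("subj01", "news", "other_inappropriate", 2),
    Rating("subj02", "news", "own_small", 3),
    Rating("subj02", "news", "other_appropriate", 4),
    Rating("subj02", "news", "other_inappropriate", 3),
    Rating("subj01", "poetry", "own_small", 3),
    Rating("subj01", "poetry", "other_appropriate", 2),
    Rating("subj01", "poetry", "other_inappropriate", 1),
    Rating("subj02", "poetry", "own_small", 4),
    Rating("subj02", "poetry", "other_appropriate", 3),
    Rating("subj02", "poetry", "other_inappropriate", 2),
]


def summarise(rows, field):
    """Return {group: (mean score, sample variance)} for the given field."""
    groups = {}
    for row in rows:
        groups.setdefault(getattr(row, field), []).append(row.score)
    return {
        key: (mean(scores), variance(scores) if len(scores) > 1 else 0.0)
        for key, scores in groups.items()
    }


if __name__ == "__main__":
    # Compare models, then look at spread across subjects and across materials.
    for field in ("model", "subject", "passage"):
        print(field, summarise(ratings, field))
```

Grouping by `subject` and by `passage` in addition to `model` mirrors the subject and materials analyses mentioned in the abstract: a large variance within those groups would indicate that listeners or texts, rather than the prosody model, account for much of the spread in scores.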
Bibliographic reference. Aylett, Matthew / Fackrell, Justin / Rutten, Peter (2003): "My voice, your prosody: sharing a speaker specific prosody model across speakers in unit selection TTS", In EUROSPEECH-2003, 321-324.