Research toward expressive text-to-speech synthesis addresses both the processing of input text and the rendering of natural expressive speech. Focusing on the former as a front-end task in the production of synthetic speech, this paper investigates a novel feature for predicting phrase boundary tone labels, which transcribe the local fundamental frequency (F0) changes that frequently appear at phrase-final positions in expressive speech. To this end, we examined distribution-based semantic features consisting of i) the surface strings of the words in a phrase, ii) their part-of-speech tags, and iii) the presence or absence of a pause at the final position of the phrase. These differ from conventional numerically-expressed stylistic features such as the position, length, and distance of the phrase. Through experiments on Japanese expressive speech, specifically conversational speech and advertisement speech, we confirmed that the proposed features attain performance equal to or better than that of the conventional features. These results suggest that the distribution-based semantic features may be useful for predicting phrase boundary rise labels in conversational speech, and may be as useful as the conventional numerically-expressed stylistic features for advertisement speech.
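To make the contrast between the two feature families concrete, the following is a minimal sketch and not the authors' implementation. It assumes the distribution-based features can be approximated as a bag of counts over word surface strings, part-of-speech tags, and a final-pause flag; the paper's exact representation is not given in the abstract. The function names, feature keys, and the sample phrase are all hypothetical.

```python
from collections import Counter


def distribution_features(words, pos_tags, pause_at_end):
    """Bag-of-counts view of one phrase (an assumed approximation).

    Counts i) word surface strings, ii) part-of-speech tags, and
    iii) a single flag for whether a pause follows the phrase.
    """
    feats = Counter()
    for word in words:
        feats[f"word={word}"] += 1
    for tag in pos_tags:
        feats[f"pos={tag}"] += 1
    feats[f"pause={'yes' if pause_at_end else 'no'}"] += 1
    return feats


def stylistic_features(phrase_index, n_phrases, words):
    """Numerically-expressed features: position and length of the phrase.

    The specific measurements chosen here are illustrative assumptions.
    """
    return {
        "position_from_start": phrase_index,
        "position_from_end": n_phrases - 1 - phrase_index,
        "length_in_words": len(words),
    }


# Hypothetical example phrase (not taken from the paper's data).
words = ["そう", "です", "ね"]
pos_tags = ["adverb", "auxiliary", "particle"]

semantic = distribution_features(words, pos_tags, pause_at_end=True)
stylistic = stylistic_features(phrase_index=0, n_phrases=3, words=words)
```

Either dictionary could then be vectorized and passed to a classifier that predicts the boundary label for the phrase; the choice of classifier is outside what the abstract specifies.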
Bibliographic reference. Nakajima, Hideharu / Mizuno, Hideyuki / Yoshioka, Osamu / Takahashi, Satoshi (2013): "Which resemblance is useful to predict phrase boundary rise labels for Japanese expressive text-to-speech synthesis, numerically-expressed stylistic or distribution-based semantic?", In INTERSPEECH-2013, 1047-1051.