15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Towards the Adaptation of Prosodic Models for Expressive Text-to-Speech Synthesis

Mathieu Avanzi (1), George Christodoulides (2), Damien Lolive (3), Elisabeth Delais-Roussarie (1), Nelly Barbot (3)

(1) LLF (UMR 7110), France
(2) Université catholique de Louvain, Belgium
(3) IRISA, France

This paper presents a preliminary study that aims to characterize four distinct speaking styles using a limited set of prosodic features: the length of prosodic phrases (AP and IP), the distribution of stressed syllables, pitch register span, and the duration of silent pauses, among others. The analysis was performed using semi-automatic procedures on a corpus of 30 minutes of speech per style. All four styles are “overtly addressed to a given audience”, but they differ in the nature of the audience (adults vs. children) and in the desired impact of the address (“importance of being understood and convincing, or not”). Data analysis reveals that (a) dictation (addressed to children) and political speeches (addressed to adults) differ from the two other speaking styles (reading of novels and fairy tales) with respect to a specific set of prosodic cues, and (b) speech addressed to children differs from speech addressed to adults with respect to another set of prosodic cues (especially pitch register span). These results have a practical application: refining the design of pre-processing prosodic modules in a text-to-speech system in order to improve the expressivity of synthesized speech.

