Towards expressive prosody generation in TTS for reading aloud applications

Monica Dominguez, Alicia Burga, Mireia Farrús, Leo Wanner

Conversational interfaces involving text-to-speech (TTS) applications have improved in expressiveness and overall naturalness to a reasonable extent over recent decades. Conversational features, such as speech acts, affective states and information structure, have been instrumental in deriving more expressive prosodic contours. However, synthetic speech is still perceived as monotonous when a text that lacks those conversational features is read aloud in the interface, i.e., when it is fed directly to the TTS application. In this paper, we propose a methodology for pre-processing raw texts before they reach the TTS application. The aim is to analyze the syntactic and information (or communicative) structure of the text, and then use the high-level linguistic features derived from the analysis to generate more expressive prosody in the synthesized speech. The proposed methodology encompasses a pipeline of four modules: (1) a tokenizer, (2) a syntactic parser, (3) a communicative parser, and (4) an SSML prosody tag converter. The implementation has been tested in an experimental setting for German, using web-retrieved articles. Perception tests show a considerable improvement in the expressiveness of the synthesized speech when prosody is enriched automatically, taking the communicative structure into account.
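To make the last stage of the pipeline more concrete, the following is a minimal sketch of what an SSML prosody tag converter could look like. It is not the authors' implementation: the `Span` type, the `theme`/`rheme` labels, and the pitch and rate values in `PROSODY` are all hypothetical and chosen only for illustration. The sketch assumes that the communicative parser has already split a sentence into labeled spans, and it uses only the standard SSML `<speak>` and `<prosody>` elements.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Span:
    """A stretch of text with a communicative label (hypothetical type)."""
    text: str
    role: str  # e.g. "theme" or "rheme"; labels are assumptions


# Hypothetical mapping from communicative role to SSML prosody attributes.
# The values are placeholders, not figures taken from the paper.
PROSODY = {
    "theme": {"pitch": "+5%", "rate": "medium"},
    "rheme": {"pitch": "-5%", "rate": "95%"},
}


def to_ssml(spans):
    """Wrap each labeled span in a <prosody> element; leave others plain."""
    parts = []
    for span in spans:
        text = escape(span.text)  # escape &, <, > for XML
        attrs = PROSODY.get(span.role)
        if attrs is None:
            parts.append(text)
        else:
            attr_str = " ".join(f'{k}="{v}"' for k, v in attrs.items())
            parts.append(f"<prosody {attr_str}>{text}</prosody>")
    return "<speak>" + " ".join(parts) + "</speak>"


if __name__ == "__main__":
    # Hand-labeled German example standing in for the parser output.
    sentence = [Span("Der Zug", "theme"), Span("fährt heute nicht", "rheme")]
    print(to_ssml(sentence))
```

Keeping the role-to-prosody mapping in a single table means the prosodic settings can be tuned without touching the conversion logic, which would be convenient when running perception tests with different configurations.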

DOI: 10.21437/IberSPEECH.2018-9

Cite as: Dominguez, M., Burga, A., Farrús, M., Wanner, L. (2018) Towards expressive prosody generation in TTS for reading aloud applications. Proc. IberSPEECH 2018, 40-44, DOI: 10.21437/IberSPEECH.2018-9.

