Speech Prosody 2004
This paper presents an adaptable acoustic architecture for a multilingual TTS system. The whole architecture is designed to be data-driven. The modules for text preprocessing, grapheme-to-phoneme conversion, lexical stress detection, OOV handling, symbolic prosody prediction, acoustic prosody prediction, and unit selection with concatenation use machine learning techniques, especially neural networks (NNs), or language-independent routines. The adaptable and scalable architecture of the acoustic prosody generation module consists of four sub-modules. Duration control uses an NN based on a modified causal error correction architecture (CRCECNN), while f0 generation uses an MLP NN. In both NN models, a partial weight decay (p-WD) method is applied to optimize the input vector dimension of each NN. In contrast to standard weight decay, the p-WD method helps to select one feature from a group of highly correlated features; through its penalty function we thus obtained a minimized input feature set. The third sub-module, which reuses the predictions of the optimized NNs, establishes a hybrid architecture: unit selection based on syllable prosody parameter criteria combines prosody selection with unit selection. Working with a limited database makes a post-processing unit necessary. We emphasize the problem of finding optimal speech segments and describe a segment selection approach that uses a global parameterized non-linear suitability function in combination with a modified multi-level Viterbi search algorithm. Preliminary acoustic ratings of the TTS system adapted to Slovenian are presented.
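To make the segment selection step more concrete, the following is a minimal sketch of a plain, single-level Viterbi search over candidate units. It is not the modified multi-level algorithm of the paper, and the non-linear suitability function is replaced here by two simple, assumed cost functions. All names (`viterbi_select`, `target_cost`, `join_cost`) and the representation of each candidate by a single f0 value are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def viterbi_select(
    candidates: Sequence[Sequence[float]],
    targets: Sequence[float],
    target_cost: Callable[[float, float], float],
    join_cost: Callable[[float, float], float],
) -> Tuple[List[int], float]:
    """Pick one candidate per position so that the summed cost is minimal.

    candidates[i] holds the candidate units for position i (assumed non-empty),
    each reduced to one number (e.g. an f0 value) for this illustration.
    Returns the chosen candidate index per position and the total cost.
    """
    n = len(candidates)
    if n == 0:
        return [], 0.0

    # best[i][j]: minimal cost of any path that ends in candidate j at position i
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    # back[i][j]: index of the predecessor candidate on that minimal path
    back = [[-1] * len(candidates[0])]

    for i in range(1, n):
        prev = candidates[i - 1]
        row_cost, row_back = [], []
        for c in candidates[i]:
            # Find the predecessor with the cheapest accumulated + join cost.
            k_best = min(
                range(len(prev)),
                key=lambda k: best[i - 1][k] + join_cost(prev[k], c),
            )
            cost = (
                best[i - 1][k_best]
                + join_cost(prev[k_best], c)
                + target_cost(targets[i], c)
            )
            row_cost.append(cost)
            row_back.append(k_best)
        best.append(row_cost)
        back.append(row_back)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k])
    total = best[-1][j]
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path, total


if __name__ == "__main__":
    # Assumed toy data: predicted f0 targets and candidate f0 values per syllable.
    targets = [100.0, 110.0, 120.0]
    candidates = [[90.0, 102.0], [130.0, 108.0], [119.0, 150.0]]
    path, total = viterbi_select(
        candidates,
        targets,
        target_cost=lambda t, c: abs(t - c),
        join_cost=lambda a, b: 0.1 * abs(a - b),
    )
    print(path, round(total, 3))
```

In this sketch the target cost plays the role of matching the NN-predicted prosody, and the join cost penalizes discontinuities between neighbouring segments; a real system would use richer unit descriptions and cost terms.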
Bibliographic reference. Erdem, Caglayan / Stergar, Janez / Horvat, Bogomir (2004): "An adaptable acoustic architecture in a multilingual TTS system", In SP-2004, 537-540.