Speech Prosody 2004
A simplified four-layer architecture based on a recurrent neural network (RNN) is introduced to generate prosodic information for improving naturalness in Persian text-to-speech (TTS) systems. The proposed RNN uses its first two layers at the word level and its last two layers at the syllable level to provide the TTS system with major prosodic parameters, including pitch contour, energy contour, syllable length, vowel length and onset time, and pause duration. Experimental results show improved accuracy in predicting the prosodic parameters compared with similar prosody generation systems of higher complexity.
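The two-level layout described in the abstract can be illustrated with a minimal sketch. The abstract does not give the details, so everything specific here is an assumption made for illustration: the class and function names (ProsodySketch, RecurrentLayer, affine_tanh), the choice of which layers are recurrent, the layer sizes, the tanh activations, the untrained random weights, and the reduction of each prosodic parameter to a single normalized value per syllable. It is not the authors' implementation; it only shows how word-level layers could feed context into syllable-level layers that emit prosodic values.

```python
import math
import random

# Hypothetical output names, one normalized value per syllable (an assumption;
# the paper's actual output encoding is not given in the abstract).
PROSODIC_OUTPUTS = ["pitch", "energy", "syllable_length",
                    "vowel_length", "vowel_onset", "pause_duration"]

def make_matrix(rows, cols, rng):
    """Small random weights; a real system would learn these from data."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def affine_tanh(weights, inputs):
    """One dense layer without bias: tanh(W @ inputs)."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]

class RecurrentLayer:
    """Elman-style layer: the previous hidden state is fed back as extra input."""
    def __init__(self, n_in, n_hidden, rng):
        self.weights = make_matrix(n_hidden, n_in + n_hidden, rng)
        self.state = [0.0] * n_hidden

    def step(self, inputs):
        self.state = affine_tanh(self.weights, inputs + self.state)
        return self.state

class ProsodySketch:
    """Layers 1-2 run once per word, layers 3-4 run once per syllable."""
    def __init__(self, n_word_feat, n_syl_feat, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.word_rnn = RecurrentLayer(n_word_feat, n_hidden, rng)           # layer 1
        self.word_dense = make_matrix(n_hidden, n_hidden, rng)               # layer 2
        self.syl_rnn = RecurrentLayer(n_syl_feat + n_hidden, n_hidden, rng)  # layer 3
        self.out = make_matrix(len(PROSODIC_OUTPUTS), n_hidden, rng)         # layer 4

    def predict(self, words):
        """words: list of (word_features, [syllable_features, ...]) pairs."""
        results = []
        for word_feat, syllables in words:
            # Word-level context is computed once and shared by the word's syllables.
            word_ctx = affine_tanh(self.word_dense, self.word_rnn.step(word_feat))
            for syl_feat in syllables:
                hidden = self.syl_rnn.step(syl_feat + word_ctx)
                values = affine_tanh(self.out, hidden)
                results.append(dict(zip(PROSODIC_OUTPUTS, values)))
        return results

# Toy utterance: two words with made-up feature vectors (2 syllables, then 1).
utterance = [([1.0, 0.0, 0.5], [[0.2, 0.1], [0.4, 0.3]]),
             ([0.0, 1.0, 0.2], [[0.9, 0.5]])]
frames = ProsodySketch(n_word_feat=3, n_syl_feat=2).predict(utterance)
```

Here frames holds one dictionary per syllable, keyed by the hypothetical output names. Because the weights are random and untrained, the values carry no linguistic meaning; the sketch only makes the word-to-syllable data flow concrete.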
Bibliographic reference. Farrokhi, Ali / Ghaemmaghami, Shahrokh / Sheikhan, Mansur (2004): "Estimation of prosodic information for Persian text-to-speech system using a recurrent neural network", In SP-2004, 475-478.