Speech Prosody 2004

Nara, Japan
March 23-26, 2004

Estimation of Prosodic Information for Persian Text-To-Speech System Using a Recurrent Neural Network

Ali Farrokhi (1), Shahrokh Ghaemmaghami (2), Mansur Sheikhan (1)

(1) Azad University, South Tehran Branch; (2) Sharif University of Technology, Tehran, Iran

A simplified four-layer recurrent neural network (RNN) architecture is introduced to generate prosodic information that improves the naturalness of Persian text-to-speech (TTS) systems. The first two layers of the proposed RNN operate at the word level and the last two at the syllable level. Together they supply the TTS system with its major prosodic parameters: pitch contour, energy contour, syllable length, vowel length and onset time, and pause duration. Experimental results show improved accuracy in predicting these prosodic parameters compared with similar prosody generation systems of higher complexity.
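The word-level and syllable-level split described above can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say which layers are recurrent, how large they are, or what the input features are, so everything below is an assumption made for illustration: the class and function names (`ProsodyNet`, `predict_prosody`), the Elman-style recurrence in layers 1 and 3, the layer sizes, the random untrained weights, and the reduction of each prosodic parameter to a single normalized number per syllable.

```python
import math
import random

# The six prosodic parameters named in the abstract. The identifiers are
# hypothetical, and each is reduced to one scalar per syllable for brevity.
PROSODIC_OUTPUTS = [
    "pitch_contour", "energy_contour", "syllable_length",
    "vowel_length", "vowel_onset_time", "pause_duration",
]


def make_matrix(rows, cols, rng):
    """Random weights; a real system would learn these from speech data."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def dense(weights, inputs):
    """Matrix-vector product: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


class RecurrentLayer:
    """Elman-style layer (an assumption): h_t = tanh(W [x_t ; h_{t-1}])."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        self.weights = make_matrix(n_hidden, n_in + n_hidden, rng)
        self.state = [0.0] * n_hidden

    def reset(self):
        self.state = [0.0] * self.n_hidden

    def step(self, inputs):
        self.state = [math.tanh(v) for v in dense(self.weights, inputs + self.state)]
        return self.state


class FeedForwardLayer:
    """Stateless layer: y = tanh(W x)."""

    def __init__(self, n_in, n_out, rng):
        self.weights = make_matrix(n_out, n_in, rng)

    def step(self, inputs):
        return [math.tanh(v) for v in dense(self.weights, inputs)]


class ProsodyNet:
    """Four layers: 1 and 2 run once per word, 3 and 4 once per syllable."""

    def __init__(self, n_word_feats, n_syl_feats, n_hidden=4, n_context=3, seed=0):
        rng = random.Random(seed)
        self.layer1 = RecurrentLayer(n_word_feats, n_hidden, rng)
        self.layer2 = FeedForwardLayer(n_hidden, n_context, rng)
        self.layer3 = RecurrentLayer(n_syl_feats + n_context, n_hidden, rng)
        self.layer4 = FeedForwardLayer(n_hidden, len(PROSODIC_OUTPUTS), rng)

    def reset(self):
        self.layer1.reset()
        self.layer3.reset()


def predict_prosody(words, net):
    """words: list of (word_features, [syllable_features, ...]) pairs."""
    net.reset()
    results = []
    for word_feats, syllables in words:
        # Word-level layers summarize the current word and its left context.
        word_ctx = net.layer2.step(net.layer1.step(word_feats))
        for syl_feats in syllables:
            # Syllable-level layers see the syllable plus the word summary.
            hidden = net.layer3.step(syl_feats + word_ctx)
            results.append(dict(zip(PROSODIC_OUTPUTS, net.layer4.step(hidden))))
    return results


# Toy input: two words with two and one syllables; feature values are made up.
net = ProsodyNet(n_word_feats=3, n_syl_feats=2)
utterance = [
    ([1.0, 0.0, 0.5], [[0.2, 0.1], [0.7, 0.3]]),
    ([0.0, 1.0, 0.2], [[0.4, 0.9]]),
]
demo = predict_prosody(utterance, net)
```

The sketch returns one dictionary of six values per syllable. Because the word-level layers run only once per word, their output is reused for every syllable in that word, which is one plausible reading of how a two-level design keeps the network small.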


Bibliographic reference. Farrokhi, Ali / Ghaemmaghami, Shahrokh / Sheikhan, Mansur (2004): "Estimation of prosodic information for Persian text-to-speech system using a recurrent neural network", in SP-2004, 475-478.