Enhancing Myanmar Speech Synthesis with Linguistic Information and LSTM-RNN

Aye Mya Hlaing, Win Pa Pa, Ye Kyaw Thu

Recently, the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) has become an attractive architecture for speech synthesis because of its ability to learn long-term dependencies. Contextual linguistic information is an important feature for naturalness in speech synthesis, and incorporating it into various speech synthesis models improves the quality of synthesized speech across languages. In this paper, LSTM-RNNs are applied to Myanmar speech synthesis, and the importance of contextual linguistic features and the effect of applying explicit tone information in different LSTM-RNN architectures are examined using our proposed Myanmar question set. Experiments with an LSTM-RNN and a hybrid system of DNN and LSTM-RNN, i.e., four feedforward hidden layers followed by two LSTM-RNN layers, were conducted on Myanmar speech synthesis and compared with a baseline DNN. Both objective and subjective evaluations show that the hybrid DNN and LSTM-RNN system produces more natural synthesized speech for the Myanmar language than the LSTM-RNN and baseline DNN systems.
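The abstract describes the hybrid acoustic model only at the topology level: four feedforward hidden layers followed by two LSTM-RNN layers, mapping contextual linguistic features to acoustic features. A minimal PyTorch sketch of such a topology is shown below; all layer widths, feature dimensions, and the activation choice are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class HybridDNNLSTM(nn.Module):
    """Sketch of the hybrid architecture named in the abstract:
    four feedforward hidden layers followed by two LSTM layers.
    All sizes here are assumed for illustration only."""

    def __init__(self, in_dim=400, ff_hidden=1024, lstm_hidden=512, out_dim=187):
        super().__init__()
        layers = []
        dim = in_dim
        for _ in range(4):  # four feedforward hidden layers
            layers += [nn.Linear(dim, ff_hidden), nn.Tanh()]
            dim = ff_hidden
        self.ff = nn.Sequential(*layers)
        # two stacked LSTM layers capture long-term dependencies
        self.lstm = nn.LSTM(ff_hidden, lstm_hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(lstm_hidden, out_dim)  # per-frame acoustic features

    def forward(self, x):
        # x: (batch, frames, in_dim) frame-level linguistic feature vectors
        h = self.ff(x)
        h, _ = self.lstm(h)
        return self.out(h)

model = HybridDNNLSTM()
y = model(torch.randn(2, 50, 400))  # 2 utterances, 50 frames each
print(tuple(y.shape))
```

In this layout the feedforward stack acts as a frame-level feature transformer, while the recurrent layers on top model temporal structure across frames, which is the motivation the abstract gives for preferring the hybrid over a pure DNN.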

DOI: 10.21437/SSW.2019-34

Cite as: Hlaing, A.M., Pa, W.P., Thu, Y.K. (2019) Enhancing Myanmar Speech Synthesis with Linguistic Information and LSTM-RNN. Proc. 10th ISCA Speech Synthesis Workshop, 189-193, DOI: 10.21437/SSW.2019-34.

@inproceedings{hlaing19_ssw,
  author={Aye Mya Hlaing and Win Pa Pa and Ye Kyaw Thu},
  title={{Enhancing Myanmar Speech Synthesis with Linguistic Information and LSTM-RNN}},
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  year={2019},
  pages={189--193},
  doi={10.21437/SSW.2019-34}
}