Evaluation of Block-Wise Parameter Generation for Statistical Parametric Speech Synthesis

Nobuyuki Nishizawa, Tomohiro Obara, Gen Hattori

We propose changing the unit of the input features from the conventionally used states to phonemes or moras in order to reduce the computational cost of deep neural networks (DNNs) with a hidden semi-Markov model structure for speech synthesis, which can model acoustic features and temporal structure in a unified framework. Neural networks with very deep and wide structures have recently been applied successfully to speech synthesis. However, such models have a very high computational cost, so they are not being used on platforms with limited resources. To address this problem, we lengthened the time span covered by each DNN input unit by using phoneme or mora units, which are longer than the conventional state units. Lengthening the input units reduces the number of DNN forward propagations required for synthesis, and thus the computational cost. In addition, because Japanese moras exhibit isochrony, a single neural network can represent duration more appropriately with mora units than with phoneme units, which cover consonants and vowels of different lengths. Experimental results indicate that, compared with speech synthesis based on a DNN with frame inputs, the proposed method reduces computational cost by 97% without degrading the naturalness of the synthesized speech.
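To make the forward-propagation argument concrete, the following minimal Python sketch counts how many DNN forward passes one utterance would need for each choice of input unit. It is an illustration only, not the authors' implementation: the `Mora` class, the example word, the frame durations, and the assumption of five states per phoneme are all hypothetical. It also counts passes only; the cost of a single pass may differ between unit types, so these counts are not expected to reproduce the reported 97% figure.

```python
from dataclasses import dataclass


@dataclass
class Mora:
    """One mora with its phonemes and hypothetical per-phoneme durations."""
    phonemes: tuple          # phoneme labels, e.g. ("s", "a")
    frames: tuple            # duration of each phoneme in frames (made up)


# Assumption: five states per phoneme. This value is illustrative and is
# not taken from the abstract.
STATES_PER_PHONEME = 5


def forward_pass_counts(moras):
    """Return the number of forward passes needed per input-unit type,
    assuming one forward pass per input unit."""
    n_moras = len(moras)
    n_phonemes = sum(len(m.phonemes) for m in moras)
    n_states = n_phonemes * STATES_PER_PHONEME
    n_frames = sum(sum(m.frames) for m in moras)
    return {
        "frame": n_frames,
        "state": n_states,
        "phoneme": n_phonemes,
        "mora": n_moras,
    }


# Hypothetical three-mora word "sa-ku-ra" with invented durations.
utterance = [
    Mora(("s", "a"), (12, 16)),
    Mora(("k", "u"), (10, 14)),
    Mora(("r", "a"), (8, 18)),
]

counts = forward_pass_counts(utterance)
for unit, n in counts.items():
    print(f"{unit:8s} {n:3d} forward passes")
```

Under these assumptions, the longer the input unit, the fewer passes are needed: frame inputs need one pass per frame, whereas mora inputs need one pass per mora.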

DOI: 10.21437/SSW.2019-31

Cite as: Nishizawa, N., Obara, T., Hattori, G. (2019) Evaluation of Block-Wise Parameter Generation for Statistical Parametric Speech Synthesis. Proc. 10th ISCA Speech Synthesis Workshop, 172-176, DOI: 10.21437/SSW.2019-31.

  author={Nobuyuki Nishizawa and Tomohiro Obara and Gen Hattori},
  title={{Evaluation of Block-Wise Parameter Generation for Statistical Parametric Speech Synthesis}},
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},