Deep neural network based real-time speech vocoder with periodic and aperiodic inputs

Keiichiro Oura, Kazuhiro Nakamura, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda


In this paper, we propose a framework for speech synthesis taking both periodic and aperiodic inputs. Recently, a method of modeling speech waveforms directly, called WaveNet [1], was proposed. WaveNet models speech waveforms accurately and can generate natural speech directly, so it is widely used, particularly as a speech vocoder [2], in various studies [3, 4, 5]. However, it has an autoregressive structure that generates each speech sample from the sequence of past samples, so parallel computation cannot be used at synthesis time, and consequently real-time synthesis is not possible. It also uses pitch information as an auxiliary feature, so it is unable to generate waveforms with a pitch outside the range of the training data [6], and even if a pitch within that range is specified, a waveform with a different pitch may be generated. To address these issues, we propose a method that uses periodic and aperiodic input signals to generate the entire speech sample sequence at once. With the proposed method, speech can be generated faster than real time, and speech waveforms with pitch outside the range of the training data can be generated. We also conducted a subjective evaluation of naturalness, which indicated that the proposed method yields better synthesized speech quality than WaveNet.


DOI: 10.21437/SSW.2019-3

Cite as: Oura, K., Nakamura, K., Hashimoto, K., Nankaku, Y., Tokuda, K. (2019) Deep neural network based real-time speech vocoder with periodic and aperiodic inputs. Proc. 10th ISCA Speech Synthesis Workshop, 13-18, DOI: 10.21437/SSW.2019-3.


@inproceedings{Oura2019,
  author={Keiichiro Oura and Kazuhiro Nakamura and Kei Hashimoto and Yoshihiko Nankaku and Keiichi Tokuda},
  title={{Deep neural network based real-time speech vocoder with periodic and aperiodic inputs}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={13--18},
  doi={10.21437/SSW.2019-3},
  url={http://dx.doi.org/10.21437/SSW.2019-3}
}