In this paper, we propose a framework for speech synthesis that takes both periodic and aperiodic inputs. Recently, WaveNet [1], a method that models speech waveforms directly, was proposed. WaveNet models speech waveforms accurately and can generate natural speech directly, so it has been widely used, particularly as a speech vocoder [2], in various studies [3, 4, 5]. However, its autoregressive structure generates each speech sample from the sequence of past samples, so synthesis cannot be parallelized and consequently real-time synthesis is not possible. WaveNet also uses pitch information as an auxiliary feature, so it cannot generate waveforms whose pitch lies outside the range of the training data [6], and even when a pitch within that range is specified, a waveform with a different pitch may be generated. To address these issues, we propose a method that uses periodic and aperiodic input signals to generate the whole speech sample sequence at once. The proposed method can generate speech faster than real time and can generate speech waveforms with pitch outside the range of the training data. A subjective evaluation of naturalness indicated that the proposed method synthesizes speech of higher quality than WaveNet.
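To make the idea of periodic and aperiodic inputs concrete, the following is a minimal sketch, under assumptions the abstract does not confirm: the periodic input is taken to be a sine wave following a frame-level F0 contour and the aperiodic input is taken to be white Gaussian noise, both at the waveform sample rate so a non-autoregressive network could map them to all output samples in one parallel pass. The function name and parameters are hypothetical, not from the paper.

    # Sketch only: assumed construction of periodic/aperiodic input signals.
    import numpy as np

    def make_inputs(f0, sr=16000, hop=80):
        """f0: frame-level F0 in Hz (0 = unvoiced).
        Returns (periodic, aperiodic) signals at sample rate sr."""
        f0_up = np.repeat(f0, hop)                 # upsample F0 to sample level
        phase = 2 * np.pi * np.cumsum(f0_up) / sr  # running phase of the sine
        periodic = np.where(f0_up > 0, np.sin(phase), 0.0)  # zero when unvoiced
        aperiodic = np.random.randn(len(f0_up))    # white Gaussian noise
        return periodic, aperiodic

    # Example: 100 frames of a 220 Hz tone; both signals could be stacked as
    # channels and fed to a (hypothetical) feed-forward vocoder network,
    # which is what permits faster-than-real-time, parallel generation.
    per, aper = make_inputs(np.full(100, 220.0))

Because the inputs carry the pitch explicitly as a signal rather than as an auxiliary feature, sweeping the F0 contour outside the training range still yields a sine of the requested frequency, which is consistent with the extrapolation behavior the abstract claims.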
Cite as: Oura, K., Nakamura, K., Hashimoto, K., Nankaku, Y., Tokuda, K. (2019) Deep neural network based real-time speech vocoder with periodic and aperiodic inputs. Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10), 13–18, doi: 10.21437/SSW.2019-3
@inproceedings{oura19_ssw,
  author={Keiichiro Oura and Kazuhiro Nakamura and Kei Hashimoto and Yoshihiko Nankaku and Keiichi Tokuda},
  title={{Deep neural network based real-time speech vocoder with periodic and aperiodic inputs}},
  year=2019,
  booktitle={Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10)},
  pages={13--18},
  doi={10.21437/SSW.2019-3}
}