In statistical parametric speech synthesis (SPSS) systems built around a high-quality vocoder, acoustic features such as mel-cepstral coefficients and F0 are predicted from linguistic features so that the vocoder can generate speech waveforms. However, the generated speech generally suffers from quality degradation, such as buzziness, caused by the vocoder itself. Although several approaches, such as improved excitation models, have been investigated to alleviate this problem, it is difficult to avoid entirely as long as the SPSS system relies on a vocoder. To overcome this problem, there have recently been attempts to model waveform samples directly; these demonstrate superior performance, but computation time and latency remain issues. With the aim of constructing another type of DNN-based speech synthesizer that requires neither a vocoder nor excessive computation, we investigated direct modeling of frequency spectra and waveform generation based on phase recovery. In this framework, STFT spectral amplitudes that include harmonic information derived from F0 are predicted directly by a DNN-based acoustic model, and Griffin and Lim's approach is used to recover phase and generate waveforms. Experimental results showed that the proposed system synthesized speech without buzziness and outperformed a conventional vocoder-based system.
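The phase-recovery step can be illustrated with a minimal sketch of Griffin and Lim's iterative algorithm, assuming an amplitude spectrogram `magnitude` predicted by the acoustic model; the STFT settings (`n_fft`, `hop_length`, iteration count) below are illustrative placeholders, not the configuration used in the paper. The loop alternately resynthesizes a waveform from the current complex spectrogram and re-analyzes it, keeping the predicted amplitudes and updating only the phase.

```python
import numpy as np
import librosa


def griffin_lim(magnitude, n_iter=100, n_fft=1024, hop_length=256):
    """Recover a waveform from an STFT amplitude spectrogram via Griffin-Lim.

    `magnitude` has shape (1 + n_fft // 2, n_frames), e.g. the output of a
    DNN-based acoustic model. All parameter values are illustrative.
    """
    rng = np.random.default_rng(0)
    # Start from a random phase estimate.
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Synthesize a waveform from the current complex spectrogram estimate...
        y = librosa.istft(magnitude * angles, hop_length=hop_length)
        # ...then re-analyze it and keep only the phase, discarding the amplitude.
        spec = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(spec))
    return librosa.istft(magnitude * angles, hop_length=hop_length)
```

For convenience, `librosa.griffinlim(magnitude, n_iter=100, hop_length=256)` implements the same iteration directly.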
Cite as: Takaki, S., Kameoka, H., Yamagishi, J. (2017) Direct Modeling of Frequency Spectra and Waveform Generation Based on Phase Recovery for DNN-Based Speech Synthesis. Proc. Interspeech 2017, 1128-1132, doi: 10.21437/Interspeech.2017-488
@inproceedings{takaki17_interspeech,
  author={Shinji Takaki and Hirokazu Kameoka and Junichi Yamagishi},
  title={{Direct Modeling of Frequency Spectra and Waveform Generation Based on Phase Recovery for DNN-Based Speech Synthesis}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1128--1132},
  doi={10.21437/Interspeech.2017-488}
}