We propose a simple new representation for the FFT spectrum tailored
to statistical parametric speech synthesis. It consists of four feature
streams that describe magnitude, phase and fundamental frequency using
real numbers. The proposed feature extraction method does not attempt
to decompose the speech structure (e.g., into source+filter or harmonics+noise).
By avoiding the simplifications inherent in decomposition, we can dramatically
reduce the “phasiness” and “buzziness” typical
of most vocoders. The method uses simple, computationally cheap
operations and can run at a frame rate lower than the 200 frames per second
typical of many systems. It avoids heuristics and methods requiring
approximate or iterative solutions, including phase unwrapping.
Two DNN-based acoustic models were built, one from male and one from
female speech data, using the Merlin toolkit. Subjective comparisons
were made against a state-of-the-art baseline that uses the STRAIGHT vocoder.
In all variants tested, and for both male and female voices, the proposed
method substantially outperformed the baseline. We provide source code
to enable our complete system to be replicated.
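
As an illustration of how magnitude and phase can be described with real numbers without phase unwrapping, the following Python sketch splits one frame's FFT into a magnitude stream and two bounded phase streams (the cosine and sine of the phase). This is an assumption made for illustration, not necessarily the feature definition of the proposed method: the function names, the Hann window, the 400-sample frame and the `eps` guard are all hypothetical, and the fundamental frequency stream is omitted.

```python
import numpy as np


def frame_to_streams(frame, eps=1e-12):
    """Split one frame's FFT into real-valued magnitude and phase streams.

    Illustrative only: the window choice and eps guard are assumptions.
    """
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    magnitude = np.abs(spectrum)
    # Represent phase as a unit vector (cos, sin). Both values are bounded
    # in [-1, 1], so no phase unwrapping is required.
    cos_phase = spectrum.real / (magnitude + eps)
    sin_phase = spectrum.imag / (magnitude + eps)
    return magnitude, cos_phase, sin_phase


def streams_to_spectrum(magnitude, cos_phase, sin_phase):
    """Rebuild the complex spectrum from the three real-valued streams."""
    phase = np.arctan2(sin_phase, cos_phase)
    return magnitude * np.exp(1j * phase)


# Round-trip check on a synthetic frame (random noise stands in for speech).
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
mag, cos_p, sin_p = frame_to_streams(frame)
rebuilt = streams_to_spectrum(mag, cos_p, sin_p)
original = np.fft.rfft(frame * np.hanning(len(frame)))
max_error = np.max(np.abs(rebuilt - original))
```

Because `arctan2` recovers the phase directly from the two bounded streams, no unwrapping step is involved, and `max_error` stays at the level of floating-point rounding.
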
Cite as: Espic, F., Botinhao, C.V., King, S. (2017) Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis. Proc. Interspeech 2017, 1383-1387, doi: 10.21437/Interspeech.2017-1647
@inproceedings{espic17_interspeech,
  author={Felipe Espic and Cassia Valentini Botinhao and Simon King},
  title={{Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1383--1387},
  doi={10.21437/Interspeech.2017-1647}
}