This paper presents a weighted multi-distribution deep belief network (wMD-DBN) for context-dependent statistical parametric speech synthesis. We previously proposed the MD-DBN for speech synthesis; it models the spectrum and fundamental frequency (F0) simultaneously, and it has demonstrated the potential to generate high-dimensional spectra with high quality and to produce natural synthesized speech. However, the model showed only mediocre performance on low-dimensional data, such as the F0 and the voiced/unvoiced (V/UV) flag, resulting in a vibrating pitch contour in the synthesized voice. To address this problem, this paper investigates the use of an extra weighting vector on the acoustic output layer of the MD-DBN. The weighting vector reduces the dimensional imbalance between the spectrum and pitch parameters by assigning different weighting coefficients to the spectrum, the F0 and the V/UV flag during training. Experimental results show that the wMD-DBN generates smoother pitch contours and improves the naturalness of the synthesized speech.
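The idea of per-stream weighting on the output layer can be illustrated with a minimal sketch. This is not the authors' formulation: the abstract does not state how the weighting vector enters the DBN training objective, so the sketch substitutes a plain weighted squared error as a stand-in. The function names, the stream dimensions and the coefficient values are all hypothetical and chosen only to show how a few low-dimensional outputs (F0, V/UV) can be given a larger share of the training signal relative to a many-dimensional spectrum.

```python
# Illustrative only: a weighted squared error stands in for the (unspecified)
# wMD-DBN training objective. All names and numbers here are hypothetical.

def build_weight_vector(dims, stream_weights):
    """Expand one coefficient per stream into one weight per output dimension."""
    weights = []
    for name, dim in dims.items():
        weights.extend([stream_weights[name]] * dim)
    return weights


def weighted_squared_error(target, predicted, weights):
    """Sum of squared errors, each output dimension scaled by its weight."""
    return sum(w * (t - p) ** 2 for w, t, p in zip(weights, target, predicted))


# Toy output layer: 4 spectral dimensions, 1 F0 value, 1 V/UV flag.
dims = {"spectrum": 4, "f0": 1, "vuv": 1}

# Hypothetical coefficients: up-weight the two one-dimensional streams.
stream_weights = {"spectrum": 1.0, "f0": 4.0, "vuv": 4.0}
uniform_weights = {"spectrum": 1.0, "f0": 1.0, "vuv": 1.0}

weights = build_weight_vector(dims, stream_weights)
uniform = build_weight_vector(dims, uniform_weights)

# A unit error on every output dimension.
target = [0.0] * 6
predicted = [1.0] * 6

unweighted_loss = weighted_squared_error(target, predicted, uniform)
weighted_loss = weighted_squared_error(target, predicted, weights)

# Share of the total loss contributed by the F0 dimension (index 4).
f0_share_unweighted = uniform[4] / unweighted_loss
f0_share_weighted = weights[4] / weighted_loss
```

With uniform weights, the single F0 dimension contributes one sixth of the loss in this toy setup; with the hypothetical coefficients above, it contributes one third, which is the kind of rebalancing the weighting vector is described as providing.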
Bibliographic reference. Kang, Shiyin / Meng, Helen (2014): "Statistical parametric speech synthesis using weighted multi-distribution deep belief network", In INTERSPEECH-2014, 1959-1963.