This paper presents a statistical parametric speech synthesis method that uses a hidden trajectory model (HTM) to flexibly control the formant positions and bandwidths of synthetic speech. In an HTM, hidden formant trajectories are generated by applying a bidirectional filtering process to time-aligned, phone-dependent formant targets. The observed cepstral features are composed of a formant-related component, which is predicted from the hidden formant trajectories using a nonlinear analytical function, and a residual component, which is modeled by context-dependent Gaussians. In this paper, we apply HTM-based acoustic modeling to speech synthesis. The distribution parameters of the formant targets are manipulated at synthesis time to control the characteristics of the synthetic speech. In our implementation, the distributions of the residual cepstra are estimated for each quinphone, and the question set used in decision-tree-based model clustering is tailored to achieve high controllability for vowels. Experimental results show that the proposed method achieves effective control over formant positions and bandwidths while maintaining almost the same naturalness as the conventional HMM-based approach.
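The bidirectional filtering step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a symmetric exponential kernel with a single fixed coefficient `gamma`, a hypothetical window `half_width`, renormalization of the kernel at utterance edges, and made-up F1 target values. It only shows how stepwise per-phone targets turn into a smooth trajectory, and how changing a target at synthesis time moves the trajectory.

```python
def bidirectional_filter(targets, gamma=0.6, half_width=7):
    """Smooth a per-frame target sequence with a symmetric exponential kernel.

    Assumptions (not taken from the paper): one fixed gamma for all phones,
    a finite window of +/- half_width frames, and weights renormalized over
    the frames that actually exist near the sequence edges.
    """
    n = len(targets)
    trajectory = []
    for k in range(n):
        weighted_sum = 0.0
        weight_total = 0.0
        for tau in range(-half_width, half_width + 1):
            j = k + tau
            if 0 <= j < n:
                w = gamma ** abs(tau)  # same decay looking backward and forward
                weighted_sum += w * targets[j]
                weight_total += w
        trajectory.append(weighted_sum / weight_total)
    return trajectory


# Hypothetical F1 targets in Hz: 10 frames of one vowel, then 10 of another.
f1_targets = [300.0] * 10 + [700.0] * 10
f1_trajectory = bidirectional_filter(f1_targets)

# Raising the second phone's target shifts the trajectory in that region.
shifted_trajectory = bidirectional_filter([300.0] * 10 + [800.0] * 10)
```

Because the kernel looks both backward and forward in time, the trajectory starts moving toward the next phone's target before the phone boundary, which is a simple way to picture coarticulation in this kind of model.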
Bibliographic reference. Cai, Ming-Qi / Ling, Zhen-Hua / Dai, Li-Rong (2014): "Formant-controlled speech synthesis using hidden trajectory model", In INTERSPEECH-2014, 1529-1533.