We describe a corpus-based approach to improving synthesized speech quality and present two cost functions for unit selection. The first is a pitch-synchronous cross-correlation measure, used as a concatenation cost, which reduces the noise caused by phase mismatch at concatenation points. The second is a discontinuous cost function, applied to both internal and concatenation costs, which eliminates unnecessary cost calculations. An evaluation showed that incorporating the pitch-synchronous cross-correlation cost performed better than using a conventional cost function. In addition, an opinion test assessing the naturalness of the synthesized speech indicated that the proposed method scored 0.7 points higher on a seven-point MOS (Mean Opinion Score) scale than the conventional system. This paper also discusses other improvements in the performance of text-to-speech systems. In this session, we will demonstrate our Japanese text-to-speech system.
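To make the idea of a pitch-synchronous cross-correlation cost concrete, the following is a minimal illustrative sketch, not the formulation used in the paper. It assumes that each candidate unit carries pitch marks, that one pitch period of waveform is taken starting at the pitch mark nearest the join in each unit, and that the cost is one minus the normalized cross-correlation of those two frames. The function names, the frame definition, and the 0 to 2 cost range are all assumptions made for illustration.

```python
import math


def normalized_cross_correlation(a, b):
    """Zero-lag normalized cross-correlation of two equal-rate frames."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    dot = sum(x * y for x, y in zip(a, b))
    energy_a = sum(x * x for x in a)
    energy_b = sum(y * y for y in b)
    if energy_a == 0.0 or energy_b == 0.0:
        return 0.0  # silent frame: treat as uncorrelated
    return dot / math.sqrt(energy_a * energy_b)


def phase_mismatch_cost(left, left_mark, right, right_mark, period):
    """Hypothetical concatenation cost (assumption, not the paper's formula).

    Takes one pitch period starting at a pitch mark in each unit and
    returns 1 - correlation: near 0 when the frames are in phase,
    near 2 when they are in opposite phase.
    """
    frame_left = left[left_mark:left_mark + period]
    frame_right = right[right_mark:right_mark + period]
    return 1.0 - normalized_cross_correlation(frame_left, frame_right)


# Toy signal: a sine wave with a 40-sample pitch period.
period = 40
tone = [math.sin(2 * math.pi * n / period) for n in range(200)]

# Frames taken a whole number of periods apart are in phase.
in_phase = phase_mismatch_cost(tone, 40, tone, 80, period)
# Frames offset by an extra half period are in opposite phase.
out_of_phase = phase_mismatch_cost(tone, 40, tone, 100, period)
```

In this toy case the in-phase join yields a cost close to 0 and the half-period offset yields a cost close to 2, which is the kind of separation a unit-selection search could use to penalize joins likely to produce phase-mismatch noise.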
Cite as: Nukaga, N., Kamoshida, R., Nagamatsu, K. (2004) Unit selection using pitch synchronous cross correlation for Japanese concatenative speech synthesis. Proc. 5th ISCA Workshop on Speech Synthesis (SSW 5), 43-48
@inproceedings{nukaga04_ssw,
  author={Nobuo Nukaga and Ryota Kamoshida and Kenji Nagamatsu},
  title={{Unit selection using pitch synchronous cross correlation for Japanese concatenative speech synthesis}},
  year=2004,
  booktitle={Proc. 5th ISCA Workshop on Speech Synthesis (SSW 5)},
  pages={43--48}
}