EUROSPEECH 2003 - INTERSPEECH 2003
In this paper, we describe a perceptual voice recognition method for improving the naturalness of synthesized speech in a baseline Mandarin Chinese text-to-speech (TTS) system. In a large TTS speech corpus, the acoustic properties of the speech data vary with the recording conditions, and these differences can ultimately affect the naturalness of the synthesized speech. To address this, we separate the speech data in a TTS corpus into several voice classes using an iterative voice recognition method that resembles speaker recognition. Within each class, speech units are assumed to share the same voice characteristics. Based on the voice recognition result, a novel unit selection algorithm selects better units in order to synthesize more natural-sounding speech. A preliminary experiment demonstrates the feasibility and validity of the method.
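The abstract does not specify the classifier, the features, or the cost function, so the following Python sketch is only an illustration of the general idea under stated assumptions. It assumes each unit is described by a small numeric feature vector, uses a k-means-style loop as a stand-in for the iterative voice recognition step, and adds a hypothetical `switch_penalty` to a dynamic-programming unit search so that consecutive units from different voice classes are discouraged. All function names, parameters, and data below are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical candidate layout: (unit id, voice class, target cost).
Candidate = Tuple[str, int, float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_voice_classes(features: List[Vector], num_classes: int,
                         iterations: int = 10) -> List[int]:
    """Iteratively label units with a voice class (k-means-style stand-in)."""
    # Deterministic start: evenly spaced units act as the initial class models.
    step = max(1, len(features) // num_classes)
    centroids = [list(features[min(i * step, len(features) - 1)])
                 for i in range(num_classes)]
    labels = [0] * len(features)
    for _ in range(iterations):
        # Recognition step: assign every unit to its closest class model.
        new_labels = [
            min(range(num_classes),
                key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # Re-estimation step: rebuild each class model from its members.
        for c in range(num_classes):
            members = [f for f, l in zip(features, new_labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


def select_units(candidates: List[List[Candidate]],
                 switch_penalty: float = 1.0) -> List[str]:
    """Pick one candidate per position, penalising voice-class changes.

    Assumes every position has at least one candidate.
    """
    if not candidates:
        return []
    # best[pos][j] = (cumulative cost, index of the best predecessor)
    best = [[(cand[2], -1) for cand in candidates[0]]]
    for pos in range(1, len(candidates)):
        row = []
        for _unit_id, voice, cost in candidates[pos]:
            options = []
            for k, (prev_cost, _) in enumerate(best[-1]):
                prev_voice = candidates[pos - 1][k][1]
                penalty = switch_penalty if prev_voice != voice else 0.0
                options.append((prev_cost + penalty + cost, k))
            row.append(min(options))
        best.append(row)
    # Backtrace from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    path = []
    for pos in range(len(candidates) - 1, -1, -1):
        path.append(candidates[pos][j][0])
        j = best[pos][j][1]
    return path[::-1]


if __name__ == "__main__":
    toy_features = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
    print(assign_voice_classes(toy_features, num_classes=2))
    toy_candidates = [
        [("a1", 0, 0.2), ("a2", 1, 0.1)],
        [("b1", 0, 0.1), ("b2", 1, 0.5)],
        [("c1", 0, 0.1), ("c2", 1, 0.3)],
    ]
    print(select_units(toy_candidates, switch_penalty=1.0))
```

In this sketch the class labels act only as a consistency constraint during the search; a real system would combine them with its existing target and concatenation costs, which the abstract does not describe.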
Bibliographic reference. Zhou, Yi / Zu, Yiqing (2003): "Unit selection based on voice recognition", In EUROSPEECH-2003, 265-268.