September 22-25, 1997
High-quality corpus-based synthetic speech requires minimizing the prosodic and acoustic distortions between an ideal phoneme sequence and the actual waveform segments used to reproduce it. Our synthesis system concatenates phoneme-sized waveform segments, selected from a large-scale speech database according to both prosodic and phonetic contextual suitability, without applying signal processing. This paper describes an approach to optimising such unit selection by using voice source parameters and formant information instead of cepstral features. We present results showing that formants and voice source parameters are more effective acoustic features for unit selection; these features can be estimated automatically from speech waveforms using the ARX joint estimation method. The results are compared with the mel-frequency cepstrum coefficients (MFCC) previously used for unit selection. Both objective and subjective experiments showed that the new features outperformed the previous ones and confirmed that the synthesized speech sounded much more natural.
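The abstract describes concatenative unit selection: choosing, for each phoneme slot, the database segment whose acoustic features best match the target while joining smoothly to its neighbours. As a rough illustration (not the CHATR implementation, and with hypothetical feature vectors standing in for formant/voice-source or MFCC features), a Viterbi-style search over target and concatenation costs might look like this:

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_units(targets, candidates, w_target=1.0, w_join=1.0):
    """Toy Viterbi unit selection (illustrative sketch only).

    targets:    list of target feature vectors, one per phoneme slot
    candidates: list of lists of candidate feature vectors per slot
    Returns the candidate index chosen for each slot, minimizing
    weighted target cost (distance to the ideal features) plus
    concatenation cost (distance between adjacent chosen units).
    """
    n = len(targets)
    # cost[i][j]: best cumulative cost ending at candidate j of slot i
    cost = [[w_target * dist(targets[0], c) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row, brow = [], []
        for c in candidates[i]:
            tc = w_target * dist(targets[i], c)
            best_k, best = min(
                ((k, cost[i - 1][k] + w_join * dist(p, c))
                 for k, p in enumerate(candidates[i - 1])),
                key=lambda kv: kv[1])
            row.append(best + tc)
            brow.append(best_k)
        cost.append(row)
        back.append(brow)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(cost[-1])), key=lambda k: cost[-1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]
```

The paper's contribution can be read as changing what the feature vectors contain (formants and voice source parameters instead of MFCCs), not the search itself.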
Bibliographic reference. Ding, Wen / Campbell, Nick (1997): "Optimising unit selection with voice source and formants in the CHATR speech synthesis system", In EUROSPEECH-1997, 537-540.