Fifth ISCA ITRW on Speech Synthesis
June 14-16, 2004
The level of quality that can be achieved by modern concatenative text-to-speech synthesis depends heavily on the optimization criteria used in the unit selection process. While effective cost functions arise naturally in the assessment of prosodic characteristics, the criteria typically selected to quantify discontinuities at the speech signal level do not tightly reflect users’ perception of the resulting acoustic waveform. This paper introduces a novel discontinuity measure which jointly, albeit implicitly, accounts for both interframe incoherence and discrepancies in formant frequencies/bandwidths. This metric is derived from a distinct feature extraction paradigm, eschewing general-purpose Fourier analysis in favor of a separately optimized modal decomposition for each boundary region. This alternative transform framework preserves, by construction, those properties of the waveform which are globally relevant to each concatenation considered. Experimental evaluations are conducted to characterize the behavior of the new measure, first on a contiguity prediction task, and then via a systematic listening comparison using a conventional metric as baseline. The results underscore the viability of the proposed approach in quantifying the perception of discontinuity between acoustic units.
Bibliographic reference. Bellegarda, Jerome R. (2004): "A novel discontinuity metric for unit selection text-to-speech synthesis", In SSW5-2004, 133-138.