15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Refined Inter-Segment Joining in Multi-Form Speech Synthesis

Alexander Sorin (1), Slava Shechtman (1), Vincent Pollet (2)

(1) IBM Research Haifa, Israel
(2) Nuance Communications, Belgium

In multi-form speech synthesis, speech output is constructed by splicing waveform segments with parametric speech segments generated from statistical models. The decision whether to use the waveform or the statistical parametric form is made per segment. This approach faces certain challenges in the context of inter-segment joining. In this work, we present a novel method whereby all non-contiguous joints are represented by statistically generated speech frames without compromising naturalness. Speech frames surrounding non-contiguous joints between the waveform segments are re-generated from the models and optimized for concatenation. In addition, we apply a novel pitch smoothing algorithm that preserves the original intonation trajectory while maintaining smoothness. We implemented the spectrum and pitch smoothing algorithms within a multi-form speech synthesis framework that employs a uniform parametric representation for the natural and statistically modeled speech segments. This framework facilitates pitch modification in natural segments. Subjective evaluation results reveal that the proposed smoothing methods significantly improve the perceived speech quality.

Bibliographic reference. Sorin, Alexander / Shechtman, Slava / Pollet, Vincent (2014): "Refined inter-segment joining in multi-form speech synthesis", in INTERSPEECH-2014, 790-794.