In multi-form speech synthesis, speech output is constructed by splicing waveform segments with parametric speech segments generated from statistical models. The decision whether to use the waveform or the statistical parametric form is made per segment. This approach poses particular challenges at inter-segment joints. In this work, we present a novel method whereby all non-contiguous joints are rendered with statistically generated speech frames without compromising naturalness. Speech frames surrounding non-contiguous joints between the waveform segments are re-generated from the models and optimized for concatenation. In addition, a novel pitch smoothing algorithm is applied that preserves the original intonation trajectory while maintaining smoothness across the joint. We implemented the spectrum and pitch smoothing algorithms within a multi-form speech synthesis framework that employs a uniform parametric representation for both natural and statistically modeled speech segments. This framework also facilitates pitch modification in natural segments. Subjective evaluation results show that the proposed smoothing methods significantly improve perceived speech quality.
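The paper does not specify its pitch smoothing algorithm in the abstract, but the stated goal of removing the pitch discontinuity at a joint while preserving each segment's original intonation trajectory can be illustrated with a minimal sketch. The function name, the half-and-half split of the correction, and the geometric decay scheme below are all illustrative assumptions, not the authors' method:

```python
import numpy as np

def smooth_joint_pitch(f0_left, f0_right, decay=0.7):
    """Illustrative joint smoothing (NOT the paper's algorithm):
    cancel the F0 jump at a non-contiguous joint with a correction
    that decays geometrically with distance from the joint, so each
    contour keeps its original shape away from the joint."""
    f0_left = np.asarray(f0_left, dtype=float)
    f0_right = np.asarray(f0_right, dtype=float)
    # Pitch discontinuity at the joint (Hz or log-F0, as preferred).
    jump = f0_left[-1] - f0_right[0]
    n_l, n_r = len(f0_left), len(f0_right)
    # Each side absorbs half of the jump; the correction shrinks
    # as we move away from the joint, preserving the trajectory.
    corr_left = -0.5 * jump * decay ** np.arange(n_l - 1, -1, -1)
    corr_right = 0.5 * jump * decay ** np.arange(n_r)
    return f0_left + corr_left, f0_right + corr_right
```

After smoothing, the two contours meet exactly at the joint (the corrected endpoints coincide), while frames far from the joint receive a vanishingly small correction, which is one simple way to reconcile smoothness with trajectory preservation.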
Bibliographic reference. Sorin, Alexander / Shechtman, Slava / Pollet, Vincent (2014): "Refined inter-segment joining in multi-form speech synthesis", in Proceedings of INTERSPEECH 2014, 790-794.