ISCA Archive SSW 2021

Impact of Segmentation and Annotation in French end-to-end Synthesis

Martin Lenglet, Olivier Perrotin, Gérard Bailly

Audio books are commonly used to train text-to-speech (TTS) models, as they offer rich phonetic content with rather expressive pronunciation, but the number and size of publicly available audio book corpora differ between languages. Moreover, the quality and accuracy of the available utterance segmentations are debatable, yet the impact of segmentation on the output synthesis is not well established. Additionally, utterances are generally used individually, without taking advantage of text-level structuring information, even though this information influences how the speaker reads. In this paper, we conduct a multidimensional evaluation of Tacotron2 trained on different segmentations and text-level annotations of the same French corpus. We show that both spectrum accuracy and expressiveness depend on the segmentation used. In particular, a shorter segmentation, combined with the annotation of paragraphs, benefits spectrum reconstruction to the detriment of phrasing. Multidimensional analysis of mean opinion scores obtained with a MUSHRA experiment revealed that phrasing was relatively more important than spectrum accuracy in perceptual judgement. This work serves as evidence that particular attention must be given to model evaluation, as well as to how the training corpus is used to maximize the synthesis characteristics of interest.

doi: 10.21437/SSW.2021-3

Cite as: Lenglet, M., Perrotin, O., Bailly, G. (2021) Impact of Segmentation and Annotation in French end-to-end Synthesis. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 13-18, doi: 10.21437/SSW.2021-3

  author={Martin Lenglet and Olivier Perrotin and Gérard Bailly},
  title={{Impact of Segmentation and Annotation in French end-to-end Synthesis}},
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},