Corpus Design Using Convolutional Auto-Encoder Embeddings for Audio-Book Synthesis

Meysam Shamsi, Damien Lolive, Nelly Barbot, Jonathan Chevelu


In this study, we propose a script selection approach for designing TTS speech corpora. A Deep Convolutional Neural Network (DCNN) projects linguistic information into an embedding space. The embedded representation of the corpus is then fed to a selection process that extracts a subset of utterances offering good linguistic coverage while limiting the repetition of linguistic units. We present two selection processes: a clustering approach based on utterance distance, and a method that aims to reach a target distribution of linguistic events. We compare the synthetic signal quality of the proposed methods to state-of-the-art methods, both objectively and subjectively. Both kinds of measures confirm that the proposed methods yield speech corpora with better synthetic speech quality. The perceptual test shows that our TTS global cost can be used as an alternative measure of overall synthetic quality.
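To make the clustering-based selection idea concrete, here is a minimal sketch: given fixed-size utterance embeddings (such as those produced by a convolutional auto-encoder), cluster them with plain k-means and keep the medoid utterance of each cluster, so the selected subset spreads over the embedding space while avoiding near-duplicate utterances. This is only an illustration of the general idea; the embeddings are random placeholders, and the function name and the use of k-means medoids are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def select_utterances(embeddings, k, iters=20, seed=0):
    """Pick k representative utterances: run plain k-means on the
    embeddings, then return the index of the utterance closest to
    each centroid (the cluster medoid).

    Illustrative sketch only -- the paper's actual selection
    procedure may differ."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    # initialise centroids from k distinct utterances
    centroids = embeddings[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        # assign each utterance to its nearest centroid
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute centroids; keep the old one for empty clusters
        for j in range(k):
            members = embeddings[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # medoid: the actual utterance closest to each final centroid
    dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
    return sorted(set(dists.argmin(axis=0).tolist()))

# toy corpus: 100 "utterance embeddings" in a 16-dim space
emb = np.random.default_rng(1).normal(size=(100, 16))
subset = select_utterances(emb, k=10)
print(len(subset), "utterances selected")
```

The second selection strategy described in the abstract would replace the clustering step with a greedy criterion that adds whichever utterance moves the subset's distribution of linguistic events closest to a target distribution.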


 DOI: 10.21437/Interspeech.2019-2190

Cite as: Shamsi, M., Lolive, D., Barbot, N., Chevelu, J. (2019) Corpus Design Using Convolutional Auto-Encoder Embeddings for Audio-Book Synthesis. Proc. Interspeech 2019, 1531-1535, DOI: 10.21437/Interspeech.2019-2190.


@inproceedings{Shamsi2019,
  author={Meysam Shamsi and Damien Lolive and Nelly Barbot and Jonathan Chevelu},
  title={{Corpus Design Using Convolutional Auto-Encoder Embeddings for Audio-Book Synthesis}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1531--1535},
  doi={10.21437/Interspeech.2019-2190},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2190}
}