ISCA Archive SSW 2021

Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati, Thomas Drugman

We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with improved coarse- and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism in which the system first predicts coarser-scale mel-spectrograms that capture the suprasegmental information in speech, and then uses these coarser-scale mel-spectrograms to predict finer-scale mel-spectrograms that capture fine-grained prosody. We present details for two specific versions of MSS, called Word-level MSS and Sentence-level MSS, where the scales in our system are motivated by linguistic units. The Word-level MSS models word-, phoneme-, and frame-level spectrograms, while the Sentence-level MSS additionally models a sentence-level spectrogram. Subjective evaluations show that Word-level MSS performs statistically significantly better than the baseline on two voices.
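
To make the notion of scales concrete, the sketch below builds phoneme-, word-, and sentence-level spectrograms from a frame-level mel-spectrogram by averaging the frames that belong to each linguistic unit. The abstract does not say how the coarser-scale spectrograms are derived, so mean pooling is an assumption made here for illustration only, and the helper names (`pool_segments`, `group_lengths`) and the toy alignment are hypothetical. The sketch covers only how such multi-scale representations could be constructed, not the coarse-to-fine prediction model itself.

```python
from typing import List, Sequence

Frame = List[float]  # one mel-spectrogram frame: one value per mel bin


def pool_segments(frames: Sequence[Frame], lengths: Sequence[int]) -> List[Frame]:
    """Average consecutive runs of frames; lengths[i] is the frame count of unit i.

    Mean pooling is an assumption for this sketch, not the paper's stated method.
    """
    if sum(lengths) != len(frames):
        raise ValueError("segment lengths must sum to the number of frames")
    pooled: List[Frame] = []
    start = 0
    for n in lengths:
        if n <= 0:
            raise ValueError("every segment must contain at least one frame")
        segment = frames[start:start + n]
        n_bins = len(segment[0])
        pooled.append([sum(f[b] for f in segment) / n for b in range(n_bins)])
        start += n
    return pooled


def group_lengths(inner: Sequence[int], group_sizes: Sequence[int]) -> List[int]:
    """Turn per-phoneme frame counts into per-word frame counts."""
    if sum(group_sizes) != len(inner):
        raise ValueError("group sizes must sum to the number of inner units")
    out: List[int] = []
    i = 0
    for g in group_sizes:
        out.append(sum(inner[i:i + g]))
        i += g
    return out


# Toy utterance (hypothetical values): 6 frames, 2 mel bins.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0],
          [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
frames_per_phoneme = [2, 1, 3]   # 3 phonemes
phonemes_per_word = [2, 1]       # 2 words

frames_per_word = group_lengths(frames_per_phoneme, phonemes_per_word)

phoneme_level = pool_segments(frames, frames_per_phoneme)
word_level = pool_segments(frames, frames_per_word)
sentence_level = pool_segments(frames, [len(frames)])

print("phoneme-level:", phoneme_level)
print("word-level:", word_level)
print("sentence-level:", sentence_level)
```

In this toy alignment the frame-level input has six rows, the phoneme level three, the word level two, and the sentence level one. That ordering mirrors the coarse-to-fine hierarchy the abstract describes, with Word-level MSS using the first three scales and Sentence-level MSS adding the last.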


doi: 10.21437/SSW.2021-31

Cite as: Abbas, A., Bollepalli, B., Moinet, A., Joly, A., Karanasou, P., Makarov, P., Slangens, S., Karlapati, S., Drugman, T. (2021) Multi-Scale Spectrogram Modelling for Neural Text-to-Speech. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 177-182, doi: 10.21437/SSW.2021-31

@inproceedings{abbas21_ssw,
  author={Ammar Abbas and Bajibabu Bollepalli and Alexis Moinet and Arnaud Joly and Penny Karanasou and Peter Makarov and Simon Slangens and Sri Karlapati and Thomas Drugman},
  title={{Multi-Scale Spectrogram Modelling for Neural Text-to-Speech}},
  year=2021,
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},
  pages={177--182},
  doi={10.21437/SSW.2021-31}
}