ISCA Archive SSW 2021

Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech

Pilar Oplustil-Gallegos, Johannah O'Mahony, Simon King

Text alone does not contain sufficient information to predict the spoken form. Using additional information, such as the linguistic context, should improve Text-to-Speech naturalness in general, and prosody in particular. Most recent research on using context is limited to using textual features of adjacent utterances, extracted with large pre-trained language models such as BERT. In this paper, we compare multiple representations of linguistic context by conditioning a Text-to-Speech model on features of the preceding utterance. We experiment with three design choices: (1) acoustic vs. textual representations; (2) features extracted with large pre-trained models vs. features learnt jointly during training; and (3) representing context at the utterance level vs. word level. Our results show that appropriate representations of either text or acoustic context alone yield significantly better naturalness than a baseline that does not use context. Combining an utterance-level acoustic representation with a word-level textual representation gave the best results overall.
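The conditioning strategies compared in the abstract can be illustrated with a minimal numpy sketch. All dimensions and the mean-pooling step are illustrative assumptions, not details from the paper: an utterance-level context is a single vector broadcast across encoder timesteps, while word-level context vectors must first be aligned or pooled to the encoder's time axis before concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): phone-encoder outputs,
# an utterance-level acoustic context vector, and word-level textual
# context vectors (e.g. embeddings of the previous utterance's words).
T, d_enc = 12, 64          # encoder timesteps and encoder width
d_utt, d_word = 32, 48     # context vector sizes

encoder_out = rng.normal(size=(T, d_enc))
utt_context = rng.normal(size=(d_utt,))       # one vector per previous utterance
word_context = rng.normal(size=(5, d_word))   # one vector per previous word

# Utterance-level conditioning: broadcast the single context vector
# across all encoder timesteps, then concatenate.
utt_tiled = np.tile(utt_context, (T, 1))

# Word-level conditioning: pool the word vectors to one summary vector
# (mean here; an attention mechanism would be more typical in practice)
# so they align with the encoder time axis.
word_pooled = np.tile(word_context.mean(axis=0), (T, 1))

conditioned = np.concatenate([encoder_out, utt_tiled, word_pooled], axis=1)
print(conditioned.shape)  # (12, 144)
```

The combined representation simply grows the encoder width by the sum of the context dimensions; the downstream decoder then consumes the conditioned sequence as usual.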


doi: 10.21437/SSW.2021-36

Cite as: Oplustil-Gallegos, P., O'Mahony, J., King, S. (2021) Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 205-210, doi: 10.21437/SSW.2021-36

@inproceedings{oplustilgallegos21_ssw,
  author={Pilar Oplustil-Gallegos and Johannah O'Mahony and Simon King},
  title={{Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech}},
  year={2021},
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},
  pages={205--210},
  doi={10.21437/SSW.2021-36}
}