ISCA Archive SSW 2023

An analysis on the effects of speaker embedding choice in non auto-regressive TTS

Adriana Stan, Johannah O'Mahony

In this paper we present a first attempt at understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets. We analyse whether jointly learning the representations and initialising them from pretrained models yield any quality improvements for target speaker identities. In a separate analysis, we investigate how the different sets of embeddings impact the network’s core speech abstraction (i.e. zero conditioned) in terms of speaker identity and representation learning. We show that, regardless of the embedding set and learning strategy used, the network can handle various speaker identities equally well, with barely noticeable variations in speech output quality, and that speaker leakage within the core structure of the synthesis system is inevitable in the standard training procedures adopted thus far.


doi: 10.21437/SSW.2023-21

Cite as: Stan, A., O'Mahony, J. (2023) An analysis on the effects of speaker embedding choice in non auto-regressive TTS. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 134-138, doi: 10.21437/SSW.2023-21

@inproceedings{stan23_ssw,
  author={Adriana Stan and Johannah O'Mahony},
  title={{An analysis on the effects of speaker embedding choice in non auto-regressive TTS}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={134--138},
  doi={10.21437/SSW.2023-21}
}