ISCA Archive Interspeech 2023

Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis

Seongyeon Park, Bohyung Kim, Tae-Hyun Oh

Recently, zero-shot TTS and VC methods have gained attention for their practicality: they can generate voices unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance while retaining the useful properties inherited from VITS. However, the performance of VITS and VITS-based zero-shot models varies dramatically depending on how the losses are balanced, which is problematic because finding the optimal balance requires a burdensome search over loss-balance hyper-parameters. In this work, we propose a novel framework that finds this optimum without search, by inducing the decoder of VITS-based models to reach its full reconstruction ability. With our framework, we outperform baselines in zero-shot TTS and VC, achieving state-of-the-art performance. Furthermore, we show the robustness of our framework in various settings, and we provide an explanation for the results in the discussion.
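For context on the trade-off the abstract refers to, the sketch below (not the authors' code) illustrates a VITS-style training objective as a weighted sum of loss terms; the scalar weights such as c_mel and c_kl are the loss-balance hyper-parameters whose manual tuning this paper aims to eliminate. All names and values here are illustrative assumptions.

```python
# Minimal sketch of a VITS-style weighted loss (illustrative, not the paper's method).
# The weights c_mel and c_kl are the loss-balance hyper-parameters that normally
# have to be tuned by search; the paper proposes finding the balance without search.
import torch

def vits_style_total_loss(mel_loss: torch.Tensor,
                          kl_loss: torch.Tensor,
                          adv_loss: torch.Tensor,
                          duration_loss: torch.Tensor,
                          c_mel: float = 45.0,  # illustrative default, as in the public VITS recipe
                          c_kl: float = 1.0) -> torch.Tensor:
    """Weighted sum of reconstruction, KL, adversarial, and duration terms."""
    return c_mel * mel_loss + c_kl * kl_loss + adv_loss + duration_loss

# Different (c_mel, c_kl) choices change which term dominates training,
# which is why model quality varies with the chosen balance.
terms = [torch.tensor(0.8), torch.tensor(2.1), torch.tensor(1.3), torch.tensor(0.4)]
print(vits_style_total_loss(*terms))             # default balance
print(vits_style_total_loss(*terms, c_mel=5.0))  # reconstruction term weighted far less
```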


doi: 10.21437/Interspeech.2023-58

Cite as: Park, S., Kim, B., Oh, T.-H. (2023) Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis. Proc. INTERSPEECH 2023, 4319-4323, doi: 10.21437/Interspeech.2023-58

@inproceedings{park23_interspeech,
  author={Seongyeon Park and Bohyung Kim and Tae-Hyun Oh},
  title={{Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4319--4323},
  doi={10.21437/Interspeech.2023-58}
}