ISCA Archive Interspeech 2021

Normalization Driven Zero-Shot Multi-Speaker Speech Synthesis

Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall

In this paper, we present a novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) that combines a normalization architecture and a speaker encoder with a non-autoregressive, multi-head-attention-driven encoder-decoder architecture. Given an input text and a reference speech sample of an unseen person, ZSM-SS can generate speech in that person’s style in a zero-shot manner. Additionally, we demonstrate how the affine parameters of normalization help capture prosodic features such as energy and fundamental frequency in a disentangled fashion, and how they can be used to generate morphed speech output. We demonstrate the efficacy of our proposed architecture on the multi-speaker VCTK [1] and LibriTTS [2] datasets, using multiple quantitative metrics that measure the distortion of the generated speech and its MOS, along with a speaker embedding analysis of the proposed speaker encoder model.
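
The role of the affine parameters can be illustrated with a generic conditional layer normalization, in which the scale and shift are predicted from a conditioning vector rather than learned as fixed weights. The following is a minimal sketch under stated assumptions, not the architecture from the paper: the function name `conditional_layer_norm`, the linear projections `w_gamma` and `w_beta`, and all dimensions are hypothetical, and the conditioning vector stands in for whatever speaker or prosody features a real system would supply.

```python
import numpy as np

def conditional_layer_norm(x, cond, w_gamma, w_beta, eps=1e-5):
    """Normalize x over its feature axis, then apply an affine transform
    whose scale and shift are predicted from the conditioning vector."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Hypothetical linear projections; a zero conditioning vector gives
    # gamma = 1 and beta = 0, i.e. plain layer normalization.
    gamma = 1.0 + cond @ w_gamma
    beta = cond @ w_beta
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
frames, d_model, d_cond = 4, 8, 3          # illustrative sizes only
x = rng.normal(size=(frames, d_model))     # stand-in for hidden text features
cond = rng.normal(size=(d_cond,))          # stand-in for a speaker/prosody vector
w_gamma = 0.1 * rng.normal(size=(d_cond, d_model))
w_beta = 0.1 * rng.normal(size=(d_cond, d_model))

y = conditional_layer_norm(x, cond, w_gamma, w_beta)
```

In this sketch, changing `cond` changes only the scale and shift applied to the normalized features, which is the sense in which an affine transform can carry style information separately from content.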


doi: 10.21437/Interspeech.2021-441

Cite as: Kumar, N., Goel, S., Narang, A., Lall, B. (2021) Normalization Driven Zero-Shot Multi-Speaker Speech Synthesis. Proc. Interspeech 2021, 1354-1358, doi: 10.21437/Interspeech.2021-441

@inproceedings{kumar21c_interspeech,
  author={Neeraj Kumar and Srishti Goel and Ankur Narang and Brejesh Lall},
  title={{Normalization Driven Zero-Shot Multi-Speaker Speech Synthesis}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1354--1358},
  doi={10.21437/Interspeech.2021-441}
}