Multi-Speaker Modeling for DNN-based Speech Synthesis Incorporating Generative Adversarial Networks

Hiroki Kanagawa, Yusuke Ijima


This paper presents a novel DNN-based speech synthesis method trained on multi-speaker data. In general, speaker-dependent modeling techniques based on generative adversarial networks (GANs) improve synthesized speech quality. However, they are inadequate for multi-speaker training because conventional discriminators cannot take speaker identity into account, which degrades the anti-spoofing performance of the GAN discriminator. We introduce two approaches that let the discriminator learn speaker characteristics: auxiliary features and auxiliary tasks. The first uses speaker codes as additional discriminator input. The second uses speaker identification as an auxiliary task alongside anti-spoofing verification. Experimental results showed that our proposed techniques outperformed conventional and GAN-based methods.
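
The two discriminator extensions named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the network architecture, the feature dimensions, the loss weighting, or whether the two approaches are combined, so every name and size below (`N_SPEAKERS`, `FEAT_DIM`, `HIDDEN`, `spk_weight`, the single hidden layer, and the use of both extensions in one network) is a hypothetical choice made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SPEAKERS = 4   # hypothetical number of training speakers
FEAT_DIM = 8     # hypothetical acoustic feature dimension
HIDDEN = 16      # hypothetical hidden layer size


def one_hot(speaker_id, n_speakers=N_SPEAKERS):
    """Speaker code: a one-hot vector identifying the speaker."""
    code = np.zeros(n_speakers)
    code[speaker_id] = 1.0
    return code


def init_params(in_dim, hidden=HIDDEN, n_speakers=N_SPEAKERS):
    """Random weights for a one-hidden-layer discriminator with two heads."""
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w_real": rng.normal(0.0, 0.1, hidden),             # real/fake head
        "b_real": 0.0,
        "W_spk": rng.normal(0.0, 0.1, (hidden, n_speakers)),  # speaker-ID head
        "b_spk": np.zeros(n_speakers),
    }


def discriminator(params, feats, speaker_code=None):
    """Return (probability the input is natural speech, speaker posterior)."""
    # Approach 1 (auxiliary feature): append the speaker code to the input.
    x = feats if speaker_code is None else np.concatenate([feats, speaker_code])
    h = np.tanh(x @ params["W1"] + params["b1"])
    # Anti-spoofing head: sigmoid over a scalar logit.
    p_real = 1.0 / (1.0 + np.exp(-(h @ params["w_real"] + params["b_real"])))
    # Approach 2 (auxiliary task): softmax over speaker identities.
    logits = h @ params["W_spk"] + params["b_spk"]
    logits = logits - logits.max()   # subtract max for numerical stability
    p_spk = np.exp(logits) / np.exp(logits).sum()
    return p_real, p_spk


def discriminator_loss(p_real, p_spk, is_real, speaker_id, spk_weight=1.0):
    """Binary cross-entropy plus a weighted speaker-ID cross-entropy."""
    bce = -np.log(p_real) if is_real else -np.log(1.0 - p_real)
    ce = -np.log(p_spk[speaker_id])
    return bce + spk_weight * ce


# Usage: score one frame of (random stand-in) natural speech from speaker 2.
feats = rng.normal(size=FEAT_DIM)
params = init_params(FEAT_DIM + N_SPEAKERS)
p_real, p_spk = discriminator(params, feats, one_hot(2))
loss = discriminator_loss(p_real, p_spk, is_real=True, speaker_id=2)
```

In this sketch the speaker code changes only the input layer, while the speaker-identification task adds a second output head and a second loss term, so either extension could be dropped without touching the other.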


DOI: 10.21437/SSW.2019-8

Cite as: Kanagawa, H., Ijima, Y. (2019) Multi-Speaker Modeling for DNN-based Speech Synthesis Incorporating Generative Adversarial Networks. Proc. 10th ISCA Speech Synthesis Workshop, 40-44, DOI: 10.21437/SSW.2019-8.


@inproceedings{Kanagawa2019,
  author={Hiroki Kanagawa and Yusuke Ijima},
  title={{Multi-Speaker Modeling for DNN-based Speech Synthesis Incorporating Generative Adversarial Networks}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={40--44},
  doi={10.21437/SSW.2019-8},
  url={http://dx.doi.org/10.21437/SSW.2019-8}
}