Sparse Approximation of Gram Matrices for GMMN-based Speech Synthesis

Tomoki Koriyama, Shinnosuke Takamichi, Takao Kobayashi

This paper discusses a training method for a speech synthesis framework based on the generative moment matching network (GMMN). GMMN is a deep generative model optimized by minimizing the conditional maximum mean discrepancy (CMMD), and the GMMN-based speech synthesis system models the distribution of acoustic features. Although computing CMMD is infeasible for large amounts of data, methods for reducing its computational complexity were not examined in the previous study. In this paper, we propose an approximation method based on random Fourier features (RFFs) and a minibatch selection technique using K-means clustering. Experimental evaluations show that the proposed method outperformed the conventional one in terms of perceived inter-utterance variation.
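As a rough illustration of the RFF idea named in the abstract, the following sketch approximates a Gaussian-kernel Gram matrix with a low-rank product of random features. It is not the authors' implementation: the kernel choice, the bandwidth `sigma`, the feature count, the toy data, and the helper names `gaussian_gram` and `rff_features` are all assumptions made for this example, and the CMMD loss and the K-means minibatch selection are not shown.

```python
import numpy as np


def gaussian_gram(X, sigma):
    """Exact Gram matrix of k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def rff_features(X, num_features, sigma, rng):
    """Random Fourier features z(x) such that z(x) . z(y) approximates k(x, y)."""
    d = X.shape[1]
    # Frequencies drawn from the Gaussian kernel's spectral density N(0, I / sigma^2).
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # toy data: 200 points, 5 dimensions (hypothetical)
sigma = 2.0                     # assumed kernel bandwidth

Z = rff_features(X, num_features=2000, sigma=sigma, rng=rng)
K_exact = gaussian_gram(X, sigma)   # N x N matrix, built from pairwise distances
K_approx = Z @ Z.T                  # same matrix, approximated via N x D features

max_err = np.max(np.abs(K_exact - K_approx))
print("max abs error:", max_err)
```

The point of the factorization is that quantities involving the N x N Gram matrix can be rewritten in terms of the N x D feature matrix `Z`, which is cheaper when D is much smaller than N. The approximation error shrinks roughly as the inverse square root of the number of features.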

 DOI: 10.21437/SSW.2019-27

Cite as: Koriyama, T., Takamichi, S., Kobayashi, T. (2019) Sparse Approximation of Gram Matrices for GMMN-based Speech Synthesis. Proc. 10th ISCA Speech Synthesis Workshop, 149-154, DOI: 10.21437/SSW.2019-27.

  author={Tomoki Koriyama and Shinnosuke Takamichi and Takao Kobayashi},
  title={{Sparse Approximation of Gram Matrices for GMMN-based Speech Synthesis}},
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},