Deep Mixture-of-Experts Models for Synthetic Prosodic-Contour Generation

Raul Fernandez


Deep recurrent neural networks have been shown to provide state-of-the-art performance when generating prosodic contours in a speech-synthesis system. These models benefit from the representational capacity obtained by increased compositionality across many layers. As larger amounts of data become available, larger and deeper architectures can be trained, at the cost of models that are expensive in both computation and latency. In this work we take an alternative approach and divide the learning among an ensemble of experts, each of which is a smaller and/or shallower learner. Their predictions are arbitrated by a switching module that maps sequences of linguistic features to global, sequence-level posteriors, and uses this information to weight the members of the ensemble. Compared with a single deep cascaded model, this approach is more parallelizable, and can be exploited to obtain a more efficient model in terms of computation (as measured by overall model-size reduction) and latency (as measured by the reduction of parameters traversed via branching). We present an architecture in which the cluster-assignment and prediction models can be trained simultaneously, and demonstrate such gains in efficiency without sacrificing the perceptual quality of the predictions in a subjective listening test.
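The gating idea described in the abstract can be illustrated with a minimal sketch: a switching module pools a sequence of linguistic features into a global summary, produces a softmax posterior over experts, and combines the experts' outputs with those weights. All names, dimensions, and the use of plain feed-forward experts here are illustrative assumptions, not the paper's actual architecture (which uses recurrent learners trained jointly with the gate):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class Expert:
    """Illustrative stand-in for a small/shallow prosody predictor
    (the paper's experts are recurrent; a one-hidden-layer MLP is used
    here only to keep the sketch self-contained)."""
    def __init__(self, in_dim, hid_dim, out_dim):
        self.W1 = rng.standard_normal((in_dim, hid_dim)) * 0.1
        self.b1 = np.zeros(hid_dim)
        self.W2 = rng.standard_normal((hid_dim, out_dim)) * 0.1
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return h @ self.W2 + self.b2

class MixtureOfExperts:
    """Switching module: pool the linguistic-feature sequence into a
    global summary, compute a sequence-level posterior over experts,
    and return the posterior-weighted sum of expert predictions."""
    def __init__(self, in_dim, hid_dim, out_dim, n_experts):
        self.experts = [Expert(in_dim, hid_dim, out_dim)
                        for _ in range(n_experts)]
        self.Wg = rng.standard_normal((in_dim, n_experts)) * 0.1

    def __call__(self, seq):
        # seq: (T, in_dim) sequence of per-frame linguistic features
        pooled = seq.mean(axis=0)                 # global summary
        weights = softmax(pooled @ self.Wg)       # (n_experts,) posterior
        preds = np.stack([e(seq) for e in self.experts])  # (n_experts, T, out_dim)
        contour = np.tensordot(weights, preds, axes=1)    # (T, out_dim)
        return contour, weights

moe = MixtureOfExperts(in_dim=8, hid_dim=16, out_dim=1, n_experts=4)
contour, weights = moe(rng.standard_normal((20, 8)))
print(contour.shape)   # predicted contour, one value per frame
print(weights)         # sequence-level posterior over the 4 experts
```

A hard (top-1) gate would instead evaluate only the highest-posterior expert, which is how the branching-based latency reduction mentioned in the abstract can be realized: only one small expert's parameters are traversed per sequence.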


DOI: 10.21437/SSW.2019-47

Cite as: Fernandez, R. (2019) Deep Mixture-of-Experts Models for Synthetic Prosodic-Contour Generation. Proc. 10th ISCA Speech Synthesis Workshop, 263-268, DOI: 10.21437/SSW.2019-47.


@inproceedings{Fernandez2019,
  author={Raul Fernandez},
  title={{Deep Mixture-of-Experts Models for Synthetic Prosodic-Contour Generation}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={263--268},
  doi={10.21437/SSW.2019-47},
  url={http://dx.doi.org/10.21437/SSW.2019-47}
}