Waveform-Based Speaker Representations for Speech Synthesis

Moquan Wan, Gilles Degottex, Mark J.F. Gales

Speaker adaptation is a key aspect of building a range of speech processing systems, for example personalised speech synthesis. For deep-learning-based approaches, the model parameters are hard to interpret, making speaker adaptation more challenging. One widely used method to address this problem is to extract a fixed-length vector as a speaker representation and use this as an additional input to the task-specific model. This allows speaker-specific output to be generated without modifying the model parameters. However, the speaker representation is often extracted in a task-independent fashion. This allows the same approach to be used for a range of tasks, but the extracted representation is unlikely to be optimal for the specific task of interest. Furthermore, the features from which the speaker representation is extracted are usually pre-defined, often a standard speech representation. This may limit the information available to the extractor. In this paper, an integrated optimisation framework for building a task-specific speaker representation, making use of all the available information, is proposed. Speech synthesis is used as the example task. The speaker representation is derived from the raw waveform, incorporating text information via an attention mechanism. This paper evaluates this framework and compares it with standard task-independent forms.
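
The general idea of pooling a variable-length waveform into a fixed-length speaker vector with attention, and then feeding that vector to a synthesis model as an extra input, can be illustrated with a minimal sketch. This is not the authors' architecture: the framing parameters, the random projection `proj`, the text-derived `text_query`, and all function names are hypothetical and chosen only for illustration. In an integrated framework such as the one proposed, these parameters would be trained jointly with the synthesis model rather than drawn at random.

```python
import math
import random

def frame_waveform(wave, frame_len, hop):
    """Split a 1-D waveform (list of floats) into overlapping frames."""
    return [wave[i:i + frame_len]
            for i in range(0, len(wave) - frame_len + 1, hop)]

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def speaker_vector(wave, text_query, proj, frame_len=400, hop=160):
    """Attention-pool projected waveform frames into one fixed-length vector.

    Assumptions (not from the paper): `proj` is a (dim x frame_len) matrix
    and `text_query` is a length-dim query standing in for text information.
    """
    frames = frame_waveform(wave, frame_len, hop)
    hidden = [[math.tanh(h) for h in matvec(proj, f)] for f in frames]
    # One attention score per frame: dot product with the text-derived query.
    scores = [sum(q * h for q, h in zip(text_query, hid)) for hid in hidden]
    weights = softmax(scores)
    dim = len(proj)
    # Weighted sum over frames: output size is dim, whatever the waveform length.
    return [sum(w * hid[d] for w, hid in zip(weights, hidden))
            for d in range(dim)]

def add_speaker_input(linguistic_frames, spk):
    """Append the same speaker vector to every input frame of the model."""
    return [list(frame) + list(spk) for frame in linguistic_frames]

rng = random.Random(0)
dim, frame_len = 8, 400
proj = [[rng.gauss(0, 0.05) for _ in range(frame_len)] for _ in range(dim)]
query = [rng.gauss(0, 1) for _ in range(dim)]
short_wave = [rng.uniform(-1, 1) for _ in range(8000)]
long_wave = [rng.uniform(-1, 1) for _ in range(24000)]

vec_short = speaker_vector(short_wave, query, proj)
vec_long = speaker_vector(long_wave, query, proj)
inputs = add_speaker_input([[0.0] * 5, [1.0] * 5], vec_short)
```

Both `vec_short` and `vec_long` have the same length even though the waveforms differ in duration, which is what allows the vector to be appended to every input frame of a task-specific model without changing that model's parameters.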

DOI: 10.21437/Interspeech.2018-1154

Cite as: Wan, M., Degottex, G., Gales, M.J. (2018) Waveform-Based Speaker Representations for Speech Synthesis. Proc. Interspeech 2018, 897-901, DOI: 10.21437/Interspeech.2018-1154.

  author={Moquan Wan and Gilles Degottex and Mark J.F. Gales},
  title={Waveform-Based Speaker Representations for Speech Synthesis},
  booktitle={Proc. Interspeech 2018},