Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors

Rama Doddipatla, Norbert Braunschweiler, Ranniery Maia

This paper presents a mechanism for speaker adaptation in speech synthesis based on deep neural networks (DNNs). The mechanism extracts speaker identification vectors, so-called d-vectors, from the training speakers and uses them jointly with the linguistic features to train a multi-speaker DNN-based text-to-speech synthesizer (DNN-TTS). The d-vectors are derived by applying principal component analysis (PCA) to the bottleneck features of a speaker classifier network. At the adaptation stage, three variants are explored: (1) d-vectors calculated using data from the target speaker, (2) d-vectors calculated as a weighted sum of d-vectors from training speakers, and (3) d-vectors calculated as an average of the above two approaches. The proposed method of unsupervised adaptation using the d-vector is compared with the commonly used i-vector based approach for speaker adaptation. Listening tests show that: (1) for speech quality, the d-vector based approach is significantly preferred over the i-vector based approach, and all d-vector variants perform similarly; (2) for speaker similarity, d-vector and i-vector based adaptation perform similarly, except for a small but significant preference for the averaged d-vector over the i-vector.
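The three adaptation variants described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the feature dimensions, the averaging of projected frames into one vector per speaker, and the use of externally supplied, normalized weights for variant (2) are all assumptions made for illustration; the abstract does not say how the weights are obtained. Random numbers stand in for real bottleneck features.

```python
import numpy as np

def fit_pca(features, n_components):
    """Fit PCA on frame-level bottleneck features (rows are frames)."""
    mean = features.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def d_vector(frames, mean, components):
    """Project frames with PCA, then average over frames (assumed pooling)."""
    return ((frames - mean) @ components.T).mean(axis=0)

def weighted_sum_d_vector(train_d_vectors, weights):
    """Variant (2): weighted sum of training-speaker d-vectors.
    Weights are normalized to sum to 1 (an assumption of this sketch)."""
    weights = np.asarray(weights, dtype=float)
    return (weights / weights.sum()) @ train_d_vectors

def averaged_d_vector(own, weighted):
    """Variant (3): average of variants (1) and (2)."""
    return 0.5 * (own + weighted)

# Toy data: 3 hypothetical training speakers, 16-dim bottleneck features.
rng = np.random.default_rng(0)
train_frames = [rng.normal(loc=i, size=(100, 16)) for i in range(3)]
mean, components = fit_pca(np.vstack(train_frames), n_components=4)
train_d = np.stack([d_vector(f, mean, components) for f in train_frames])

# Adaptation data from a hypothetical target speaker.
target_frames = rng.normal(loc=1.5, size=(50, 16))
variant1 = d_vector(target_frames, mean, components)       # (1) target data
variant2 = weighted_sum_d_vector(train_d, [0.2, 0.3, 0.5])  # (2) weighted sum
variant3 = averaged_d_vector(variant1, variant2)            # (3) average
```

In a multi-speaker DNN-TTS system of the kind the abstract describes, the chosen vector would be used jointly with the linguistic features as network input; that step is omitted here.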

DOI: 10.21437/Interspeech.2017-1038

Cite as: Doddipatla, R., Braunschweiler, N., Maia, R. (2017) Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors. Proc. Interspeech 2017, 3404-3408, DOI: 10.21437/Interspeech.2017-1038.

  author={Rama Doddipatla and Norbert Braunschweiler and Ranniery Maia},
  title={Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors},
  booktitle={Proc. Interspeech 2017},