Speaker Adaptation of Acoustic Model using a Few Utterances in DNN-based Speech Synthesis Systems

Ivan Himawan, Sandesh Aryal, Iris Ouyang, Shukhan Ng, Pierre Lanchantin


Synthesizing a person’s voice from only a few utterances is a highly desirable feature for personalized text-to-speech systems. This can be achieved by adapting an existing speaker-independent model to a target speaker such that the speaker variabilities due to a mismatch between training and testing conditions are minimized. In deep neural network (DNN) based speech synthesis, directly fine-tuning a large number of parameters is susceptible to overfitting, especially when the adaptation set is small. In this paper, we present a novel technique to estimate a speaker-specific model using a partial copy of the speaker-independent model, by creating a separate parallel branch stemming from an intermediate hidden layer of the base network. This allows the fine-tuning of the speaker-specific model to take into account the difference between the target speaker and the speaker-independent model output. Experimental results show that the proposed adaptation method achieves improved audio quality and higher speaker similarity compared to another DNN speaker adaptation technique.
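The branch idea in the abstract can be sketched roughly as follows. This is a minimal numpy illustration, not the authors' implementation: the layer sizes, the branch point, the zero-initialized branch output layer, and the additive combination of base and branch outputs are all assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical dimensions (the paper does not specify these here).
D_IN, D_H, D_OUT = 60, 32, 25

# Speaker-independent base network: three hidden layers plus a linear output.
base = [rng.standard_normal((d_in, d_out)) * 0.1
        for d_in, d_out in [(D_IN, D_H), (D_H, D_H), (D_H, D_H), (D_H, D_OUT)]]

# Speaker-specific branch: a partial copy of the base network's upper layers,
# stemming from an intermediate hidden layer. During adaptation only these
# branch weights would be fine-tuned; the base weights stay frozen.
BRANCH_POINT = 2
branch = [w.copy() for w in base[BRANCH_POINT:]]
# Assumption: zero-init the branch output layer so the adapted model
# initially reproduces the speaker-independent output exactly.
branch[-1] = np.zeros_like(branch[-1])

def forward(x):
    # Shared lower layers (frozen).
    h = x
    for w in base[:BRANCH_POINT]:
        h = relu(h @ w)
    # Frozen base upper layers and the parallel, trainable branch.
    h_base, h_branch = h, h
    for w in base[BRANCH_POINT:-1]:
        h_base = relu(h_base @ w)
    for w in branch[:-1]:
        h_branch = relu(h_branch @ w)
    # Assumed combination: the branch models the residual difference
    # between the target speaker and the speaker-independent output.
    return h_base @ base[-1] + h_branch @ branch[-1]

x = rng.standard_normal((4, D_IN))   # a batch of 4 input frames
y = forward(x)
print(y.shape)  # (4, 25)
```

Because only the branch parameters are updated, the number of fine-tuned weights stays small, which is the property the abstract credits with reducing overfitting on small adaptation sets.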


DOI: 10.21437/SSW.2019-9

Cite as: Himawan, I., Aryal, S., Ouyang, I., Ng, S., Lanchantin, P. (2019) Speaker Adaptation of Acoustic Model using a Few Utterances in DNN-based Speech Synthesis Systems. Proc. 10th ISCA Speech Synthesis Workshop, 45-50, DOI: 10.21437/SSW.2019-9.


@inproceedings{Himawan2019,
  author={Ivan Himawan and Sandesh Aryal and Iris Ouyang and Shukhan Ng and Pierre Lanchantin},
  title={{Speaker Adaptation of Acoustic Model using a Few Utterances in DNN-based Speech Synthesis Systems}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={45--50},
  doi={10.21437/SSW.2019-9},
  url={http://dx.doi.org/10.21437/SSW.2019-9}
}