Synthesizing a person’s voice from only a few utterances is a highly desirable feature for personalized text-to-speech systems. This can be achieved by adapting an existing speaker-independent model to a target speaker so that speaker variabilities arising from a mismatch between training and testing conditions are minimized. In deep neural network (DNN) based speech synthesis, directly fine-tuning a large number of parameters is susceptible to over-fitting, especially when the adaptation set is small. In this paper, we present a novel technique for estimating a speaker-specific model from a partial copy of the speaker-independent model, created as a separate parallel branch stemming from an intermediate hidden layer of the base network. This allows the fine-tuning of the speaker-specific model to take into account the difference between the target speaker and the speaker-independent model output. Experimental results show that the proposed adaptation method achieves improved audio quality and higher speaker similarity compared to another DNN speaker adaptation technique.
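The branch-based adaptation idea can be illustrated with a minimal sketch. This is not the paper's exact architecture: it assumes a small feedforward network in which the upper layer is copied into a parallel speaker-dependent branch initialized from the speaker-independent weights, and only the branch parameters are updated on the adaptation data while the shared and speaker-independent weights stay frozen. The class name `BranchAdaptedDNN` and all layer sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(0.0, x)


class BranchAdaptedDNN:
    """Illustrative sketch of branch-based speaker adaptation:
    a speaker-independent (SI) network whose upper layer is copied
    into a parallel speaker-dependent (SD) branch stemming from an
    intermediate hidden layer. Adaptation fine-tunes only the branch."""

    def __init__(self, d_in, d_hid, d_out):
        # Shared lower layer (frozen during adaptation).
        self.W1 = rng.standard_normal((d_in, d_hid)) * 0.1
        # SI upper layer (frozen during adaptation).
        self.W2_si = rng.standard_normal((d_hid, d_out)) * 0.1
        # SD branch: initialized as a copy of the SI upper layer,
        # so adaptation starts from the SI model's behaviour.
        self.W2_sd = self.W2_si.copy()

    def hidden(self, x):
        # Intermediate hidden layer shared by both branches.
        return relu(x @ self.W1)

    def forward_si(self, x):
        # Speaker-independent output (unchanged by adaptation).
        return self.hidden(x) @ self.W2_si

    def forward_sd(self, x):
        # Speaker-dependent output from the parallel branch.
        return self.hidden(x) @ self.W2_sd

    def adapt_step(self, x, y, lr=0.01):
        # One gradient step on the branch weights only; keeping the
        # shared and SI weights fixed limits over-fitting when the
        # adaptation set is small. Returns the current MSE.
        h = self.hidden(x)
        err = h @ self.W2_sd - y
        self.W2_sd -= lr * (h.T @ err) / len(x)
        return float(np.mean(err ** 2))
```

A brief usage example: after a few `adapt_step` calls on target-speaker data, the branch output moves toward the target while the SI output is untouched, which is the property that makes few-utterance adaptation feasible.

```python
net = BranchAdaptedDNN(d_in=4, d_hid=8, d_out=2)
x = rng.standard_normal((16, 4))       # stand-in adaptation inputs
y = rng.standard_normal((16, 2))       # stand-in target-speaker outputs
si_before = net.forward_si(x).copy()
losses = [net.adapt_step(x, y) for _ in range(50)]
```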
Cite as: Himawan, I., Aryal, S., Ouyang, I., Ng, S., Lanchantin, P. (2019) Speaker Adaptation of Acoustic Model using a Few Utterances in DNN-based Speech Synthesis Systems. Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10), 45-50, doi: 10.21437/SSW.2019-9
@inproceedings{himawan19_ssw,
  author={Ivan Himawan and Sandesh Aryal and Iris Ouyang and Shukhan Ng and Pierre Lanchantin},
  title={{Speaker Adaptation of Acoustic Model using a Few Utterances in DNN-based Speech Synthesis Systems}},
  year=2019,
  booktitle={Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10)},
  pages={45--50},
  doi={10.21437/SSW.2019-9}
}