Speaker Representations for Speaker Adaptation in Multiple Speakers’ BLSTM-RNN-Based Speech Synthesis

Yi Zhao, Daisuke Saito, Nobuaki Minematsu


Training a high-quality acoustic model with a limited database and synthesizing a new speaker's voice from only a few utterances have been hot topics in deep neural network (DNN) based statistical parametric speech synthesis (SPSS). To address these problems, we built a unified framework for both speaker adaptive training and speaker adaptation on a bidirectional long short-term memory recurrent neural network (BLSTM-RNN) acoustic model. In this paper, we mainly focus on speaker identity control at the input layer of our framework. We investigated i-vectors and speaker codes as different speaker representations used in an augmented input vector, and also propose two approaches to estimating a new speaker's code. Experimental results show that speaker representations fed into the first layer of the acoustic model can effectively control speaker identity during speaker adaptive training, thus improving the quality of synthesized speech for speakers included in the training phase. For speaker adaptation, a speaker code estimated from MFCCs achieves higher preference scores than the other speaker representations.
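The augmented-input scheme described above, where a fixed-dimensional speaker representation (an i-vector or a speaker code) is appended to the frame-level input of the acoustic model's first layer, can be sketched as follows. This is an illustrative sketch only; the function name and dimensionalities are assumptions, not taken from the paper.

```python
import numpy as np

def augment_with_speaker_code(linguistic_feats, speaker_code):
    """Append a per-speaker representation to every input frame.

    linguistic_feats: (T, D) array of frame-level linguistic features
    speaker_code: (S,) speaker representation (e.g. i-vector or speaker code)
    Returns a (T, D + S) array used as the acoustic model's input.
    """
    num_frames = linguistic_feats.shape[0]
    # The speaker representation is constant over an utterance,
    # so it is simply repeated for each frame before concatenation.
    tiled = np.tile(speaker_code, (num_frames, 1))
    return np.concatenate([linguistic_feats, tiled], axis=1)

# Example with hypothetical sizes: 5 frames of 10-dim features, 3-dim code.
feats = np.zeros((5, 10))
code = np.array([0.1, -0.2, 0.3])
aug = augment_with_speaker_code(feats, code)
print(aug.shape)  # (5, 13)
```

At synthesis time for an unseen speaker, only the appended code needs to be re-estimated (e.g. from MFCCs, as the paper proposes) while the network weights stay fixed.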


DOI: 10.21437/Interspeech.2016-506

Cite as

Zhao, Y., Saito, D., Minematsu, N. (2016) Speaker Representations for Speaker Adaptation in Multiple Speakers’ BLSTM-RNN-Based Speech Synthesis. Proc. Interspeech 2016, 2268-2272.

Bibtex
@inproceedings{Zhao+2016,
  author={Yi Zhao and Daisuke Saito and Nobuaki Minematsu},
  title={Speaker Representations for Speaker Adaptation in Multiple Speakers' BLSTM-RNN-Based Speech Synthesis},
  year=2016,
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-506},
  url={http://dx.doi.org/10.21437/Interspeech.2016-506},
  pages={2268--2272}
}