Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis

Xixin Wu, Yuewen Cao, Mu Wang, Songxiang Liu, Shiyin Kang, Zhiyong Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng

Synthesizing expressive speech with appropriate prosodic variation, e.g., in various styles, still has much room for improvement. Previous methods have explored using manual annotations as conditioning attributes to provide variation information. However, the related training data are expensive to obtain, and the annotated style codes can be ambiguous and unreliable. In this paper, we explore utilizing the residual error as a conditioning attribute. The residual error is the difference between the prediction of a trained average model and the ground truth. We encode the residual error into a style embedding via a neural-network-based error encoder. The embedding is then fed to the target synthesis model to provide information for modeling various style distributions more accurately. The average model and the error encoder are jointly optimized with the target synthesis model. Our proposed method has two advantages: 1) the embedding is learned automatically, without the need for manual annotations, which helps overcome data sparsity and ambiguity limitations; 2) for any unseen audio utterance, the style embedding can be generated efficiently, enabling rapid adaptation to the desired style with only one adaptation utterance. Experimental results show that our method outperforms the baseline in speech quality and style similarity.
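The core idea described above — compute the residual between an average model's prediction and the ground truth, then encode it into a fixed-size style embedding — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual implementation: the `average_model` stand-in, the two-layer encoder, the mean-pooling over frames, and all dimensions and weight names are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def average_model(linguistic_feats):
    # Stand-in for a trained average acoustic model mapping linguistic
    # features (T x D_in) to acoustic features (T x D_out).
    # Hypothetical fixed linear map, for illustration only.
    W = np.full((linguistic_feats.shape[1], 8), 0.1)
    return linguistic_feats @ W

def error_encoder(residual, W1, W2):
    # Encode the frame-level residual error (T x D_out) into one
    # fixed-size style embedding: frame-wise nonlinearity, then
    # temporal mean-pooling and a projection (assumed architecture).
    h = np.tanh(residual @ W1)           # frame-wise hidden layer
    return np.tanh(h.mean(axis=0) @ W2)  # pool over time, project

T, D_in, D_emb = 50, 16, 4
linguistic = rng.normal(size=(T, D_in))
# Synthetic "ground truth": average prediction plus style-dependent deviation.
ground_truth = average_model(linguistic) + 0.3 * rng.normal(size=(T, 8))

# Residual error: the difference between the average model's
# prediction and the ground truth, as defined in the abstract.
residual = average_model(linguistic) - ground_truth

W1 = rng.normal(scale=0.5, size=(8, 16))
W2 = rng.normal(scale=0.5, size=(16, D_emb))
style_embedding = error_encoder(residual, W1, W2)
print(style_embedding.shape)  # (4,)
```

At synthesis time, a single adaptation utterance would be passed through this pipeline once to obtain the embedding, which then conditions the target synthesis model — which is what makes the adaptation rapid.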

DOI: 10.21437/Interspeech.2018-1991

Cite as: Wu, X., Cao, Y., Wang, M., Liu, S., Kang, S., Wu, Z., Liu, X., Su, D., Yu, D., Meng, H. (2018) Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis. Proc. Interspeech 2018, 3072-3076, DOI: 10.21437/Interspeech.2018-1991.

@inproceedings{wu2018rapid,
  author={Xixin Wu and Yuewen Cao and Mu Wang and Songxiang Liu and Shiyin Kang and Zhiyong Wu and Xunying Liu and Dan Su and Dong Yu and Helen Meng},
  title={Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis},
  booktitle={Proc. Interspeech 2018},
  pages={3072--3076},
  year={2018},
  doi={10.21437/Interspeech.2018-1991}
}