Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis

Xiao Zhou, Zhen-Hua Ling, Zhi-Ping Zhou, Li-Rong Dai


This paper presents a method of learning and modeling unit embeddings using deep neural networks (DNNs) to improve the performance of HMM-based unit selection speech synthesis. First, a DNN with an embedding layer is built to learn, from scratch, a fixed-length embedding vector for each phone-sized candidate unit in the corpus. Then, two further DNNs are constructed to map linguistic features to the extracted unit vector of each phone. One of them also takes the unit vectors of preceding phones as model input. At synthesis time, the L2 distances between the unit vectors predicted by these two DNNs and those derived from candidate units are integrated into the target cost and the concatenation cost of HMM-based unit selection speech synthesis, respectively. Experimental results demonstrate that the unit vectors estimated using only acoustic features display phone-dependent clustering properties. Furthermore, integrating unit vector distances into the cost functions, especially the concatenation cost, improves the naturalness of HMM-based unit selection speech synthesis in our experiments.
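
The cost integration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states only that L2 distances are integrated into the costs, so the weighted-sum form, the `weight` parameter, and all function names here are assumptions made for illustration. The sketch also scores a single target position in isolation, whereas a real unit selection system searches over whole sequences of units.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.dist(u, v)


def augmented_cost(base_cost, predicted_vec, candidate_vec, weight=1.0):
    """Add a weighted unit-vector distance to an existing cost.

    Assumption: the distance enters as a weighted additive term.
    The abstract does not specify the exact form of the integration.
    """
    return base_cost + weight * l2_distance(predicted_vec, candidate_vec)


def pick_best_candidate(base_costs, predicted_vec, candidate_vecs, weight=1.0):
    """Return (index of lowest-cost candidate, list of all augmented costs)."""
    totals = [
        augmented_cost(cost, predicted_vec, vec, weight)
        for cost, vec in zip(base_costs, candidate_vecs)
    ]
    best = min(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Hypothetical toy data: one target phone with two candidate units.
predicted = [0.0, 0.0]                  # unit vector predicted by a DNN
candidates = [[3.0, 4.0], [0.0, 1.0]]   # unit vectors of candidate units
base = [1.0, 2.0]                       # pre-existing costs of the candidates

best_index, totals = pick_best_candidate(base, predicted, candidates)
```

With these toy values, the first candidate has the lower pre-existing cost, but its unit vector lies far from the predicted one, so the second candidate is selected once the distance term is added. Setting `weight=0.0` recovers the choice made on the pre-existing costs alone.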


DOI: 10.21437/Interspeech.2018-1198

Cite as: Zhou, X., Ling, Z., Zhou, Z., Dai, L. (2018) Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis. Proc. Interspeech 2018, 2509-2513, DOI: 10.21437/Interspeech.2018-1198.


@inproceedings{Zhou2018,
  author={Xiao Zhou and Zhen-Hua Ling and Zhi-Ping Zhou and Li-Rong Dai},
  title={Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2509--2513},
  doi={10.21437/Interspeech.2018-1198},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1198}
}