The Parameterized Phoneme Identity Feature as a Continuous Real-Valued Vector for Neural Network Based Speech Synthesis

Zhengqi Wen, Ya Li, Jianhua Tao


In speech synthesis systems, the phoneme identity feature, which indicates the pronunciation unit, is influenced by external context such as the neighboring words and phonemes. This paper proposes to encode this relatedness by parameterizing the phoneme identity feature as a continuous real-valued vector. The vector, composed of a phoneme embedded vector (PEV) and a word embedded vector (WEV), replaces the one-hot binary vector. It is obtained from a word embedding model with a joint training structure in which the PEV and WEV are learned together. The proposed technique was evaluated by comparing it with the binary vector in bidirectional long short-term memory recurrent neural network (BLSTM-RNN) based speech synthesis systems. The proposed system improved the quality of the synthesized speech, which demonstrates the effectiveness of replacing the binary vector with the continuous real-valued vector in describing the phoneme identity feature.
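As an illustration of the two representations contrasted in the abstract, the following minimal sketch builds a one-hot phoneme identity vector and a continuous vector formed by concatenating a PEV with a WEV. It is not the authors' implementation: the phoneme and word inventories, the embedding dimensions, the function names, and the random embedding tables (stand-ins for jointly learned embeddings) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inventories; the abstract does not specify the real ones.
phonemes = ["a", "b", "k", "s", "t"]
words = ["cat", "sat", "bat"]

# Assumed embedding sizes, chosen only for illustration.
PEV_DIM = 4
WEV_DIM = 6

# Random stand-ins for embedding tables that would be learned jointly.
pev_table = rng.standard_normal((len(phonemes), PEV_DIM))
wev_table = rng.standard_normal((len(words), WEV_DIM))


def one_hot_identity(phoneme):
    """Baseline: binary one-hot vector over the phoneme inventory."""
    vec = np.zeros(len(phonemes))
    vec[phonemes.index(phoneme)] = 1.0
    return vec


def continuous_identity(phoneme, word):
    """Continuous alternative: phoneme embedding concatenated with the
    embedding of the word that contains the phoneme."""
    pev = pev_table[phonemes.index(phoneme)]
    wev = wev_table[words.index(word)]
    return np.concatenate([pev, wev])


baseline = one_hot_identity("t")
in_cat = continuous_identity("t", "cat")
in_sat = continuous_identity("t", "sat")
print(baseline.shape, in_cat.shape)
```

In this sketch the same phoneme receives a different vector depending on the word it occurs in, because the WEV part changes while the PEV part stays fixed. The one-hot vector cannot express that kind of context dependence.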


DOI: 10.21437/Interspeech.2016-222

Cite as

Wen, Z., Li, Y., Tao, J. (2016) The Parameterized Phoneme Identity Feature as a Continuous Real-Valued Vector for Neural Network Based Speech Synthesis. Proc. Interspeech 2016, 2248-2252.

BibTeX
@inproceedings{Wen+2016,
author={Zhengqi Wen and Ya Li and Jianhua Tao},
title={The Parameterized Phoneme Identity Feature as a Continuous Real-Valued Vector for Neural Network Based Speech Synthesis},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-222},
url={http://dx.doi.org/10.21437/Interspeech.2016-222},
pages={2248--2252}
}