Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion

Huaiping Ming, Dongyan Huang, Lei Xie, Jie Wu, Minghui Dong, Haizhou Li


Emotional voice conversion aims to convert speech from one emotional state to another. This paper proposes modeling timbre and prosody features with a deep bidirectional long short-term memory (DBLSTM) network for emotional voice conversion. Continuous wavelet transform (CWT) representations of the fundamental frequency (F0) and the energy contour are used for prosody modeling. Specifically, we use CWT to decompose F0 into a five-scale representation and the energy contour into a ten-scale representation, where each feature scale corresponds to a temporal scale. Both spectrum and prosody (F0 and energy contour) features are converted simultaneously by a sequence-to-sequence conversion method with a DBLSTM model, which captures both frame-wise and long-range relationships between source and target voices. The converted speech signals are evaluated both objectively and subjectively, and the results confirm the effectiveness of the proposed method.
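
The multi-scale prosody decomposition described above can be illustrated with a minimal sketch. The abstract does not specify the mother wavelet, the scale spacing, or any normalization, so the Mexican hat wavelet, the dyadic scales, and the mean removal below are assumptions made only for illustration, and the names `mexican_hat` and `cwt_decompose` are hypothetical. The input contours are synthetic, not data from the paper.

```python
import math


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet; an assumed choice of wavelet."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - t * t) * math.exp(-0.5 * t * t)


def cwt_decompose(contour, n_scales, base_scale=1.0):
    """Decompose a 1-D contour into n_scales components, one per scale.

    Scale i is base_scale * 2**(i + 1) frames (assumed dyadic spacing).
    Returns a list of n_scales rows, each as long as the input contour.
    """
    n = len(contour)
    mean = sum(contour) / n
    centered = [x - mean for x in contour]  # assumed: remove the mean first
    components = []
    for i in range(n_scales):
        s = base_scale * 2.0 ** (i + 1)
        row = []
        for b in range(n):
            # Direct-sum correlation of the contour with the scaled wavelet.
            acc = 0.0
            for k in range(n):
                acc += centered[k] * mexican_hat((k - b) / s)
            row.append(acc / math.sqrt(s))
        components.append(row)
    return components


# Synthetic stand-ins for a log-F0 contour and an energy contour.
n_frames = 60
f0 = [5.0 + 0.2 * math.sin(2.0 * math.pi * t / 40.0) for t in range(n_frames)]
energy = [1.0 + 0.5 * math.cos(2.0 * math.pi * t / 25.0) for t in range(n_frames)]

f0_scales = cwt_decompose(f0, n_scales=5)           # five-scale F0 representation
energy_scales = cwt_decompose(energy, n_scales=10)  # ten-scale energy representation

# Per-frame prosody vector: the 5 F0 values followed by the 10 energy values.
frame0 = [row[0] for row in f0_scales + energy_scales]
```

Each row of the result tracks variation at one temporal scale, so small scales follow fast local movement and large scales follow slow phrase-level trends. In a conversion system, per-frame vectors such as `frame0` would be stacked with spectral features before being passed to the sequence model; that pipeline is not shown here.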


DOI: 10.21437/Interspeech.2016-1053

Cite as

Ming, H., Huang, D., Xie, L., Wu, J., Dong, M., Li, H. (2016) Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion. Proc. Interspeech 2016, 2453-2457.

Bibtex
@inproceedings{Ming+2016,
author={Huaiping Ming and Dongyan Huang and Lei Xie and Jie Wu and Minghui Dong and Haizhou Li},
title={Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1053},
url={http://dx.doi.org/10.21437/Interspeech.2016-1053},
pages={2453--2457}
}