This paper studies post-processing in deep bidirectional Long Short-Term Memory (DBLSTM) based voice conversion, where statistical parameters are optimized to generate speech that exhibits properties similar to the target speech. However, residual error always remains between the converted speech and the target speech. We reformulate the residual-error problem as speech restoration, which aims to recover the target speech samples from the converted ones. Specifically, we propose a denoising recurrent neural network (DeRNN) that introduces regularization during training to shape the distribution of the converted data in latent space. We compare the proposed approach with global variance (GV), modulation spectrum (MS) and recurrent neural network (RNN) based postfilters, which serve a similar purpose. Subjective test results show that the proposed approach significantly outperforms these conventional approaches in terms of quality and similarity.
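The residual-error view described above can be illustrated with a toy example (NumPy only, all data synthetic): converted features differ from target features by a structured residual, and a postfilter is fit to remove it. Note this is merely a linear least-squares stand-in for the paper's DeRNN postfilter, not the authors' method; all names and dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "target" spectral features: T frames x D dimensions (hypothetical sizes).
T, D = 200, 8
target = rng.standard_normal((T, D))

# "Converted" speech = target plus a structured residual error
# (a fixed linear distortion plus noise), mimicking conversion artifacts.
A = np.eye(D) + 0.3 * rng.standard_normal((D, D))
converted = target @ A + 0.1 * rng.standard_normal((T, D))

# Postfilter as restoration: fit W so that converted @ W approximates target.
# (A linear stand-in for the DeRNN's learned mapping from converted to target speech.)
W, *_ = np.linalg.lstsq(converted, target, rcond=None)
restored = converted @ W

mse_before = np.mean((converted - target) ** 2)
mse_after = np.mean((restored - target) ** 2)
print(mse_after < mse_before)  # restoration reduces the residual error
```

The same idea scales to the paper's setting by replacing the linear map with a recurrent network trained on (converted, target) feature pairs.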
Cite as: Wu, J., Huang, D.-Y., Xie, L., Li, H. (2017) Denoising Recurrent Neural Network for Deep Bidirectional LSTM Based Voice Conversion. Proc. Interspeech 2017, 3379-3383, doi: 10.21437/Interspeech.2017-694
@inproceedings{wu17f_interspeech,
  author={Jie Wu and D.-Y. Huang and Lei Xie and Haizhou Li},
  title={{Denoising Recurrent Neural Network for Deep Bidirectional LSTM Based Voice Conversion}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={3379--3383},
  doi={10.21437/Interspeech.2017-694}
}