Segment Level Voice Conversion with Recurrent Neural Networks

Miguel Varela Ramos, Alan W. Black, Ramon Fernandez Astudillo, Isabel Trancoso, Nuno Fonseca


Voice conversion techniques aim to modify a speaker's voice characteristics in order to mimic those of another person. Because source and target utterances differ in length, state-of-the-art voice conversion systems often rely on a frame-alignment pre-processing step. This step aligns entire utterances with algorithms such as dynamic time warping (DTW), which introduce errors that hinder system performance. In this paper we present a new technique that avoids frame-level alignment of entire utterances while preserving local context during training. For this purpose, we combine an RNN model with phoneme- or syllable-level information obtained from a speech recognition system. This system splits the utterances into segments, which can then be grouped into overlapping windows, providing the context the model needs to learn temporal dependencies. We show that with this approach, notable improvements can be attained over a state-of-the-art RNN voice conversion system on the CMU ARCTIC database. It is also worth noting that with this technique it is possible to halve the training data size and still outperform the baseline.
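The abstract describes grouping recognizer-produced segments into overlapping windows so the RNN sees local temporal context without full-utterance frame alignment. A minimal sketch of that grouping step is below; the function name, the `window_size`, and the `hop` parameters are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: group phoneme- or syllable-level segments into
# overlapping windows. A hop smaller than the window size yields overlap,
# so neighboring windows share segments and provide local context.
def overlapping_windows(segments, window_size=3, hop=1):
    """Return consecutive windows of `window_size` segments, stepping by `hop`.

    `segments` is a list of per-segment feature sequences (e.g. one entry
    per phoneme, as produced by a speech recognition system).
    """
    windows = []
    for start in range(0, max(len(segments) - window_size + 1, 1), hop):
        windows.append(segments[start:start + window_size])
    return windows

# Example: six phoneme segments, windows of three segments, hop of one.
segs = ["p1", "p2", "p3", "p4", "p5", "p6"]
print(overlapping_windows(segs))
```

Each window would then be fed to the RNN as one training sequence, so the model learns temporal dependencies within a window rather than across a DTW-aligned utterance pair.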


DOI: 10.21437/Interspeech.2017-1538

Cite as: Ramos, M.V., Black, A.W., Astudillo, R.F., Trancoso, I., Fonseca, N. (2017) Segment Level Voice Conversion with Recurrent Neural Networks. Proc. Interspeech 2017, 3414-3418, DOI: 10.21437/Interspeech.2017-1538.


@inproceedings{Ramos2017,
  author={Miguel Varela Ramos and Alan W. Black and Ramon Fernandez Astudillo and Isabel Trancoso and Nuno Fonseca},
  title={Segment Level Voice Conversion with Recurrent Neural Networks},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3414--3418},
  doi={10.21437/Interspeech.2017-1538},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1538}
}