Whether to Pretrain DNN or not?: An Empirical Analysis for Voice Conversion

Nirmesh J. Shah, Hardik B. Sailor, Hemant A. Patil


Recently, Deep Neural Network (DNN)-based Voice Conversion (VC) techniques have become popular in the VC literature. These techniques suffer from overfitting due to the limited amount of training data available from a target speaker. To alleviate this, pre-training is used for better initialization of the DNN parameters, which leads to faster convergence. Greedy layerwise pre-training of a stacked Restricted Boltzmann Machine (RBM) or a stacked De-noising AutoEncoder (DAE) is used with extra available speaker pairs' data. This pre-training is time-consuming and requires a separate network to learn the parameters. In this work, we analyze DNN training strategies for the VC task, specifically with and without pre-training. In particular, we investigate whether the extra pre-training step can be avoided by using recent advances in deep learning. The VC experiments were performed on two Voice Conversion Challenge databases, VCC 2016 and VCC 2018. Objective and subjective tests show that a DNN trained with Adam optimization and the Exponential Linear Unit (ELU) performed comparably to or better than the pre-trained DNN, without compromising the speech quality and speaker similarity of the converted voices.
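As a minimal illustration of the pre-training-free recipe the abstract describes (random initialization, ELU activations, Adam updates), the sketch below trains a toy one-hidden-layer mapping network on synthetic source-to-target feature frames. This is not the authors' implementation; all names, dimensions, and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0, alpha*(exp(x)-1) otherwise."""
    return np.where(x > 0, x, alpha * np.expm1(x))

def elu_grad(x, alpha=1.0):
    """Derivative of ELU with respect to its input."""
    return np.where(x > 0, 1.0, alpha * np.exp(x))

class Adam:
    """Adam update for a single parameter array (Kingma & Ba, 2015)."""
    def __init__(self, shape, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(shape)   # first-moment (mean) estimate
        self.v = np.zeros(shape)   # second-moment (uncentered variance) estimate
        self.t = 0                 # time step for bias correction

    def step(self, param, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Toy source/target spectral frames (dimensions are illustrative only).
D = 24                                        # e.g., cepstral coefficients per frame
X = rng.standard_normal((200, D))             # "source speaker" features
T = X @ rng.standard_normal((D, D)) * 0.1     # synthetic "target speaker" features

# One hidden ELU layer, randomly initialized -- no RBM/DAE pre-training.
H = 64
W1 = rng.standard_normal((D, H)) * 0.1; b1 = np.zeros(H)
W2 = rng.standard_normal((H, D)) * 0.1; b2 = np.zeros(D)
opts = {name: Adam(p.shape)
        for name, p in [("W1", W1), ("b1", b1), ("W2", W2), ("b2", b2)]}

losses = []
for epoch in range(200):
    # Forward pass: source frames -> predicted target frames.
    Z1 = X @ W1 + b1
    A1 = elu(Z1)
    Y = A1 @ W2 + b2
    losses.append(np.mean((Y - T) ** 2))
    # Backpropagation of the mean-squared conversion error.
    dY = 2 * (Y - T) / Y.size
    dW2, db2 = A1.T @ dY, dY.sum(0)
    dZ1 = (dY @ W2.T) * elu_grad(Z1)
    dW1, db1 = X.T @ dZ1, dZ1.sum(0)
    W1 = opts["W1"].step(W1, dW1); b1 = opts["b1"].step(b1, db1)
    W2 = opts["W2"].step(W2, dW2); b2 = opts["b2"].step(b2, db2)

print(f"MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The point of the paper's comparison is that this directly trained network converges well without the separate RBM/DAE stage, because Adam's per-parameter adaptive steps and ELU's negative saturation region mitigate the poor conditioning that pre-training was originally introduced to solve.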


 DOI: 10.21437/Interspeech.2019-2608

Cite as: Shah, N.J., Sailor, H.B., Patil, H.A. (2019) Whether to Pretrain DNN or not?: An Empirical Analysis for Voice Conversion. Proc. Interspeech 2019, 1586-1590, DOI: 10.21437/Interspeech.2019-2608.


@inproceedings{Shah2019,
  author={Nirmesh J. Shah and Hardik B. Sailor and Hemant A. Patil},
  title={{Whether to Pretrain DNN or not?: An Empirical Analysis for Voice Conversion}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1586--1590},
  doi={10.21437/Interspeech.2019-2608},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2608}
}