Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion

Runnan Li, Zhiyong Wu, Yishuang Ning, Lifa Sun, Helen Meng, Lianhong Cai


In speech, speaker identity is largely characterized by the spectro-temporal structure of the spectrum. Although recent research has demonstrated the effectiveness of long short-term memory (LSTM) recurrent neural networks (RNNs) in voice conversion, traditional LSTM-RNN based approaches usually focus only on the temporal evolution of speech features. In this paper, we improve the conventional LSTM-RNN method for voice conversion by employing a two-dimensional time-frequency LSTM (TFLSTM) to model spectro-temporal warping along both the time and frequency axes. A multi-task learned structured output layer (SOL) is then adopted to capture the dependencies between spectral and pitch parameters for further improvement, with the spectral parameter targets conditioned on the pitch parameter predictions. Experimental results show that the proposed approach outperforms conventional systems in speech quality and speaker similarity.


DOI: 10.21437/Interspeech.2017-1122

Cite as: Li, R., Wu, Z., Ning, Y., Sun, L., Meng, H., Cai, L. (2017) Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion. Proc. Interspeech 2017, 3409-3413, DOI: 10.21437/Interspeech.2017-1122.


@inproceedings{Li2017,
  author={Runnan Li and Zhiyong Wu and Yishuang Ning and Lifa Sun and Helen Meng and Lianhong Cai},
  title={Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3409--3413},
  doi={10.21437/Interspeech.2017-1122},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1122}
}