LSTM based voice conversion for laryngectomees

Luis Serrano, David Tavarez, Xabier Sarasola, Sneha Raman, Ibon Saratxaga, Eva Navas, Inma Hernaez

This paper describes a voice conversion system designed to improve the intelligibility and pleasantness of oesophageal voices. Two systems have been built, one to transform the spectral magnitude and another for the fundamental frequency (F0), both based on DNNs. Ahocoder is used to extract the spectral information (mel cepstral coefficients), and a dedicated pitch extractor has been developed to calculate the fundamental frequency of the oesophageal voices. The cepstral coefficients are converted by means of an LSTM network. The conversion of the intonation curve is implemented through two different LSTM networks, one dedicated to voiced/unvoiced detection and another to the prediction of F0 from the converted cepstral coefficients. The experiments described here involve conversion from one oesophageal speaker to a specific healthy voice. The intelligibility of the signals has been measured with a Kaldi-based ASR system. A preference test has been carried out to compare the converted voices with the original oesophageal voice. The results show that spectral conversion improves ASR performance, while human listeners prefer the signals in which the intonation has been restored.
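
The data flow summarised above (one LSTM for the cepstral coefficients, then two LSTMs for voicing and F0) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `TinyLSTM` class, the `convert` function, the layer sizes and the untrained random weights are all hypothetical, and the abstract does not state whether the voicing detector reads the converted cepstra or whether F0 is predicted in the log domain, so both are assumptions here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyLSTM:
    """Single-layer LSTM followed by a linear output layer.

    Hypothetical stand-in with random, untrained weights; it only
    illustrates the frame-by-frame recurrence.
    """

    def __init__(self, n_in, n_hidden, n_out, rng):
        # One stacked weight matrix for the input, forget, cell and output gates.
        self.W = rng.normal(0.0, 0.1, (4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        self.W_out = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.b_out = np.zeros(n_out)
        self.n_hidden = n_hidden

    def __call__(self, frames):
        h = np.zeros(self.n_hidden)  # hidden state
        c = np.zeros(self.n_hidden)  # cell state
        outputs = []
        for x in frames:  # one acoustic frame per time step
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            outputs.append(self.W_out @ h + self.b_out)
        return np.stack(outputs)


def convert(source_mcep, mcep_net, vuv_net, f0_net):
    """Map source cepstra of shape (frames, coeffs) to converted cepstra and an F0 curve."""
    # Step 1: spectral conversion of the mel cepstral coefficients.
    converted_mcep = mcep_net(source_mcep)
    # Step 2: voiced/unvoiced decision per frame.
    # Assumption: the detector takes the converted cepstra as input.
    voiced = sigmoid(vuv_net(converted_mcep)[:, 0]) > 0.5
    # Step 3: F0 prediction from the converted cepstra.
    # Assumption: the network outputs log-F0.
    log_f0 = f0_net(converted_mcep)[:, 0]
    # Unvoiced frames are marked with F0 = 0.
    f0 = np.where(voiced, np.exp(log_f0), 0.0)
    return converted_mcep, f0
```

In a real system each network would be trained on time-aligned source and target features; the sketch only shows how the output of the spectral network feeds the two intonation networks.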

DOI: 10.21437/IberSPEECH.2018-26

Cite as: Serrano, L., Tavarez, D., Sarasola, X., Raman, S., Saratxaga, I., Navas, E., Hernaez, I. (2018) LSTM based voice conversion for laryngectomees. Proc. IberSPEECH 2018, 122-126, DOI: 10.21437/IberSPEECH.2018-26.

  author={Luis Serrano and David Tavarez and Xabier Sarasola and Sneha Raman and Ibon Saratxaga and Eva Navas and Inma Hernaez},
  title={{LSTM based voice conversion for laryngectomees}},
  booktitle={Proc. IberSPEECH 2018},