Waveform Modeling Using Stacked Dilated Convolutional Neural Networks for Speech Bandwidth Extension

Yu Gu, Zhen-Hua Ling


This paper presents a waveform modeling and generation method for speech bandwidth extension (BWE) using stacked dilated convolutional neural networks (CNNs) with causal or non-causal convolutional layers. These dilated CNNs describe the predictive distribution of each wideband or high-frequency speech sample conditioned on the input narrowband speech samples. Unlike conventional frame-based BWE approaches, the proposed methods model speech waveforms directly and therefore avoid the spectral conversion and phase estimation problems. In subjective preference tests, the proposed BWE methods outperform a state-of-the-art frame-based approach built on recurrent neural networks (RNNs) with long short-term memory (LSTM) cells.
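As a rough illustration of the architecture the abstract describes (not the authors' implementation, whose layer counts and filter widths are assumptions here), a stack of dilated causal convolutions lets each output sample depend on a long window of past input samples: with kernel size 2 and dilations doubling per layer, the receptive field grows exponentially with depth. A minimal sketch:

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """One dilated causal convolution layer: the output at time t depends
    only on x[t], x[t - dilation], ... (current and past samples)."""
    k = len(w)
    pad = (k - 1) * dilation                 # left-padding preserves causality
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
                     for t in range(len(x))])

def receptive_field(kernel_size, dilations):
    """Number of input samples visible to a stack of dilated conv layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Hypothetical WaveNet-style stack: kernel size 2, dilations 1, 2, 4, ..., 512
dilations = [2 ** i for i in range(10)]
print(receptive_field(2, dilations))         # → 1024
```

With ten such layers the model sees 1024 past samples per prediction, which is why stacked dilation is attractive for sample-level waveform modeling; replacing the left-padding with symmetric padding would give the non-causal variant mentioned in the abstract.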


DOI: 10.21437/Interspeech.2017-336

Cite as: Gu, Y., Ling, Z. (2017) Waveform Modeling Using Stacked Dilated Convolutional Neural Networks for Speech Bandwidth Extension. Proc. Interspeech 2017, 1123-1127, DOI: 10.21437/Interspeech.2017-336.


@inproceedings{Gu2017,
  author={Yu Gu and Zhen-Hua Ling},
  title={Waveform Modeling Using Stacked Dilated Convolutional Neural Networks for Speech Bandwidth Extension},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1123--1127},
  doi={10.21437/Interspeech.2017-336},
  url={http://dx.doi.org/10.21437/Interspeech.2017-336}
}