Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

Awni Hannun, Ann Lee, Qiantong Xu, Ronan Collobert


We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient than a strong RNN baseline. Key to our approach is a time-depth separable convolution block which dramatically reduces the number of parameters in the model while keeping the receptive field large. We also give a stable and efficient beam search inference procedure which allows us to effectively integrate a language model. Coupled with a convolutional language model, our time-depth separable convolution architecture improves by more than 22% relative WER over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set.
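The parameter savings claimed for the time-depth separable convolution can be illustrated with a rough count. The sketch below is not the authors' code. It assumes the input at each time step is viewed as w feature positions with c channels each, and that the separable convolution uses a k x 1 kernel over time that mixes only the c channels. The baseline is an ordinary 1D convolution over time that treats all w*c features as channels. The function names and the values of k, w, and c are hypothetical. Biases and any other layers in the block are left out of the count.

```python
def full_conv1d_params(k, w, c):
    """Weights of a 1D convolution over time with w*c input and output channels."""
    d = w * c
    return k * d * d


def time_only_conv_params(k, c):
    """Weights of a k x 1 convolution over time that mixes only the c channels."""
    return k * c * c


# Hypothetical sizes chosen only for illustration (not taken from the paper).
k, w, c = 21, 80, 10

full = full_conv1d_params(k, w, c)
separable = time_only_conv_params(k, c)

# Both layers span k time steps, so the receptive field in time is the same.
print(f"full 1D conv weights:   {full}")
print(f"time-only conv weights: {separable}")
print(f"reduction factor:       {full // separable}")
```

Under these assumptions the weight count of the convolution drops by a factor of w squared, while the kernel still covers k time steps.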


DOI: 10.21437/Interspeech.2019-2460

Cite as: Hannun, A., Lee, A., Xu, Q., Collobert, R. (2019) Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions. Proc. Interspeech 2019, 3785-3789, DOI: 10.21437/Interspeech.2019-2460.


@inproceedings{Hannun2019,
  author={Awni Hannun and Ann Lee and Qiantong Xu and Ronan Collobert},
  title={{Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3785--3789},
  doi={10.21437/Interspeech.2019-2460},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2460}
}