The convolution-augmented transformer (conformer) has recently shown competitive results in speech-domain applications such as automatic speech recognition, continuous speech separation, and sound event detection. The conformer captures both short- and long-term temporal information in a sequence by combining multi-head self-attention, which attends to the whole sequence at once, with convolutional neural networks. However, its effectiveness in speech enhancement has not been demonstrated. In this paper, we propose an end-to-end speech enhancement architecture (SE-Conformer) that incorporates a convolutional encoder–decoder and conformer and is designed to be applied directly to the time-domain signal. We evaluated the model on both the VoiceBank-DEMAND Corpus (VCTK) and Librispeech datasets in terms of objective speech quality metrics. The experimental results show that the proposed model outperforms other competitive baselines in speech enhancement performance.
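To make the idea of pairing self-attention (long-range context) with convolution (short-range context) concrete, here is a minimal, illustrative sketch in plain Python. It is not the SE-Conformer implementation: it assumes a single attention head with no learned projections, a fixed smoothing kernel, and it leaves out the feed-forward modules, normalization, and the convolutional encoder–decoder. The function names (`self_attention`, `depthwise_conv`, `conformer_like_block`) are hypothetical and chosen only for this example.

```python
import math


def self_attention(x):
    """Single-head dot-product self-attention over a list of frames.

    Assumption: queries, keys and values are the inputs themselves
    (no learned projections), so every output frame is a weighted
    average of all input frames.
    """
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi / z * v[j] for wi, v in zip(w, x)) for j in range(d)])
    return out


def depthwise_conv(x, kernel):
    """Depthwise 1-D convolution along time with zero padding.

    The same odd-length kernel is applied to every feature channel
    independently, so each output frame only sees nearby frames.
    """
    T, d = len(x), len(x[0])
    half = len(kernel) // 2
    out = []
    for t in range(T):
        frame = [0.0] * d
        for i, w in enumerate(kernel):
            s = t + i - half
            if 0 <= s < T:
                for j in range(d):
                    frame[j] += w * x[s][j]
        out.append(frame)
    return out


def add(x, y):
    """Element-wise sum of two sequences of frames (residual connection)."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def conformer_like_block(x, kernel=(0.25, 0.5, 0.25)):
    """Attention for global context, then convolution for local context."""
    x = add(x, self_attention(x))
    x = add(x, depthwise_conv(x, kernel))
    return x


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = conformer_like_block(frames)
```

The output `mixed` has the same number of frames and features as the input; each frame has been mixed first with every other frame (attention) and then with its immediate neighbours (convolution).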
Cite as: Kim, E., Seo, H. (2021) SE-Conformer: Time-Domain Speech Enhancement Using Conformer. Proc. Interspeech 2021, 2736-2740, doi: 10.21437/Interspeech.2021-2207
@inproceedings{kim21h_interspeech,
  author={Eesung Kim and Hyeji Seo},
  title={{SE-Conformer: Time-Domain Speech Enhancement Using Conformer}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={2736--2740},
  doi={10.21437/Interspeech.2021-2207}
}