ISCA Archive Interspeech 2021

Investigation of Practical Aspects of Single Channel Speech Separation for ASR

Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li

Speech separation has been successfully applied as a front-end processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR). However, a speech separation model often introduces target speech distortion, resulting in a suboptimal word error rate (WER). In this paper, we describe our efforts to improve the performance of a single-channel speech separation system. Specifically, we investigate a two-stage training scheme that first applies a feature-level optimization criterion for pre-training, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model. Meanwhile, to keep the model lightweight, we introduce a modified teacher-student learning technique for model compression. By combining these approaches, we achieve an absolute average WER improvement of 2.70% and 0.77% using models with fewer than 10M parameters compared with the previous state-of-the-art results on the LibriCSS dataset for utterance-wise evaluation and continuous evaluation, respectively.
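For readers unfamiliar with the metric, the following is a minimal sketch of how WER and an absolute WER improvement are commonly computed. It is not the evaluation code used in the paper: the function names `word_error_rate` and `absolute_improvement` are hypothetical, the scoring assumes plain whitespace tokenization with no text normalization, and the example values are illustrative only, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_improvement(baseline_wer: float, new_wer: float) -> float:
    """Absolute improvement is a plain difference, not a ratio."""
    return baseline_wer - new_wer


# Illustrative values only: one substitution and one deletion
# against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
gain = absolute_improvement(0.20, 0.15)
```

An absolute improvement is the difference between two WER values in percentage points, which is how the 2.70% and 0.77% figures above are stated; a relative improvement would instead divide that difference by the baseline WER.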


doi: 10.21437/Interspeech.2021-921

Cite as: Wu, J., Chen, Z., Chen, S., Wu, Y., Yoshioka, T., Kanda, N., Liu, S., Li, J. (2021) Investigation of Practical Aspects of Single Channel Speech Separation for ASR. Proc. Interspeech 2021, 3066-3070, doi: 10.21437/Interspeech.2021-921

@inproceedings{wu21f_interspeech,
  author={Jian Wu and Zhuo Chen and Sanyuan Chen and Yu Wu and Takuya Yoshioka and Naoyuki Kanda and Shujie Liu and Jinyu Li},
  title={{Investigation of Practical Aspects of Single Channel Speech Separation for ASR}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3066--3070},
  doi={10.21437/Interspeech.2021-921}
}