Residual + Capsule Networks (ResCap) for Simultaneous Single-Channel Overlapped Keyword Recognition

Yan Xiong, Visar Berisha, Chaitali Chakrabarti


Overlapped speech poses a significant problem in many speech processing applications, including speaker identification, speaker diarization, and speech recognition. To address it, existing systems combine source separation with algorithms designed for non-overlapped speech (e.g., source separation followed by speech recognition). In this paper, we propose a modified network architecture that recognizes keywords from overlapped speech simultaneously, without an explicit source separation step. We build the network by adding capsule layers to a ResNet architecture that has shown state-of-the-art performance on a traditional keyword recognition task. We evaluate the model on a series of 10-word overlapped keyword recognition experiments, using both speaker-dependent and speaker-independent training. Results indicate that the Residual + Capsule (ResCap) network shows marked improvement in recognizing overlapped speech, especially when the number of overlapped speakers differs between the training set and the test set.
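
The abstract does not describe the output stage, so the following is only an illustrative sketch of how a capsule output layer can report several keywords at once. It assumes the standard capsule formulation (Sabour et al., 2017), in which each class capsule emits a vector whose squashed length lies in [0, 1) and is read as the presence probability of that class. The labels, the raw capsule values, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
import math


def squash(vector):
    """Capsule squash nonlinearity: keeps direction, maps length into [0, 1)."""
    norm_sq = sum(x * x for x in vector)
    if norm_sq == 0.0:
        return [0.0 for _ in vector]
    norm = math.sqrt(norm_sq)
    # Output length is norm_sq / (1 + norm_sq); dividing by norm rescales the input.
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in vector]


def capsule_length(vector):
    """Euclidean length of a capsule output vector."""
    return math.sqrt(sum(x * x for x in vector))


def detect_keywords(raw_capsules, labels, threshold=0.5):
    """Return every label whose squashed capsule length exceeds the threshold.

    Each keyword is scored independently, so any number of keywords
    (zero, one, or several overlapped ones) can be reported together.
    """
    detected = []
    for label, raw in zip(labels, raw_capsules):
        if capsule_length(squash(raw)) > threshold:
            detected.append(label)
    return detected


# Hypothetical pre-squash outputs for three keyword capsules (assumed values).
labels = ["yes", "no", "stop"]
raw = [[3.0, 4.0], [0.1, 0.2], [2.0, 0.0]]
print(detect_keywords(raw, labels))  # prints ['yes', 'stop']
```

Because each capsule length is thresholded on its own rather than normalized across classes, a readout of this kind does not need to know in advance how many speakers overlap, which is one plausible reading of why such a design could tolerate a mismatch in speaker count between training and test data.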


DOI: 10.21437/Interspeech.2019-2913

Cite as: Xiong, Y., Berisha, V., Chakrabarti, C. (2019) Residual + Capsule Networks (ResCap) for Simultaneous Single-Channel Overlapped Keyword Recognition. Proc. Interspeech 2019, 3337-3341, DOI: 10.21437/Interspeech.2019-2913.


@inproceedings{Xiong2019,
  author={Yan Xiong and Visar Berisha and Chaitali Chakrabarti},
  title={{Residual + Capsule Networks (ResCap) for Simultaneous Single-Channel Overlapped Keyword Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3337--3341},
  doi={10.21437/Interspeech.2019-2913},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2913}
}