End-to-End SpeakerBeam for Single Channel Target Speech Recognition

Marc Delcroix, Shinji Watanabe, Tsubasa Ochiai, Keisuke Kinoshita, Shigeki Karita, Atsunori Ogawa, Tomohiro Nakatani


End-to-end (E2E) automatic speech recognition (ASR), which directly maps a sequence of speech features into a sequence of characters using a single neural network, has received a lot of attention as it greatly simplifies the training and decoding pipelines and enables optimizing the whole system end-to-end. Recently, such systems have been extended to recognize speech mixtures by inserting a speech separation mechanism into the neural network, allowing the system to output recognition results for each speaker in the mixture. However, speech separation suffers from a global permutation ambiguity issue, i.e., an arbitrary mapping between source speakers and outputs. We argue that this ambiguity would seriously limit the practical use of E2E separation systems. SpeakerBeam has been proposed as an alternative to speech separation that mitigates the global permutation ambiguity. SpeakerBeam aims at extracting only the target speaker from a mixture based on his or her speech characteristics, thus avoiding the global permutation problem. In this paper, we combine SpeakerBeam with an E2E ASR system to enable E2E training of a target speech recognition system. We show promising target speech recognition results on mixtures of two speakers, and discuss interesting properties of the proposed system in terms of its speech enhancement and diarization abilities.


DOI: 10.21437/Interspeech.2019-1856

Cite as: Delcroix, M., Watanabe, S., Ochiai, T., Kinoshita, K., Karita, S., Ogawa, A., Nakatani, T. (2019) End-to-End SpeakerBeam for Single Channel Target Speech Recognition. Proc. Interspeech 2019, 451-455, DOI: 10.21437/Interspeech.2019-1856.


@inproceedings{Delcroix2019,
  author={Marc Delcroix and Shinji Watanabe and Tsubasa Ochiai and Keisuke Kinoshita and Shigeki Karita and Atsunori Ogawa and Tomohiro Nakatani},
  title={{End-to-End SpeakerBeam for Single Channel Target Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={451--455},
  doi={10.21437/Interspeech.2019-1856},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1856}
}