Speaker-Aware Neural Network Based Beamformer for Speaker Extraction in Speech Mixtures

Kateřina Žmolíková, Marc Delcroix, Keisuke Kinoshita, Takuya Higuchi, Atsunori Ogawa, Tomohiro Nakatani


In this work, we address the problem of extracting one target speaker from a multichannel mixture of speech. We use a neural network to estimate masks for the target speaker and derive beamformer filters from these masks, similarly to the recently proposed approach for extracting speech in the presence of noise. To overcome the permutation ambiguity of neural network mask estimation, which arises in the presence of multiple speakers, we propose to inform the neural network about the target speaker so that it learns to follow the speaker's characteristics through the utterance. We investigate and compare different methods of passing the speaker information to the network, such as making one layer of the network dependent on the speaker characteristics. Experiments on mixtures of two speakers demonstrate that the proposed scheme can track and extract a target speaker in both closed and open speaker set cases.
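As context for the mask-based beamforming pipeline the abstract describes, the following is a minimal sketch, not the authors' code: given a multichannel STFT and a time-frequency mask for the target speaker, it estimates target and interference spatial covariance matrices and applies a GEV (max-SNR) beamformer per frequency bin, one common filter derivation in mask-based approaches. In the paper the mask comes from a speaker-aware neural network; here it is simply passed in as an argument. The array layout (frequency, time, channel) and the function name are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def mask_based_gev_beamformer(Y, mask, eps=1e-8):
    """Sketch of mask-driven GEV beamforming (not the paper's exact recipe).

    Y    : complex STFT of the mixture, shape (F, T, C)  [assumed layout]
    mask : target-speaker mask in [0, 1], shape (F, T)
    Returns the beamformed single-channel STFT, shape (F, T).
    """
    F, T, C = Y.shape
    X_hat = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[f]                          # (T, C), one frame per row
        m = mask[f][:, None]               # (T, 1), broadcast over channels
        # Mask-weighted spatial covariance estimates:
        #   Phi_s from target-dominated bins, Phi_n from the rest.
        Phi_s = (m * Yf).T @ Yf.conj() / max(m.sum(), eps)
        Phi_n = ((1.0 - m) * Yf).T @ Yf.conj() / max((1.0 - m).sum(), eps)
        # GEV filter: principal generalized eigenvector of (Phi_s, Phi_n).
        # Small diagonal loading keeps Phi_n positive definite.
        _, vecs = eigh(Phi_s, Phi_n + eps * np.eye(C))
        w = vecs[:, -1]                    # eigenvalues ascending -> last one
        X_hat[f] = Yf @ w.conj()           # output x(t) = w^H y(t)
    return X_hat
```

The speaker-aware mask estimator itself (e.g. a layer whose weights depend on a target-speaker embedding) would replace the oracle `mask` input in a full system.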


DOI: 10.21437/Interspeech.2017-667

Cite as: Žmolíková, K., Delcroix, M., Kinoshita, K., Higuchi, T., Ogawa, A., Nakatani, T. (2017) Speaker-Aware Neural Network Based Beamformer for Speaker Extraction in Speech Mixtures. Proc. Interspeech 2017, 2655-2659, DOI: 10.21437/Interspeech.2017-667.


@inproceedings{Zmolikova2017,
  author={Kateřina Žmolíková and Marc Delcroix and Keisuke Kinoshita and Takuya Higuchi and Atsunori Ogawa and Tomohiro Nakatani},
  title={Speaker-Aware Neural Network Based Beamformer for Speaker Extraction in Speech Mixtures},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2655--2659},
  doi={10.21437/Interspeech.2017-667},
  url={http://dx.doi.org/10.21437/Interspeech.2017-667}
}