CNN-LSTM Models for Multi-Speaker Source Separation Using Bayesian Hyper Parameter Optimization

Jeroen Zegers, Hugo Van hamme


In recent years there have been many deep learning approaches to the multi-speaker source separation problem. Most use Long Short-Term Memory recurrent neural networks (LSTM-RNN) or Convolutional Neural Networks (CNN) to model the sequential behavior of speech. In this paper we propose a novel network for source separation that combines an encoder-decoder CNN and an LSTM in parallel. Hyperparameters must be chosen for both parts of the network, and they are potentially mutually dependent. Since hyperparameter grid search has a high computational burden, random search is often preferred. However, a newly sampled point in the hyperparameter space can lie very close to a previously evaluated point and thus contribute little additional information. Furthermore, random sampling is as likely to land in a promising region as in one dominated by poorly performing models. We therefore use a Bayesian hyperparameter optimization technique and find that the parallel CNN-LSTM outperforms both the LSTM-only and the CNN-only model.
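To illustrate the idea behind Bayesian hyperparameter optimization as contrasted with random search above, the following is a minimal 1-D sketch (not the authors' actual setup): a Gaussian-process surrogate models the validation loss over the hyperparameter range, and an expected-improvement acquisition picks the next trial point where low predicted loss and high uncertainty coincide. The quadratic `objective` and all settings (kernel length-scale, grid, iteration count) are illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt

def rbf_kernel(a, b, length=0.15):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_tr, y_tr, x_q, noise=1e-6):
    """Gaussian-process posterior mean and std at the query points."""
    K = rbf_kernel(x_tr, x_tr) + noise * np.eye(len(x_tr))
    K_s = rbf_kernel(x_tr, x_q)
    mu = K_s.T @ np.linalg.solve(K, y_tr)
    v = np.linalg.solve(K, K_s)
    var = 1.0 - np.sum(K_s * v, axis=0)   # diag of RBF kernel is 1
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """EI acquisition for minimization: reward predicted gain over best."""
    z = (best - mu) / sigma
    cdf = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (best - mu) * cdf + sigma * pdf

def objective(x):
    # stand-in for "validation loss as a function of one hyperparameter"
    return (x - 0.3) ** 2

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 201)
x_obs = rng.uniform(0.0, 1.0, 3)          # a few random initial trials
y_obs = objective(x_obs)

for _ in range(12):                        # Bayesian optimization loop
    mu, sigma = gp_posterior(x_obs, y_obs, grid)
    ei = expected_improvement(mu, sigma, y_obs.min())
    x_next = grid[np.argmax(ei)]           # most informative next trial
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = x_obs[np.argmin(y_obs)]
print(best_x)
```

Unlike random search, each trial here is conditioned on all previous evaluations, so the surrogate steers trials away from already-explored and poorly performing regions.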


 DOI: 10.21437/Interspeech.2019-2423

Cite as: Zegers, J., Van hamme, H. (2019) CNN-LSTM Models for Multi-Speaker Source Separation Using Bayesian Hyper Parameter Optimization. Proc. Interspeech 2019, 4589-4593, DOI: 10.21437/Interspeech.2019-2423.


@inproceedings{Zegers2019,
  author={Jeroen Zegers and Hugo Van hamme},
  title={{CNN-LSTM Models for Multi-Speaker Source Separation Using Bayesian Hyper Parameter Optimization}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4589--4593},
  doi={10.21437/Interspeech.2019-2423},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2423}
}