This paper presents a single-channel speech separation method implemented with a deep recurrent neural network (DRNN) built from recurrent temporal restricted Boltzmann machines (RTRBMs). Although deep neural network (DNN) based speech separation methods perform quite well on the denoising task compared to conventional statistical-model-based speech enhancement techniques, they often ignore the temporal correlations across speech frames, resulting in a loss of spectral detail in the reconstructed output speech. To alleviate this issue, one RTRBM is employed to model the acoustic features of the input (mixture) signal, and two RTRBMs are trained for the two training targets (source signals). Each RTRBM attempts to model both the abstractions present in the training data at each time step and the temporal dependencies across time steps. The entire network (three RTRBMs and one recurrent neural network) is then fine-tuned by jointly optimizing the DRNN with an extra masking layer that enforces a reconstruction constraint. The proposed method has been evaluated on the IEEE corpus and the TIMIT dataset for the speech denoising task. Experimental results show that the proposed approach outperforms non-negative matrix factorization (NMF) as well as conventional DNN- and DRNN-based speech enhancement methods.
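The masking layer mentioned above can be illustrated with a minimal sketch: the two network output streams are turned into a soft time-frequency mask that is applied to the mixture magnitude spectrogram, so the two source estimates are constrained to sum back to the mixture. The function name, array shapes, and the small epsilon term below are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def soft_mask_layer(y1_hat, y2_hat, mixture_mag, eps=1e-8):
    """Soft time-frequency masking enforcing a reconstruction constraint.

    y1_hat, y2_hat : (T, F) magnitude estimates from the two network outputs
    mixture_mag    : (T, F) magnitude spectrogram of the input mixture
    Returns the two masked source estimates, which sum to mixture_mag.
    """
    denom = np.abs(y1_hat) + np.abs(y2_hat) + eps  # eps avoids division by zero
    mask1 = np.abs(y1_hat) / denom                 # soft mask, values in [0, 1]
    mask2 = np.abs(y2_hat) / denom
    s1 = mask1 * mixture_mag                       # estimate of source 1
    s2 = mask2 * mixture_mag                       # estimate of source 2
    return s1, s2

Because mask1 + mask2 = 1 at every time-frequency bin, the two estimates always reconstruct the mixture magnitude exactly, which is the constraint the joint fine-tuning exploits.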
Cite as: Samui, S., Chakrabarti, I., Ghosh, S.K. (2017) Deep Recurrent Neural Network Based Monaural Speech Separation Using Recurrent Temporal Restricted Boltzmann Machines. Proc. Interspeech 2017, 3622-3626, doi: 10.21437/Interspeech.2017-57
@inproceedings{samui17_interspeech,
  author={Suman Samui and Indrajit Chakrabarti and Soumya K. Ghosh},
  title={{Deep Recurrent Neural Network Based Monaural Speech Separation Using Recurrent Temporal Restricted Boltzmann Machines}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={3622--3626},
  doi={10.21437/Interspeech.2017-57}
}