Recently, neural networks have been used not only for phone recognition but also for denoising and dereverberation. However, the conventional denoising deep autoencoder (DAE) based on a feed-forward structure cannot handle the very long time spans affected by reverberation. An LSTM can be effectively trained to reduce the average error between the enhanced signal and the original clean signal by taking into account the effect of frames in the long past. In this paper, we demonstrate that considering a context as long as the maximum reverberation time of the database is effective. Since the effect of reverberation varies depending on the phone class of the surrounding speech context, we augment the input of the autoencoder with phone-class information for the past frames as well as the current frame, and call this version of the LSTM autoencoder pLSTM. In a speech recognition experiment using the data set of the Reverb Challenge 2014, the LSTM front-end reduced the WER of the multi-condition DNN-HMM by 14.5%, and the phone-class feature of the pLSTM yielded a further improvement of 7.5%. The performance of the pLSTM is comparable to that of the pDAE, while its number of parameters is only 1/25 to 1/8 as large.
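The input augmentation described in the abstract can be sketched as follows. This is a minimal, illustrative construction assuming log-spectral feature frames and one-hot (or posterior) phone-class vectors; the function name, the context length, and the exact feature layout are assumptions, not details taken from the paper.

```python
import numpy as np

def augment_with_phone_class(spec_frames, phone_post, context=5):
    """Append phone-class vectors for the current frame and `context`
    past frames to each spectral input frame (illustrative sketch).

    spec_frames: (T, D) array of spectral features
    phone_post:  (T, C) array of phone-class posteriors/one-hots
    returns:     (T, D + (context + 1) * C) augmented input
    """
    T, D = spec_frames.shape
    _, C = phone_post.shape
    out = np.zeros((T, D + (context + 1) * C))
    for t in range(T):
        # Past frames before t = 0 are padded by repeating frame 0.
        ctx = [phone_post[max(t - k, 0)] for k in range(context + 1)]
        out[t] = np.concatenate([spec_frames[t]] + ctx)
    return out
```

The augmented frames would then be fed to the LSTM autoencoder in place of the plain spectral input; how the phone-class posteriors are obtained (e.g., from a first-pass recognizer) is not specified here.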
Bibliographic reference. Mimura, Masato / Sakai, Shinsuke / Kawahara, Tatsuya (2015): "Speech dereverberation using long short-term memory", In INTERSPEECH-2015, 2435-2439.