Automatic speech recognition (ASR) performance has greatly improved with the introduction of convolutional neural networks (CNNs) and long short-term memory (LSTM) networks for acoustic modeling. Recently, the convolutional LSTM (CLSTM) has been proposed to apply the convolution operation directly within LSTM blocks, combining the advantages of the CNN and LSTM structures in a single architecture. This paper presents the first attempt to use CLSTMs for acoustic modeling. In addition, we propose a new forward-backward architecture to exploit long-term left/right context efficiently. The proposed scheme combines forward and backward LSTMs at different time points of an utterance, with the aim of modeling long-term, frame-invariant information such as speaker characteristics, channel, etc. Furthermore, unlike conventional bidirectional LSTM (BLSTM) architectures, the proposed forward-backward architecture can be trained with truncated back-propagation through time. We are therefore able to train deeply stacked CLSTM acoustic models, which is challenging in practice with conventional BLSTMs. Experimental results show that both the CLSTM and the forward-backward LSTM significantly improve word error rates compared to standard CNN and LSTM architectures.
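To make the idea of "convolution within the LSTM blocks" concrete, the following is a minimal NumPy sketch of a single convolutional LSTM time step. It is an illustration under stated assumptions, not the model from the paper: the convolution is assumed to run along the frequency axis only, peephole connections are omitted, and the names `conv1d_same`, `clstm_step`, `w_x`, `w_h`, and `b` are hypothetical. The only change from a standard LSTM step is that the matrix products on the input and the previous hidden state are replaced by convolutions.

```python
import numpy as np


def conv1d_same(x, w):
    """Zero-padded 'same' 1-D convolution (cross-correlation) along axis 0.

    x: (F, C_in) feature map over F frequency bins.
    w: (K, C_in, C_out) kernel with odd width K.
    Returns an array of shape (F, C_out).
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad the frequency axis only
    out = np.zeros((x.shape[0], w.shape[2]))
    for i in range(k):
        # Each kernel tap contributes a shifted slice times its weight matrix.
        out += xp[i:i + x.shape[0]] @ w[i]
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def clstm_step(x, h, c, w_x, w_h, b):
    """One CLSTM time step (assumed form, no peepholes).

    x: (F, C_in) input frame, h and c: (F, C_h) previous hidden/cell state.
    w_x: (K, C_in, 4*C_h), w_h: (K, C_h, 4*C_h), b: (4*C_h,).
    """
    # Gate pre-activations come from convolutions instead of dense products.
    z = conv1d_same(x, w_x) + conv1d_same(h, w_h) + b
    i, f, o, g = np.split(z, 4, axis=1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


# Usage: run a few frames of random "filterbank" features through the cell.
rng = np.random.default_rng(0)
F, C_in, C_h, K = 40, 1, 8, 5  # illustrative sizes, not taken from the paper
w_x = 0.1 * rng.standard_normal((K, C_in, 4 * C_h))
w_h = 0.1 * rng.standard_normal((K, C_h, 4 * C_h))
b = np.zeros(4 * C_h)
h = np.zeros((F, C_h))
c = np.zeros((F, C_h))
for _ in range(10):
    h, c = clstm_step(rng.standard_normal((F, C_in)), h, c, w_x, w_h, b)
```

Because the hidden and cell states keep their frequency axis, stacked layers of this kind preserve local spectral structure in the same way a CNN does, while the recurrence carries temporal context as in an LSTM.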
Cite as: Karita, S., Ogawa, A., Delcroix, M., Nakatani, T. (2017) Forward-Backward Convolutional LSTM for Acoustic Modeling. Proc. Interspeech 2017, 1601-1605, doi: 10.21437/Interspeech.2017-554
@inproceedings{karita17_interspeech,
  author={Shigeki Karita and Atsunori Ogawa and Marc Delcroix and Tomohiro Nakatani},
  title={{Forward-Backward Convolutional LSTM for Acoustic Modeling}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1601--1605},
  doi={10.21437/Interspeech.2017-554}
}