16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Training Deep Bidirectional LSTM Acoustic Model for LVCSR by a Context-Sensitive-Chunk BPTT Approach

Kai Chen (1), Zhi-Jie Yan (2), Qiang Huo (2)

(1) USTC, China
(2) Microsoft, China

This paper presents a study of using a deep bidirectional long short-term memory (DBLSTM) network as the acoustic model for DBLSTM-HMM based large vocabulary continuous speech recognition (LVCSR). A context-sensitive-chunk (CSC) backpropagation through time (BPTT) approach is used to train the DBLSTM by splitting each training sequence into chunks with appended contextual observations, and a (possibly overlapping) CSC-based decoding method is used for recognition. Our approach makes mini-batch based training on a GPU more efficient and reduces the latency of DBLSTM-based LVCSR from a whole utterance to a short chunk. Evaluations have been made on the Switchboard-I benchmark task. In comparison with epochwise BPTT training, our method achieves about a three-fold speed-up on a single GPU card. In comparison with a highly optimized DNN-HMM system trained with a frame-level cross entropy (CE) criterion, our CE-trained DBLSTM-HMM system achieves relative word error rate reductions of 9% and 5% on the Eval2000 and RT03S test sets, respectively.
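The chunk-splitting idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the chunk length and the left/right context sizes here are hypothetical placeholders, and the paper's actual settings may differ.

```python
# Hypothetical sketch of context-sensitive-chunk (CSC) splitting.
# chunk_size, left_ctx, and right_ctx are illustrative values,
# not the settings reported in the paper.

def split_into_cscs(frames, chunk_size=32, left_ctx=10, right_ctx=10):
    """Split a sequence of frames into context-sensitive chunks.

    Each chunk carries up to `left_ctx` preceding and `right_ctx`
    following frames as appended context; only the central
    `chunk_size` frames contribute training targets, so chunks from
    many utterances can be batched with a bounded length on the GPU.
    """
    chunks = []
    for start in range(0, len(frames), chunk_size):
        end = min(start + chunk_size, len(frames))
        ctx_start = max(0, start - left_ctx)
        ctx_end = min(len(frames), end + right_ctx)
        chunks.append({
            # BPTT runs over the whole context span ...
            "context": frames[ctx_start:ctx_end],
            # ... but errors/outputs are taken only from the central span
            # (indices are relative to the start of `context`).
            "target_span": (start - ctx_start, end - ctx_start),
        })
    return chunks

# A 100-frame "utterance" splits into 4 chunks of at most 32 frames.
cscs = split_into_cscs(list(range(100)))
print(len(cscs))  # -> 4
```

Because every chunk has a bounded length regardless of utterance length, recognition can also proceed chunk by chunk, which is what reduces decoding latency from a whole utterance to a short chunk.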


Bibliographic reference. Chen, Kai / Yan, Zhi-Jie / Huo, Qiang (2015): "Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach", in INTERSPEECH-2015, pp. 3600-3604.