INTERSPEECH 2004 - ICSLP
Incorporating long-term (~500-1000 ms) temporal information using multi-layer perceptrons (MLPs) has improved performance on ASR tasks, especially when used to complement traditional short-term (~25-100 ms) features. This paper further studies techniques for incorporating long-term temporal information in the acoustic model. The experiments show: 1) that simply widening the acoustic context by feeding more frames of full-band speech energies to the MLP is suboptimal compared to a more constrained two-stage approach that first models long-term temporal patterns in each critical band separately and then combines them; 2) that the best two-stage approach studied uses the hidden activation values of MLPs trained on the log critical band energies (LCBEs) of 51 consecutive frames; and 3) that combining the best two-stage approach with conventional short-term features significantly reduces word error rates on the 2001 NIST Hub-5 conversational telephone speech (CTS) evaluation set, with models trained on the Switchboard Corpus.
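The two-stage structure described above can be illustrated with a minimal NumPy sketch of the forward pass only. It is not the authors' implementation. The 51-frame context and the use of per-band hidden activations come from the abstract; the number of critical bands, the layer sizes, the sigmoid and softmax nonlinearities, the edge padding, the random untrained weights, and all function names are assumptions made for illustration.

```python
import numpy as np

CONTEXT = 51    # from the abstract: 51 consecutive frames per band
N_BANDS = 15    # assumption: number of critical bands
N_HIDDEN = 60   # assumption: hidden units in each per-band MLP
N_MERGE = 200   # assumption: hidden units in the merger MLP
N_OUT = 46      # assumption: number of output classes (e.g. phones)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def band_windows(lcbe, context=CONTEXT):
    """Turn LCBEs of shape (n_frames, n_bands) into per-band temporal
    windows of shape (n_frames, n_bands, context), centered on each frame.
    Edges are padded by repeating the first/last frame (an assumption)."""
    half = context // 2
    padded = np.pad(lcbe, ((half, half), (0, 0)), mode="edge")
    idx = np.arange(lcbe.shape[0])[:, None] + np.arange(context)[None, :]
    return padded[idx].transpose(0, 2, 1)


def init_params(rng, n_bands=N_BANDS):
    """Random, untrained weights; real weights would come from training."""
    return {
        "W_band": rng.normal(0.0, 0.1, (n_bands, CONTEXT, N_HIDDEN)),
        "b_band": np.zeros((n_bands, N_HIDDEN)),
        "W_merge": rng.normal(0.0, 0.1, (n_bands * N_HIDDEN, N_MERGE)),
        "b_merge": np.zeros(N_MERGE),
        "W_out": rng.normal(0.0, 0.1, (N_MERGE, N_OUT)),
        "b_out": np.zeros(N_OUT),
    }


def two_stage_forward(lcbe, p):
    # Stage 1: one MLP per critical band sees only that band's 51-frame
    # trajectory; its hidden activations are kept as the band's features.
    win = band_windows(lcbe)
    hidden = sigmoid(np.einsum("tbc,bch->tbh", win, p["W_band"]) + p["b_band"])

    # Stage 2: a merger MLP combines the hidden activations of all bands.
    merged_in = hidden.reshape(hidden.shape[0], -1)
    h = sigmoid(merged_in @ p["W_merge"] + p["b_merge"])
    logits = h @ p["W_out"] + p["b_out"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(logits)
    post /= post.sum(axis=1, keepdims=True)
    return hidden, post


rng = np.random.default_rng(0)
lcbe = rng.normal(size=(100, N_BANDS))      # stand-in for real LCBE features
hidden, posteriors = two_stage_forward(lcbe, init_params(rng))
print(hidden.shape, posteriors.shape)
```

By contrast, the single-stage alternative that the abstract reports as suboptimal would feed several frames of all bands at once into one MLP, without the per-band constraint in the first stage.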
Bibliographic reference. Chen, Barry / Zhu, Qifeng / Morgan, Nelson (2004): "Learning long-term temporal features in LVCSR using neural networks", In INTERSPEECH-2004, 925-928.