This paper investigates the use of perceptually motivated subband temporal envelope (STE) features and a time-delay neural network (TDNN) denoising autoencoder (DAE) to improve deep neural network (DNN)-based automatic speech recognition (ASR). STEs are estimated by full-wave rectification and low-pass filtering of speech that has been band-pass filtered with a Gammatone filter-bank. TDNNs are used either as DAEs or as acoustic models. ASR experiments are performed on the Aurora-4 corpus. Compared to conventional log-mel filter-bank (FBANK) features, STE features provide 2.2% and 3.7% relative word error rate (WER) reductions in ASR systems using DNN and TDNN acoustic models, respectively. Features enhanced by the TDNN DAE are recognized better by an ASR system using DNN acoustic models than by one using TDNN acoustic models. Improved ASR performance is obtained when features enhanced by the TDNN DAE are used in an ASR system with DNN acoustic models. In this scenario, STE features provide a 9.8% relative WER reduction compared to FBANK features.
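The STE estimation chain named in the abstract (Gammatone band-pass filtering, full-wave rectification, low-pass filtering) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 4th-order FIR Gammatone approximation, the 25 ms impulse-response length, the centre frequencies, and the Hann-window smoother with a nominal 50 Hz cutoff are all assumptions chosen for illustration, since the abstract does not give these settings.

```python
import numpy as np

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz, fs, order=4, duration_s=0.025):
    """FIR approximation of a Gammatone impulse response centred at fc_hz.

    order and duration_s are assumed values, not settings from the paper.
    """
    t = np.arange(int(duration_s * fs)) / fs
    b = 1.019 * erb(fc_hz)
    ir = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc_hz * t)
    return ir / np.sqrt(np.sum(ir ** 2))  # unit-energy normalisation (a choice made here)

def subband_temporal_envelopes(x, fs, center_freqs_hz, lp_cutoff_hz=50.0):
    """Return an array of shape (num_bands, len(x)), one envelope per band.

    Assumes x is longer than both filters, so mode="same" keeps len(x).
    """
    # Low-pass stage: normalised Hann window whose length is set from the
    # (assumed) cutoff; a crude smoother standing in for a designed filter.
    lp_len = max(3, int(fs / lp_cutoff_hz))
    lp = np.hanning(lp_len)
    lp /= lp.sum()
    envelopes = []
    for fc in center_freqs_hz:
        band = np.convolve(x, gammatone_ir(fc, fs), mode="same")   # band-pass
        rectified = np.abs(band)                                    # full-wave rectification
        envelopes.append(np.convolve(rectified, lp, mode="same"))  # low-pass filtering
    return np.array(envelopes)

# Usage: a 1 kHz tone should excite the 1 kHz band far more than its neighbours.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2.0 * np.pi * 1000.0 * t)
env = subband_temporal_envelopes(x, fs, [500.0, 1000.0, 2000.0])
```

In a feature-extraction front end, envelopes like these would typically be downsampled to the frame rate and compressed before being passed to the acoustic model; those steps are omitted here because the abstract does not describe them.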
Cite as: Do, C.-T., Stylianou, Y. (2017) Improved Automatic Speech Recognition Using Subband Temporal Envelope Features and Time-Delay Neural Network Denoising Autoencoder. Proc. Interspeech 2017, 3832-3836, doi: 10.21437/Interspeech.2017-1096
@inproceedings{do17c_interspeech,
  author={Cong-Thanh Do and Yannis Stylianou},
  title={{Improved Automatic Speech Recognition Using Subband Temporal Envelope Features and Time-Delay Neural Network Denoising Autoencoder}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3832--3836},
  doi={10.21437/Interspeech.2017-1096}
}