This paper investigates acoustic models for automatic speech recognition (ASR) using deep neural networks (DNNs) whose input is taken directly from windowed speech waveforms (WSW). After demonstrating the ability of these networks to automatically acquire internal representations that are similar to mel-scale filter-banks, an investigation into efficient DNN architectures for exploiting WSW features is performed. First, a modified bottleneck DNN architecture is investigated to capture dynamic spectrum information that is not well represented in the time domain signal. Second,the redundancies inherent in WSW based DNNs are considered. The performance of acoustic models defined over WSW features is compared to that obtained from acoustic models defined over mel frequency spectrum coefficient (MFSC) features on the Wall Street Journal (WSJ) speech corpus. It is shown that using WSW features results in a 3.0 percent increase in WER relative to that resulting from MFSC features on the WSJ corpus. However, when combined with MFSC features, a reduction in WER of 4.1 percent is obtained with respect to the best evaluated MFSC based DNN acoustic model.
Bibliographic reference. Bhargava, Mayank / Rose, Richard (2015): "Architectures for deep neural network based acoustic models defined over windowed speech waveforms", In INTERSPEECH-2015, 6-10.