INTERSPEECH 2014
15th Annual Conference of the International Speech Communication Association

Singapore
September 14-18, 2014

Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks

Tara N. Sainath (1), Vijayaditya Peddinti (2), Brian Kingsbury (1), Petr Fousek (1), Bhuvana Ramabhadran (1), David Nahamoo (1)

(1) IBM T.J. Watson Research Center, USA
(2) Johns Hopkins University, USA

Log-mel filterbank features, which are commonly used as input to CNNs, can remove higher-resolution information from the speech signal. A novel technique, known as the Deep Scattering Spectrum (DSS), addresses this issue by seeking to preserve this information. DSS features have shown promise on TIMIT, for both classification and recognition. In this paper, we extend the use of DSS features to LVCSR tasks. First, we explore the optimal multi-resolution time and frequency scattering operations for LVCSR tasks. Next, we explore techniques to reduce the dimension of the DSS features. We also incorporate speaker adaptation techniques into the DSS features. Results on 50-hour and 430-hour English Broadcast News tasks show that the DSS features provide a 4–7% relative improvement in WER over log-mel features, within a state-of-the-art CNN framework that incorporates speaker adaptation and sequence training. Finally, we show that DSS features are similar to multi-resolution log-mel + MFCC features, and that similar improvements can be obtained with this representation.
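
As a reading aid for the "relative improvement in WER" figure quoted above, the following minimal sketch shows how a relative WER reduction is conventionally computed from a baseline and an improved word error rate. The function name `relative_wer_improvement` and the example numbers are assumptions chosen for illustration only; they are not results or code from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer over baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    # Relative improvement = (baseline - new) / baseline, expressed in percent.
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent) for illustration only; not taken from the paper.
hypothetical_baseline = 20.0
hypothetical_new = 19.0
print(relative_wer_improvement(hypothetical_baseline, hypothetical_new))
```

Under this convention, a drop of one absolute WER point counts as a larger relative improvement when the baseline WER is lower, which is why relative figures are usually read alongside the absolute error rates.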


Bibliographic reference. Sainath, Tara N. / Peddinti, Vijayaditya / Kingsbury, Brian / Fousek, Petr / Ramabhadran, Bhuvana / Nahamoo, David (2014): "Deep scattering spectra with deep neural networks for LVCSR tasks", in INTERSPEECH-2014, 900-904.