Interspeech'2005 - Eurospeech
A major difference between the human auditory system and automatic speech recognition (ASR) lies in their representation of sound signals: whereas ASR uses a smoothed, low-dimensional temporal and spectral representation, our hearing system relies on extremely high-dimensional but temporally sparse spike trains. A strength of the latter representation is its inherent coding of time, which is exploited by neuronal networks along the auditory pathway. We demonstrate ASR results using features derived purely from simulated spike trains of auditory nerve fibers (ANF) and a layer of octopus neurons. Octopus neurons, located in the cochlear nucleus, are known for their distinct temporal processing: they not only reject steady-state excitation and fire on signal onsets but also enhance the amplitude modulations of voiced speech. With multi-condition training, we did not reach the performance of conventional mel-frequency cepstral coefficient (MFCC) features. With clean training, however, our spike-based features performed similarly to MFCCs. Further, recognition scores in noise improved when features derived from ANFs, which mainly represent spectral characteristics of speech signals, were combined with features derived from spike trains of octopus neurons. This result is promising given the relatively small number of neurons we used and the limitations in how the auditory model was interfaced to the ASR back end.
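As a rough illustration of the idea of turning spike trains into frame-level features, the following minimal sketch bins spike times into per-frame counts, derives a crude onset-sensitive stream, and concatenates the two. It is not the authors' auditory model or feature pipeline: the function names, the 10 ms frame length, and the use of a half-wave rectified first difference as a stand-in for onset-selective behaviour are all assumptions made for this example.

```python
from typing import List, Sequence


def spike_counts(spike_times: Sequence[float], duration: float,
                 frame: float = 0.010) -> List[int]:
    """Count spikes in consecutive frames (frame length is an assumption)."""
    n_frames = int(round(duration / frame))
    counts = [0] * n_frames
    for t in spike_times:
        idx = int(t / frame)  # index of the frame containing this spike
        if 0 <= idx < n_frames:
            counts[idx] += 1
    return counts


def onset_response(counts: Sequence[int]) -> List[int]:
    """Half-wave rectified first difference of the frame counts.

    A crude, hypothetical stand-in for onset selectivity: it responds to
    increases in firing and stays at zero for steady-state input.
    """
    out = []
    prev = 0
    for c in counts:
        out.append(max(c - prev, 0))
        prev = c
    return out


def combine(a: Sequence[int], b: Sequence[int]) -> List[List[int]]:
    """Concatenate two feature streams frame by frame."""
    return [[x, y] for x, y in zip(a, b)]


# Toy input: steady firing between 20 ms and 50 ms of a 60 ms signal.
spikes = [0.021, 0.025, 0.031, 0.035, 0.041, 0.045]
rate_stream = spike_counts(spikes, duration=0.06)
onset_stream = onset_response(rate_stream)
features = combine(rate_stream, onset_stream)
```

In this toy case the rate stream stays active for the whole burst, while the onset stream is non-zero only in the frame where firing begins, which is the kind of complementary information the combined features are meant to capture.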
Bibliographic reference. Holmberg, Marcus / Gelbart, David / Ramacher, Ulrich / Hemmert, Werner (2005): "Automatic speech recognition with neural spike trains", In INTERSPEECH-2005, 1253-1256.