INTERSPEECH 2008
9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Tandem Processing of Fepstrum Features

Vivek Tyagi

IBM India Research Lab, India

In our previous work [1, 2], we introduced Fepstrum, an improved modulation spectrum estimation technique that overcomes certain theoretical and practical shortcomings of previously published modulation-spectrum techniques [7, 8, 9]. In this paper, we provide further extensive ASR results using Tandem-processed Fepstrum features on the TIMIT corpus. The results are compared with TRAPS features derived from hierarchical and parallel structures of neural networks [3]. Unlike [3], where multiple neural networks are trained over multiple time-frequency patches or frequency bands, we train a single neural network on the concatenated Fepstrum and MFCC features to derive Tandem(Fepstrum+MFCC) features. The resulting phoneme recognition accuracy of the concatenated Tandem(Fepstrum+MFCC)+MFCC feature is 76.5% on the TIMIT core test set and 77.6% on the complete test set, placing these among the best reported results on the TIMIT continuous phoneme recognition task.
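The feature pipeline described above can be illustrated with a minimal sketch. It assumes the conventional Tandem recipe (log phoneme posteriors from one network, decorrelated by PCA, then appended to the MFCC stream); the abstract does not state these post-processing steps, so they are assumptions. All dimensions, the random stand-in features, and the untrained random weights are hypothetical and only show how the streams are combined, not the paper's actual configuration.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mlp_posteriors(x, w1, b1, w2, b2):
    # Single network with one hidden layer: frame features -> phoneme posteriors.
    hidden = np.tanh(x @ w1 + b1)
    return softmax(hidden @ w2 + b2)

def tandem_features(posteriors, n_keep):
    # Assumed Tandem post-processing: log posteriors, then PCA decorrelation.
    logp = np.log(posteriors + 1e-10)
    centered = logp - logp.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_keep]  # largest-variance directions first
    return centered @ eigvecs[:, order]

# Hypothetical sizes; none of these values come from the paper.
n_frames, d_fep, d_mfcc, n_hidden, n_phones, n_keep = 200, 30, 39, 64, 39, 25

rng = np.random.default_rng(0)
fepstrum = rng.standard_normal((n_frames, d_fep))  # stand-in for Fepstrum features
mfcc = rng.standard_normal((n_frames, d_mfcc))     # stand-in for MFCC features

# One network sees the concatenated Fepstrum+MFCC input.
net_input = np.hstack([fepstrum, mfcc])

# Untrained random weights; a real system would train these on phoneme labels.
w1 = 0.1 * rng.standard_normal((d_fep + d_mfcc, n_hidden))
b1 = np.zeros(n_hidden)
w2 = 0.1 * rng.standard_normal((n_hidden, n_phones))
b2 = np.zeros(n_phones)

post = mlp_posteriors(net_input, w1, b1, w2, b2)
tandem = tandem_features(post, n_keep)

# Final observation vector: Tandem(Fepstrum+MFCC) + MFCC.
final = np.hstack([tandem, mfcc])
print(final.shape)
```

In this sketch the final vector has n_keep + d_mfcc dimensions per frame; in a complete recognizer it would be the observation stream passed to the acoustic model.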

References

  1. V. Tyagi, "Fepstrum: An improved modulation spectrum for ASR," in Proc. of Interspeech 2007, Antwerp, Belgium. (ISCA Archive, http://www.isca-speech.org/archive/interspeech_2007)
  2. V. Tyagi and C. Wellekens, "Fepstrum representation of speech signal," in Proc. of IEEE ASRU 2005, Cancun, Mexico.
  3. P. Schwarz, P. Matejka and J. Cernocky, "Hierarchical structures of neural networks for phoneme recognition," in Proc. of IEEE ICASSP 2006.

Bibliographic reference. Tyagi, Vivek (2008): "Tandem processing of fepstrum features", in INTERSPEECH-2008, 2246-2249.