Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Dynamic Selection of Feature Spaces for Robust Speech Recognition

Bhuvana Ramabhadran, Yuqing Gao, Michael Picheny

Human Language Technologies, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

Selection of acoustic features for robust speech recognition has been the subject of research for several years. Algorithms that use feature vectors from multiple frequency bands [1], or that switch between multiple feature streams [2], have been reported in the literature as ways to improve robustness under different acoustic conditions. Acoustic models built from different feature sets produce different kinds of recognition errors. In this paper, we propose a likelihood-based scheme that combines the acoustic feature vectors from multiple signal processing schemes within the decoding framework, in order to extract maximum benefit from these different acoustic feature vectors and models. The proposed technique is general enough to be applied to other pattern recognition fields, such as OCR and handwriting recognition. The fundamental idea behind this approach is to pick the set of features that classifies a frame of speech accurately with no a priori information about the phonetic class or acoustic channel that this speech comes from. Two methods of merging any set of acoustic features, such as formant-based features, cepstral feature vectors, PLP features, and LDA features, are presented in this paper.

These merging algorithms provide a relative reduction in error rate of 8% to 15% across a wide variety of wide-band, clean and noisy, large vocabulary continuous speech recognition tasks. Much of this gain comes from reduced insertion and substitution errors. Using the approach presented in this paper, we have achieved improved acoustic modeling without increasing the number of parameters: two 40K Gaussian systems, when merged, perform better than a 180K Gaussian system trained on the better of the two feature spaces.
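
The frame-level idea described in this abstract, scoring a frame under models built on several feature spaces and letting the likelihoods decide which space to trust, can be illustrated with a small sketch. This is not the authors' algorithm; the abstract does not give the details of the two merging methods. The sketch assumes one diagonal-covariance Gaussian per phonetic class per feature stream, and shows two generic merging rules: taking the maximum log-likelihood over streams, and taking a weighted sum of log-likelihoods. The function names, the class labels "AA" and "IY", the toy numbers, and the uniform weights are all hypothetical. It also assumes that log-likelihoods from different streams are directly comparable, which in practice would need streams of equal dimensionality or some normalization.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def stream_scores(frame_by_stream, models_by_stream):
    """Score one frame under every (class, stream) pair.

    frame_by_stream:  {stream: feature vector for this frame}
    models_by_stream: {stream: {class: (mean, var)}}
    Returns {class: {stream: log-likelihood}}.
    """
    streams = list(frame_by_stream)
    classes = list(models_by_stream[streams[0]])
    return {
        c: {s: log_gaussian(frame_by_stream[s], *models_by_stream[s][c])
            for s in streams}
        for c in classes
    }

def merge_max(scores):
    """Hypothetical rule 1: per class, keep the best-scoring stream."""
    return {c: max(by_stream.values()) for c, by_stream in scores.items()}

def merge_weighted(scores, weights):
    """Hypothetical rule 2: per class, weighted sum of stream log-likelihoods."""
    return {
        c: sum(weights[s] * ll for s, ll in by_stream.items())
        for c, by_stream in scores.items()
    }

# Toy example: one frame seen through two feature streams (values are made up).
frame = {"cepstral": [0.1, -0.2], "plp": [0.5, 0.4]}
models = {
    "cepstral": {"AA": ([0.0, 0.0], [1.0, 1.0]), "IY": ([3.0, 3.0], [1.0, 1.0])},
    "plp":      {"AA": ([0.0, 0.0], [1.0, 1.0]), "IY": ([2.0, 2.0], [1.0, 1.0])},
}

scores = stream_scores(frame, models)
max_merged = merge_max(scores)
weighted_merged = merge_weighted(scores, {"cepstral": 0.5, "plp": 0.5})

best_by_max = max(max_merged, key=max_merged.get)
best_by_weighted = max(weighted_merged, key=weighted_merged.get)
```

In a real decoder the merged per-class score would replace the single-stream acoustic score for each frame, so no a priori knowledge of the phonetic class or channel is needed to choose a feature space. How the weights are set, and whether the merge happens per Gaussian, per state, or per phone, are design choices this sketch leaves open.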

References

  1. K. Paliwal, "Spectral Subband Centroid Features for Speech Recognition," ICASSP'98, pp. 617-620, Seattle, May 1998.
  2. L. Jiang, "Unified Decoding and Feature Representation for Improved Speech Recognition," Eurospeech'99, pp. 1331-1334, Budapest, 1999.


Bibliographic reference.  Ramabhadran, Bhuvana / Gao, Yuqing / Picheny, Michael (2000): "Dynamic selection of feature spaces for robust speech recognition", In ICSLP-2000, vol.3, 913-916.