We describe the development of a speech recognition system for conversational telephone speech (CTS) that incorporates acoustic features estimated by multilayer perceptrons (MLPs). The acoustic features are based on frame-level phone posterior probabilities, obtained by merging two different MLP estimators, one based on PLP-Tandem features, the other based on hidden activation TRAPs (HATs) features. This paper focuses on the challenges arising when incorporating these nonstandard features into a full-scale speech-to-text (STT) system, as used by SRI in the Fall 2004 DARPA STT evaluations. First, we developed a series of time-saving techniques for training feature MLPs on 1800 hours of speech. Second, we investigated which components of a multipass, multi-front-end recognition system are most profitably augmented with MLP features for best overall performance. The final system achieved a 2% absolute (10% relative) WER reduction over a comparable baseline system that did not include Tandem/HATs MLP features.
Cite as: Zhu, Q., Stolcke, A., Chen, B.Y., Morgan, N. (2005) Using MLP features in SRI's conversational speech recognition system. Proc. Interspeech 2005, 2141-2144, doi: 10.21437/Interspeech.2005-695
@inproceedings{zhu05c_interspeech,
  author={Qifeng Zhu and Andreas Stolcke and Barry Y. Chen and Nelson Morgan},
  title={{Using MLP features in SRI's conversational speech recognition system}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={2141--2144},
  doi={10.21437/Interspeech.2005-695}
}