8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


An Architecture for Rapid Decoding of Large Vocabulary Conversational Speech

George Saon, Geoffrey Zweig, Brian Kingsbury, Lidia Mangu, Upendra Chaudhari

IBM T.J. Watson Research Center, USA

This paper addresses how to design a large vocabulary recognition system that can simultaneously handle a sophisticated language model, perform state-of-the-art speaker adaptation, and run in one times real time^1 (1xRT). The architecture we propose is based on classical HMM Viterbi decoding, but uses an extremely fast initial speaker-independent decoding to estimate VTL warp factors and feature-space and model-space MLLR transformations, which are then used in a final speaker-adapted decoding. Results on past Switchboard evaluation data indicate that this strategy compares favorably to published unlimited-time systems (running in several hundred times real time). This is also the system that IBM fielded in the 2003 EARS Rich Transcription evaluation.
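
To make the 1xRT constraint concrete, the following minimal sketch computes a real-time factor (processing time divided by audio duration) summed over the stages of a multi-pass system. The stage names mirror the passes described above, but the function name and all timing values are hypothetical placeholders chosen for illustration; they are not measurements from this work.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (1.0 means 1xRT)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical per-stage processing times, in seconds, for 600 s of audio.
# These numbers are illustrative assumptions, not results from the paper.
stage_seconds = {
    "speaker_independent_decode": 60.0,  # fast initial pass
    "adaptation_estimation": 90.0,       # VTL warp factors, MLLR transforms
    "speaker_adapted_decode": 420.0,     # final pass
}
audio_seconds = 600.0

# The whole pipeline meets 1xRT only if the stages together fit the budget.
total_rtf = real_time_factor(sum(stage_seconds.values()), audio_seconds)
within_budget = total_rtf <= 1.0
```

Under these assumed timings the stages sum to 570 s of processing for 600 s of audio, so the pipeline would fit within the 1xRT budget. The point of the sketch is only that the initial pass and the adaptation estimation must be cheap enough to leave most of the budget for the final speaker-adapted decoding.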

