EUROSPEECH 2003 - INTERSPEECH 2003
This paper addresses the question of how to design a large vocabulary recognition system so that it can simultaneously handle a sophisticated language model, perform state-of-the-art speaker adaptation, and run in one times real time (1xRT). The architecture we propose is based on classical HMM Viterbi decoding, but uses an extremely fast initial speaker-independent decoding to estimate VTL warp factors and feature-space and model-space MLLR transformations, which are then used in a final speaker-adapted decoding. We present results on past Switchboard evaluation data that indicate that this strategy compares favorably to published unlimited-time systems (running in several hundred times real time). Coincidentally, this is the system that IBM fielded in the 2003 EARS Rich Transcription evaluation.
Bibliographic reference. Saon, George / Zweig, Geoffrey / Kingsbury, Brian / Mangu, Lidia / Chaudhari, Upendra (2003): "An architecture for rapid decoding of large vocabulary conversational speech", In EUROSPEECH-2003, 1977-1980.