8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


Training Data Optimization for Language Model Adaptation

Xiaoshan Fang (1), Jianfeng Gao (2), Jianfeng Li (3), Huanye Sheng (1)

(1) Shanghai Jiao Tong University, China
(2) Microsoft Research Asia, China
(3) University of Science and Technology of China, China

Language model (LM) adaptation is a necessary step when an LM is applied to speech recognition. Because the available in-domain (task-specific) data set is usually too small for LM training, the task of LM adaptation is to use out-of-domain data to improve the in-domain model's performance. LM adaptation faces two problems: the variable quality of the out-of-domain training data, and the mismatch between the n-gram distribution of the out-of-domain data and that of the in-domain data. This paper presents two methods, filtering and distribution adaptation, to address these problems respectively. First, a bootstrapping method is presented to filter suitable portions of two large, variable-quality out-of-domain data sets for our task. Then a new algorithm is proposed to adapt the n-gram distribution of the two data sets to that of a small task-specific data set, while guarding against over-fitting during adaptation. All resulting models are evaluated on the realistic application of email dictation. Experiments show that each method improves performance, and the combined method achieves a perplexity reduction of 24% to 80%.
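The abstract does not spell out the bootstrapping filter, but the idea of selecting out-of-domain text by its fit to an in-domain model can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: it assumes a toy add-one-smoothed unigram LM (where the paper would use a full n-gram model) and a hypothetical perplexity threshold for selection.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Train an add-one-smoothed unigram LM (a toy stand-in for an n-gram LM)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 slot for unknown words
    return lambda w: (counts.get(w, 0) + 1) / (total + vocab_size)

def perplexity(lm, sentence):
    """Per-word perplexity of a sentence under the unigram LM."""
    words = sentence.split()
    logp = sum(math.log(lm(w)) for w in words)
    return math.exp(-logp / max(len(words), 1))

def bootstrap_filter(in_domain, out_domain, threshold, rounds=2):
    """Iteratively select out-of-domain sentences that score below the
    perplexity threshold, retraining the LM on the enlarged set each round."""
    picked = []
    for _ in range(rounds):
        lm = train_unigram(list(in_domain) + picked)
        picked = [s for s in out_domain if perplexity(lm, s) < threshold]
    return picked
```

For example, with in-domain email sentences, an out-of-domain sentence sharing their vocabulary scores a lower perplexity than unrelated text and is retained, while the unrelated text is filtered out.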


Bibliographic reference. Fang, Xiaoshan / Gao, Jianfeng / Li, Jianfeng / Sheng, Huanye (2003): "Training data optimization for language model adaptation", in EUROSPEECH-2003, 1485-1488.