In this work, we study the effect of the training set on statistical language modeling (SLM). We present a corpus selection system based on perplexity and test it in two experiments: one selects the optimal training corpus for building a domain-specific SLM; the other builds an optimal SLM for a large vocabulary continuous speech recognition (LVCSR) system. The results show that the training corpus is important to SLM performance and that our corpus selection system is effective at selecting an optimal corpus. Using this system, we built an SLM for an LVCSR system that yielded a 14.5%--17.7% relative reduction in character errors.
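To make the idea of perplexity-based corpus selection concrete, the following is a minimal sketch, not the authors' system. The abstract does not describe the selection procedure, so everything here is an assumption for illustration: the add-one-smoothed unigram model, the fixed perplexity threshold, and the hypothetical helper names `train_unigram`, `perplexity`, and `select_corpus`. The sketch scores each candidate document against a small in-domain seed text and keeps the documents whose perplexity falls at or below the threshold.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Build an add-one-smoothed unigram model from seed tokens.

    Illustrative assumption: the abstract does not say which model is used.
    Returns a function mapping a token to its log-probability.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens

    def log_prob(token):
        return math.log((counts.get(token, 0) + 1) / (total + vocab_size))

    return log_prob


def perplexity(log_prob, tokens):
    """Perplexity = exp of the negative mean token log-probability."""
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty text")
    mean_log_prob = sum(log_prob(t) for t in tokens) / len(tokens)
    return math.exp(-mean_log_prob)


def select_corpus(seed_tokens, candidates, threshold):
    """Keep candidate documents whose perplexity under the seed model
    is at or below `threshold` (a hypothetical tuning parameter)."""
    log_prob = train_unigram(seed_tokens)
    return [doc for doc in candidates if perplexity(log_prob, doc) <= threshold]


if __name__ == "__main__":
    seed = "the cat sat on the mat".split()
    candidates = [
        "the cat sat".split(),        # close to the seed domain
        "stock market fell".split(),  # out of domain
    ]
    for doc in select_corpus(seed, candidates, threshold=8.0):
        print(" ".join(doc))
```

In practice the seed model would be a stronger n-gram model trained on in-domain text, and the threshold would be tuned on held-out data; both choices are outside what the abstract specifies.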
Cite as: Shen, X., Xu, B. (2001) The study of the effect of training set on statistical language modeling. Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001), 721-724, doi: 10.21437/Eurospeech.2001-217
@inproceedings{shen01b_eurospeech,
  author={Xipeng Shen and Bo Xu},
  title={{The study of the effect of training set on statistical language modeling}},
  year=2001,
  booktitle={Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001)},
  pages={721--724},
  doi={10.21437/Eurospeech.2001-217}
}