ISCA Archive Eurospeech 2001

The study of the effect of training set on statistical language modeling

Xipeng Shen, Bo Xu

In this work, we study the effect of the training set on statistical language modeling (SLM). We present a corpus selection system based on perplexity and test it in two experiments: one selects the optimal training corpus for a domain-specific SLM; the other generates an optimal SLM for a large-vocabulary continuous speech recognition (LVCSR) system. The results show that the choice of training corpus matters for the capability of an SLM and that our corpus selection system is effective at selecting training data. With the help of this system, we generated an SLM for an LVCSR system, which contributed a 14.5%--17.7% relative character error reduction.
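
The abstract does not describe how the selection system works, so the following is only a minimal sketch of one way perplexity-based corpus selection could look, not the authors' method. It assumes an add-one-smoothed unigram model, whitespace tokenization, and the rule "train a model on each candidate corpus and prefer the one with the lowest perplexity on held-out in-domain text". The function names (`train_unigram`, `perplexity`, `rank_corpora`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return (token counts, total token count) for a unigram model."""
    return Counter(tokens), len(tokens)


def perplexity(model, tokens, vocab_size):
    """Perplexity of `tokens` under an add-one-smoothed unigram model."""
    counts, total = model
    log_prob = 0.0
    for tok in tokens:
        # Add-one smoothing is an assumption made for this sketch.
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))


def rank_corpora(candidates, heldout):
    """Rank candidate corpora by perplexity on held-out text, lowest first."""
    # Use one shared vocabulary so the scores are comparable across corpora.
    vocab = set(heldout)
    for toks in candidates.values():
        vocab.update(toks)
    scores = {
        name: perplexity(train_unigram(toks), heldout, len(vocab))
        for name, toks in candidates.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])


# Toy data, invented for illustration only.
candidates = {
    "news": "the market fell as the bank cut rates".split(),
    "sports": "the team won the match in extra time".split(),
}
heldout = "the bank cut rates".split()
ranking = rank_corpora(candidates, heldout)
```

Here `ranking` lists the candidate corpora from best to worst match for the held-out text. A realistic system would use higher-order n-gram models and much larger held-out sets.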


doi: 10.21437/Eurospeech.2001-217

Cite as: Shen, X., Xu, B. (2001) The study of the effect of training set on statistical language modeling. Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001), 721-724, doi: 10.21437/Eurospeech.2001-217

@inproceedings{shen01b_eurospeech,
  author={Xipeng Shen and Bo Xu},
  title={{The study of the effect of training set on statistical language modeling}},
  year=2001,
  booktitle={Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001)},
  pages={721--724},
  doi={10.21437/Eurospeech.2001-217}
}