In this paper, an accurate and compact language model is proposed to cope robustly with data sparseness and task dependencies. This language model adopts new categories generated by continuously interpolating POS word-class categories and word categories using MAP estimation. The new categories reflect word statistics efficiently without losing accuracy, and task-independent general word characteristics (i.e., grammatical constraints captured by POS statistics) are embedded to prevent task over-tuning. This modeling reduces the model size to 50% of that of conventional models. The bi-directional word-cluster N-grams generated by this modeling show 3% lower perplexity on a matched domain and 15% lower on a mismatched domain, compared to a conventional word 2-gram.
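The MAP-estimation interpolation described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a Dirichlet-style prior in which a class-based bigram probability smooths the raw word-bigram relative frequency, with a hypothetical prior weight `tau`; the toy counts and the uniform class model are illustrative only.

```python
# Minimal sketch (not the paper's code): MAP-style interpolation of a
# word bigram estimate with a POS-class bigram prior.
from collections import Counter

def map_bigram_prob(history, word, bigram_counts, unigram_counts,
                    class_prob, tau=2.0):
    """P(word | history) under a Dirichlet-style prior supplied by the
    class-based model: (c(h,w) + tau * P_class(w|h)) / (c(h) + tau).
    As c(h) grows, the estimate approaches the word-bigram relative
    frequency; with sparse data it falls back on the class model."""
    c_hw = bigram_counts.get((history, word), 0)
    c_h = unigram_counts.get(history, 0)
    return (c_hw + tau * class_prob(history, word)) / (c_h + tau)

# Toy corpus statistics (illustrative assumptions)
bigrams = Counter({("the", "cat"): 3, ("the", "dog"): 1})
unigrams = Counter({"the": 4})

# Assume a uniform class prior over a 2-word vocabulary for illustration
uniform_class = lambda h, w: 0.5

p = map_bigram_prob("the", "cat", bigrams, unigrams, uniform_class)
# (3 + 2.0 * 0.5) / (4 + 2.0) = 4/6
```

The prior weight `tau` controls how far the estimate is pulled toward the class-based model, which is the knob that lets the interpolation vary continuously between pure word statistics and pure POS-class statistics.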
Cite as: Yamamoto, H., Sagisaka, Y. (1999) Part-of-speech n-gram and word n-gram fused language model. Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999), 1803-1806, doi: 10.21437/Eurospeech.1999-362
@inproceedings{yamamoto99_eurospeech, author={Hirofumi Yamamoto and Yoshinori Sagisaka}, title={{Part-of-speech n-gram and word n-gram fused language model}}, year=1999, booktitle={Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999)}, pages={1803--1806}, doi={10.21437/Eurospeech.1999-362} }