This paper examines the main differences between language modelling of Russian and English. A Russian corpus and a comparable English corpus are described. The effects of the high degree of inflection in Russian are investigated, along with the relationship between the out-of-vocabulary rate and vocabulary size. Standard word and class N-gram language modelling techniques are applied to the two corpora and perplexity results are reported. A novel approach to the modelling of inflected languages is proposed and its efficacy is compared with that of the other techniques.
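The relationship between out-of-vocabulary (OOV) rate and vocabulary size mentioned above can be illustrated with a minimal sketch. It assumes the common definition of OOV rate as the fraction of test tokens missing from a vocabulary built from the most frequent training words. The function name `oov_rate` and the toy sentences are hypothetical and are not taken from the paper or its corpora.

```python
from collections import Counter

def oov_rate(train_tokens, test_tokens, vocab_size):
    """Fraction of test tokens not covered by the `vocab_size` most
    frequent training words (an assumed, commonly used definition)."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    counts = Counter(train_tokens)
    # Vocabulary = the vocab_size most frequent word types in training data.
    vocab = {word for word, _ in counts.most_common(vocab_size)}
    misses = sum(1 for word in test_tokens if word not in vocab)
    return misses / len(test_tokens)

# Toy data for illustration only (not from the paper's corpora).
train = "the cat sat on the mat the dog sat".split()
test = "the cat ran on the mat".split()

for size in (1, 6):
    print(size, oov_rate(train, test, size))
```

In a highly inflected language each lemma appears in many surface forms, so a fixed-size vocabulary of word forms would be expected to leave more test tokens uncovered than in English. The sketch only shows how such a curve could be measured, not the paper's results.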
Cite as: Whittaker, E.W.D., Woodland, P.C. (1998) Comparison of language modelling techniques for Russian and English. Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998), paper 0967, doi: 10.21437/ICSLP.1998-676
@inproceedings{whittaker98b_icslp,
  author={Edward W. D. Whittaker and Philip C. Woodland},
  title={{Comparison of language modelling techniques for Russian and English}},
  year=1998,
  booktitle={Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998)},
  pages={paper 0967},
  doi={10.21437/ICSLP.1998-676}
}