Statistical language models based on n-grams are inadequate for modeling long-distance syntactic and semantic dependencies in a language. Syntactic dependencies can be modeled using a grammatical representation of the text, while semantic dependencies can be captured using latent semantic analysis (LSA). Modeling both kinds of dependencies simultaneously, however, requires a unified framework to represent them. To this end, we present a mathematical framework called syntactically enhanced LSA (SELSA), which augments each word with the syntactic tag of its preceding word within the LSA framework. This leads to a statistical language model that uses the preceding syntactic information along with long-distance semantic information to assign probabilities to words. Preliminary experiments on the WSJ corpus show that SELSA reduces bigram perplexity by 33.92%, compared to a 36.33% reduction by LSA; however, it assigns better probabilities to syntactically and semantically regular words than LSA does.
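A minimal sketch of the core idea, under illustrative assumptions (toy corpus, POS tags, and latent dimension chosen arbitrarily, not the paper's actual setup): rows of the co-occurrence matrix are (preceding-tag, word) units rather than bare words, and a truncated SVD projects them into a latent semantic space, as in LSA.

```python
import numpy as np

# Toy corpus: each document is a list of (POS-tag, word) tokens.
corpus = [
    [("DT", "the"), ("NN", "bank"), ("VB", "lends"), ("NN", "money")],
    [("DT", "the"), ("NN", "river"), ("NN", "bank"), ("VB", "floods")],
]

# Index each (preceding-tag, word) unit.
units = {}
for doc in corpus:
    prev_tag = "<s>"                      # sentence-start pseudo-tag
    for tag, word in doc:
        units.setdefault((prev_tag, word), len(units))
        prev_tag = tag

# Build the unit-by-document count matrix.
X = np.zeros((len(units), len(corpus)))
for j, doc in enumerate(corpus):
    prev_tag = "<s>"
    for tag, word in doc:
        X[units[(prev_tag, word)], j] += 1
        prev_tag = tag

# Truncated SVD: keep the top-k singular triplets (k = 2 for this toy data).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
unit_vectors = U[:, :k] * s[:k]

# Each (preceding-tag, word) unit now has a k-dimensional vector; in an
# LSA-style language model, its similarity to a history vector would be
# mapped to a word probability.
print(unit_vectors.shape)   # (number of units, k)
```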
Cite as: Kanejiya, D., Kumar, A., Prasad, S. (2003) Statistical language modeling using syntactically enhanced LSA. Proc. Workshop on Spoken Language Processing, 93-100
@inproceedings{kanejiya03_wslp,
  author={Dharmendra Kanejiya and Arun Kumar and Surendra Prasad},
  title={{Statistical language modeling using syntactically enhanced LSA}},
  year=2003,
  booktitle={Proc. Workshop on Spoken Language Processing},
  pages={93--100}
}