5th International Conference on Spoken Language Processing
Tag definition in stochastic language models (n-gram and n-pos) is based on grouping together words with similar left and right context behavior. A modification of the n-gram model using multi-tagged words and unsupervised clustering has already been introduced for French, using a corpus of millions of untagged words. We present a variation of the bi-pos language model in which two tag sets are defined using grammatical information and both are assigned to each word (multi-tagged model). Each tag set is based on a different context behavior. We use linguistic expert knowledge and a simple automatic clustering procedure to obtain groups of words with similar left context behavior (first tag set) and groups of words with similar right context behavior (second tag set). The proposed grammatically based model is useful when no large text corpus is available. A performance increase has been observed when multi-tagged words are used, because the multi-tagged model adapts better to the language.
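As an illustration only, the asymmetric idea can be sketched as a bi-pos estimate in which the previous word contributes its right-context tag and the current word contributes its left-context tag. The factorization P(w_i | w_{i-1}) ≈ P(L(w_i) | R(w_{i-1})) · P(w_i | L(w_i)) is an assumption made for this sketch; the abstract does not state the exact formulation used in the paper. The toy English corpus, the tag names, and the functions `train` and `bigram_prob` are likewise hypothetical, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical toy corpus and hand-assigned tags (not from the paper).
SENTENCES = [["a", "dog", "runs"], ["an", "apple", "falls"], ["a", "cat", "runs"]]

# Left tag: groups words that appear after similar words.
LEFT_TAG = {"a": "DET", "an": "DET", "dog": "N_CONS", "cat": "N_CONS",
            "apple": "N_VOW", "runs": "V", "falls": "V"}
# Right tag: groups words that are followed by similar words.
RIGHT_TAG = {"a": "DET_CONS", "an": "DET_VOW", "dog": "N", "cat": "N",
             "apple": "N", "runs": "V", "falls": "V"}


def train(sentences, left_tag, right_tag):
    """Collect relative-frequency counts for the assumed factorization."""
    trans = Counter()   # (right tag of previous word, left tag of current word)
    rcount = Counter()  # right tag of previous word, counted per bigram
    emit = Counter()    # (left tag, word)
    lcount = Counter()  # left tag, counted per word occurrence
    for sent in sentences:
        for w in sent:
            emit[(left_tag[w], w)] += 1
            lcount[left_tag[w]] += 1
        for prev, cur in zip(sent, sent[1:]):
            trans[(right_tag[prev], left_tag[cur])] += 1
            rcount[right_tag[prev]] += 1
    return trans, rcount, emit, lcount


def bigram_prob(prev, cur, model, left_tag, right_tag):
    """Unsmoothed estimate P(L(cur) | R(prev)) * P(cur | L(cur))."""
    trans, rcount, emit, lcount = model
    r, l = right_tag[prev], left_tag[cur]
    if rcount[r] == 0 or lcount[l] == 0:
        return 0.0
    return (trans[(r, l)] / rcount[r]) * (emit[(l, cur)] / lcount[l])


MODEL = train(SENTENCES, LEFT_TAG, RIGHT_TAG)
```

In this toy setup "a" and "an" share a left tag but have different right tags, while "dog" and "apple" share a right tag but have different left tags, so `bigram_prob("a", "apple", MODEL, LEFT_TAG, RIGHT_TAG)` is zero even though a single symmetric tag set (determiner, noun) could not separate the two cases.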
Bibliographic reference. Pastor, Julio / Colas, José / San-Segundo, Ruben / Pardo, José Manuel (1998): "An asymmetric stochastic language model based on multi-tagged words", In ICSLP-1998, paper 1108.