Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Vector Space Representation of Language Probabilities Through SVD of N-Gram Matrix

Shiro Terashima, Kazuya Takeda, Fumitada Itakura

Center for Integrated Acoustic Information Research (CIAIR), Nagoya University, Japan

In this paper we introduce a vector space representation of the N-gram language model in which K-dimensional vectors are assigned to both words and contexts (a context being a sequence of N-1 words), so that the scalar product of a ‘word vector’ and a ‘context vector’ gives the corresponding N-gram probability. The representation is obtained by singular value decomposition (SVD) of the co-occurrence frequency matrix (CFM) of contexts and words. Its effectiveness is examined by determining how far the number of N-gram parameters can be reduced through clustering and truncation of dimensions in the resulting vector space. The experimental results confirm that the number of model parameters can be reduced to less than 17.5% of the original, and that the proposed method is more effective than word clustering based on mutual information.
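The idea of word and context vectors whose scalar product approximates an N-gram probability can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The toy bigram counts, the row normalization to conditional probabilities, the function name `svd_vectors`, and the choice to split each singular value evenly (as a square root) between the two vector sets are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

def svd_vectors(matrix, k):
    """Return k-dimensional context and word vectors from a truncated SVD.

    Assumption: each singular value is split evenly (square root) between
    the context side and the word side. Other splits are equally valid.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    root = np.sqrt(s[:k])
    context_vectors = u[:, :k] * root   # shape: (n_contexts, k)
    word_vectors = vt[:k, :].T * root   # shape: (n_words, k)
    return context_vectors, word_vectors

# Hypothetical bigram counts: rows are contexts, columns are words.
counts = np.array([[4.0, 1.0, 0.0],
                   [1.0, 3.0, 1.0],
                   [0.0, 2.0, 5.0]])

# Assumption: decompose row-normalized counts, i.e. P(word | context).
probs = counts / counts.sum(axis=1, keepdims=True)

# With all dimensions kept, scalar products reproduce the matrix exactly.
ctx_full, wrd_full = svd_vectors(probs, k=3)
approx_full = ctx_full @ wrd_full.T

# Truncating to K = 2 dimensions gives a low-rank approximation.
ctx_low, wrd_low = svd_vectors(probs, k=2)
approx_low = ctx_low @ wrd_low.T

# The Frobenius error of the rank-2 approximation equals the dropped
# singular value (Eckart-Young theorem).
singular_values = np.linalg.svd(probs, compute_uv=False)
error = np.linalg.norm(probs - approx_low)
```

In this sketch the entry `approx_low[i, j]` is the scalar product of context vector `i` and word vector `j`. Storing the vectors takes K × (number of contexts + number of words) values instead of one value per context-word pair, which is where truncation can save parameters when the matrix is large. Note that a truncated approximation is not guaranteed to yield non-negative values that sum to one, so some renormalization would be needed in practice.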


Bibliographic reference.  Terashima, Shiro / Takeda, Kazuya / Itakura, Fumitada (2000): "Vector space representation of language probabilities through SVD of n-gram matrix", In ICSLP-2000, vol.2, 995-998.