5th International Conference on Spoken Language Processing
In this paper we describe a word clustering method for class-based n-gram models. The clustering criterion is the entropy measured on a corpus different from the one used to estimate the n-gram model. The search is based on a greedy algorithm. We applied this method to the Japanese EDR corpus and the English Penn Treebank corpus. The perplexities of the word-based n-gram model on the EDR corpus and the Penn Treebank are 153.1 and 203.5, respectively. Those of the class-based n-gram model estimated with our method are 146.4 and 136.0, respectively. The results indicate that our clustering method is better than Brown's method and Ney's method, called leaving-one-out.
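The general idea of greedy clustering against a held-out entropy criterion can be illustrated with a minimal sketch. It is not the authors' implementation: the class bigram factorization P(w | prev) = P(c(w) | c(prev)) · P(w | c(w)), the add-alpha smoothing, the frequency-ordered single pass over the vocabulary, and the names `heldout_entropy` and `greedy_cluster` are all assumptions made for illustration.

```python
import math
from collections import Counter


def heldout_entropy(train, heldout, cls, alpha=0.5):
    """Per-token cross-entropy (bits) of `heldout` under a class bigram model
    estimated from `train`. Add-alpha smoothing is an assumption of this sketch."""
    n_classes = len(set(cls.values()))
    members = Counter(cls.values())                 # number of words in each class
    word_count = Counter(train)
    class_count = Counter(cls[w] for w in train)
    context_count = Counter(cls[w] for w in train[:-1])
    class_bigram = Counter((cls[a], cls[b]) for a, b in zip(train, train[1:]))

    total, n = 0.0, 0
    for a, b in zip(heldout, heldout[1:]):
        ca, cb = cls[a], cls[b]
        # P(class of b | class of a), smoothed over all classes
        p_class = (class_bigram[(ca, cb)] + alpha) / (context_count[ca] + alpha * n_classes)
        # P(b | class of b), smoothed over the words in that class
        p_word = (word_count[b] + alpha) / (class_count[cb] + alpha * members[cb])
        total -= math.log2(p_class * p_word)
        n += 1
    return total / n


def greedy_cluster(train, heldout, alpha=0.5):
    """Start with one class per word; move each word (most frequent first) into
    the class that lowers held-out entropy the most, if any such class exists."""
    vocab = sorted(set(train) | set(heldout))
    cls = {w: w for w in vocab}                     # class labels are arbitrary ids
    best = heldout_entropy(train, heldout, cls, alpha)
    freq = Counter(train)
    for w in sorted(vocab, key=lambda x: (-freq[x], x)):
        original = cls[w]
        best_target = original
        for target in sorted(set(cls.values()) - {original}):
            cls[w] = target                         # tentative move
            h = heldout_entropy(train, heldout, cls, alpha)
            if h < best:
                best, best_target = h, target
        cls[w] = best_target                        # keep the best move, or stay put
    return cls, best


if __name__ == "__main__":
    train = "the cat sat on the mat the dog sat on the rug a cat ran a dog ran".split()
    heldout = "the dog sat on the mat a cat sat on a rug".split()
    classes, entropy = greedy_cluster(train, heldout)
    print(f"held-out entropy: {entropy:.3f} bits, perplexity: {2 ** entropy:.1f}")
    print(classes)
```

Because a move is accepted only when it strictly lowers the held-out entropy, the final entropy never exceeds that of the initial one-word-per-class model. Recomputing all counts for every candidate move keeps the sketch short but is far too slow for a real corpus, where incremental count updates would be needed.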
Bibliographic reference. Mori, Shinsuke / Nishimura, Masafumi / Itoh, Nobuyasu (1998): "Word clustering for a word bi-gram model", In ICSLP-1998, paper 0989.