5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

Word Clustering for a Word Bi-gram Model

Shinsuke Mori, Masafumi Nishimura, Nobuyasu Itoh

Tokyo Research Laboratory, IBM Japan, Japan

In this paper we describe a word clustering method for class-based n-gram models. The clustering criterion is the entropy measured on a corpus different from the one used to estimate the n-gram model. The search method is based on a greedy algorithm. We applied this method to the Japanese EDR corpus and the English Penn Treebank corpus. The perplexities of the word-based n-gram model on the EDR corpus and the Penn Treebank are 153.1 and 203.5, respectively. Those of the class-based n-gram model estimated through our method are 146.4 and 136.0, respectively. These results indicate that our clustering method is better than Brown's method and Ney's method, which is called leaving-one-out.
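The idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the add-alpha smoothing, the singleton-class initialization, the frequency-ordered single pass over the vocabulary, and the names `cross_entropy` and `greedy_cluster` are all assumptions made for illustration. The sketch scores a class bi-gram model, P(w | v) = P(c(w) | c(v)) P(w | c(w)), by its cross-entropy on a held-out corpus, and greedily moves each word into the class that lowers that entropy the most.

```python
import math
from collections import Counter


def cross_entropy(train, heldout, cls, alpha=0.5):
    """Per-token cross-entropy (bits) of a class bi-gram model on `heldout`.

    Counts come from `train` only. Add-alpha smoothing is an assumption
    of this sketch, not something stated in the abstract.
    """
    n_classes = len(set(cls.values()))
    class_bigram = Counter((cls[a], cls[b]) for a, b in zip(train, train[1:]))
    class_context = Counter(cls[a] for a in train[:-1])
    class_total = Counter(cls[w] for w in train)
    word_count = Counter(train)
    class_size = Counter(cls.values())  # number of vocabulary words per class

    bits = 0.0
    for v, w in zip(heldout, heldout[1:]):
        cv, cw = cls[v], cls[w]
        p_class = (class_bigram[cv, cw] + alpha) / (class_context[cv] + alpha * n_classes)
        p_word = (word_count[w] + alpha) / (class_total[cw] + alpha * class_size[cw])
        bits -= math.log2(p_class * p_word)
    return bits / (len(heldout) - 1)


def greedy_cluster(train, heldout, alpha=0.5):
    """Greedy search (hypothetical variant): one pass over the vocabulary.

    Start with one class per word, then move each word to the class that
    gives the lowest held-out cross-entropy, keeping it in place if no
    move improves on the current value.
    """
    freq = Counter(train)
    vocab = sorted(set(train) | set(heldout), key=lambda w: (-freq[w], w))
    cls = {w: i for i, w in enumerate(vocab)}
    best = cross_entropy(train, heldout, cls, alpha)
    for w in vocab:
        home = cls[w]
        for target in sorted(set(cls.values()) - {home}):
            cls[w] = target  # tentative move
            h = cross_entropy(train, heldout, cls, alpha)
            if h < best:
                best, home = h, target
        cls[w] = home  # commit the best class found (possibly the original)
    return cls, best


if __name__ == "__main__":
    # Toy corpora, invented purely for illustration.
    train = "the cat sat on the mat the dog sat on the rug".split()
    heldout = "the dog sat on the mat the cat sat on the rug".split()
    cls, h = greedy_cluster(train, heldout)
    print("classes:", cls)
    print("held-out perplexity:", 2 ** h)
```

Because a move is accepted only when it lowers the held-out entropy, the resulting class model is never worse on the held-out corpus than the singleton-class starting point, which under this smoothing behaves like a smoothed word bi-gram model. Perplexity is reported as 2 raised to the per-token cross-entropy in bits.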

Bibliographic reference.  Mori, Shinsuke / Nishimura, Masafumi / Itoh, Nobuyasu (1998): "Word clustering for a word bi-gram model", In ICSLP-1998, paper 0989.