ISCA Archive ICSLP 1998

Word clustering for a word bi-gram model

Shinsuke Mori, Masafumi Nishimura, Nobuyasu Itoh

In this paper we describe a word clustering method for class-based n-gram models. The clustering criterion is the entropy measured on a corpus different from the one used to estimate the n-gram model. The search method is based on a greedy algorithm. We applied this method to the Japanese EDR corpus and the English Penn Treebank corpus. The perplexities of the word-based n-gram model on the EDR corpus and the Penn Treebank are 153.1 and 203.5, respectively. Those of the class-based n-gram model estimated through our method are 146.4 and 136.0, respectively. The results show that our clustering method is better than Brown's method and Ney's method, called leaving-one-out.
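
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract specifies only a held-out entropy criterion and a greedy search, so everything else here is an assumption made for illustration. That includes the function names `class_bigram_entropy` and `greedy_cluster`, the additive smoothing constant `alpha`, the closed vocabulary, the one-class-per-word starting point, and the move-one-word search step. The sketch estimates a class bi-gram model P(w_i | w_{i-1}) = P(c_i | c_{i-1}) P(w_i | c_i) on one corpus, measures per-word entropy on a different corpus, and keeps a word move only when that held-out entropy decreases.

```python
import math
from collections import Counter


def class_bigram_entropy(train, heldout, word2class, alpha=0.5):
    """Per-word cross-entropy (bits) of `heldout` under a class bi-gram
    model estimated from `train`. Additive smoothing with `alpha` is an
    assumption of this sketch, not something stated in the abstract."""
    class_bigrams = Counter()   # counts of (previous class, current class)
    class_history = Counter()   # counts of previous class
    word_counts = Counter()     # counts of current word
    class_counts = Counter()    # counts of current class
    for prev, cur in zip(train, train[1:]):
        cp, cn = word2class[prev], word2class[cur]
        class_bigrams[(cp, cn)] += 1
        class_history[cp] += 1
        word_counts[cur] += 1
        class_counts[cn] += 1

    n_classes = len(set(word2class.values()))
    class_size = Counter(word2class.values())  # number of words per class

    total_bits, n = 0.0, 0
    for prev, cur in zip(heldout, heldout[1:]):
        cp, cn = word2class[prev], word2class[cur]
        p_class = (class_bigrams[(cp, cn)] + alpha) / (
            class_history[cp] + alpha * n_classes)
        p_word = (word_counts[cur] + alpha) / (
            class_counts[cn] + alpha * class_size[cn])
        total_bits -= math.log2(p_class * p_word)
        n += 1
    return total_bits / n


def greedy_cluster(train, heldout, vocab):
    """Greedy search: start with one class per word, then repeatedly move
    a single word to the class that most reduces held-out entropy."""
    word2class = {w: i for i, w in enumerate(sorted(vocab))}
    best = class_bigram_entropy(train, heldout, word2class)
    improved = True
    while improved:
        improved = False
        for w in sorted(vocab):
            original = word2class[w]
            best_target = original
            for target in sorted(set(word2class.values())):
                if target == original:
                    continue
                word2class[w] = target  # tentative move
                h = class_bigram_entropy(train, heldout, word2class)
                if h < best - 1e-12:    # accept strict improvements only
                    best, best_target = h, target
            word2class[w] = best_target
            if best_target != original:
                improved = True
    return word2class, best


# Toy data (hypothetical, for illustration only).
train = "the cat sat the dog sat a cat ran a dog ran".split()
heldout = "the dog ran a cat sat".split()
vocab = set(train) | set(heldout)

baseline = class_bigram_entropy(train, heldout, {w: w for w in vocab})
clusters, entropy = greedy_cluster(train, heldout, vocab)
perplexity = 2 ** entropy  # perplexity is 2 raised to the per-word entropy
```

With one class per word, the class model reduces to a smoothed word bi-gram model, so `baseline` plays the role of the word-based figure and `entropy` the class-based one. Measuring the criterion on a separate corpus gives the search a natural stopping point, since merging classes stops paying off once it no longer helps on unseen text.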


doi: 10.21437/ICSLP.1998-658

Cite as: Mori, S., Nishimura, M., Itoh, N. (1998) Word clustering for a word bi-gram model. Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998), paper 0989, doi: 10.21437/ICSLP.1998-658

@inproceedings{mori98_icslp,
  author={Shinsuke Mori and Masafumi Nishimura and Nobuyasu Itoh},
  title={{Word clustering for a word bi-gram model}},
  year=1998,
  booktitle={Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998)},
  pages={paper 0989},
  doi={10.21437/ICSLP.1998-658}
}