5th European Conference on Speech Communication and Technology

Rhodes, Greece
September 22-25, 1997

A Novel Tree-Based Clustering Algorithm for Statistical Language Modeling

G. Damnati, J. Simonin

France Telecom CNET DIH/RCP, Lannion, France

This paper proposes a new method for clustering words into classes in order to define a statistical language model. The aim of the algorithm is to reduce the computational cost of the clustering task without degrading speech recognition performance. The algorithm performs a bottom-up hierarchical clustering using the reciprocal neighbours method, which consists of merging several pairs of classes within a single iteration. Experiments on a spontaneous speech corpus are presented, with results given in terms of both perplexity and word recognition error rate. The method greatly reduces the number of iterations needed to build a classification tree, and therefore the CPU time needed to build the model. It also reduces both perplexity and word error rate.
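To illustrate the idea of merging several pairs per iteration, here is a minimal Python sketch of bottom-up clustering with reciprocal neighbours. It is not the authors' implementation: the abstract does not state the merging criterion, so the sketch assumes each word is represented by a numeric vector and uses the Euclidean distance between class centroids as a stand-in. The names `centroid`, `distance`, and `reciprocal_neighbour_clustering` are hypothetical.

```python
def centroid(points):
    """Mean vector of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def distance(a, b):
    """Assumed criterion: Euclidean distance between class centroids."""
    ca, cb = centroid(a), centroid(b)
    return sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5


def reciprocal_neighbour_clustering(vectors):
    """Build a binary tree over the words in `vectors` (word -> tuple).

    Returns (tree, iterations), where the tree is a nested tuple of words.
    """
    # Each class is a pair: (subtree, member vectors).
    classes = [(word, [vec]) for word, vec in sorted(vectors.items())]
    iterations = 0
    while len(classes) > 1:
        # Nearest neighbour of every class; ties go to the lowest index,
        # which guarantees at least one reciprocal pair per iteration.
        nearest = []
        for i, (_, members) in enumerate(classes):
            others = (j for j in range(len(classes)) if j != i)
            nearest.append(
                min(others, key=lambda j: (distance(members, classes[j][1]), j))
            )
        # Merge every pair of classes that are each other's nearest neighbour.
        merged, used = [], set()
        for i, j in enumerate(nearest):
            if i < j and nearest[j] == i:
                merged.append(
                    ((classes[i][0], classes[j][0]), classes[i][1] + classes[j][1])
                )
                used.update((i, j))
        classes = merged + [c for k, c in enumerate(classes) if k not in used]
        iterations += 1
    return classes[0][0], iterations


toy_vectors = {"a": (0.0,), "b": (1.0,), "c": (10.0,), "d": (11.0,)}
tree, iterations = reciprocal_neighbour_clustering(toy_vectors)
print(tree, iterations)
```

On this toy input, the pairs (a, b) and (c, d) are merged in the same iteration, so the tree is completed in two iterations rather than the three that a one-merge-per-iteration scheme would need for four words.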

Bibliographic reference.  Damnati, G. / Simonin, J. (1997): "A novel tree-based clustering algorithm for statistical language modeling", In EUROSPEECH-1997, 2715-2718.