9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

A Study of Unsupervised Clustering Techniques for Language Modeling

Sangyun Hahn (1), Abhinav Sethy (2), Hong-Kwang Jeff Kuo (2), Bhuvana Ramabhadran (2)

(1) University of Washington, USA; (2) IBM T.J. Watson Research Center, USA

There has been recent interest in clustering text data to build topic-specific language models for large-vocabulary speech recognition. In this paper, we studied various unsupervised clustering algorithms on several corpora. First, we compared the clustering methods using quality metrics such as entropy and purity. Of the techniques studied, two-phase bisecting K-means performed well while running relatively fast. We then performed speech recognition experiments on English and Arabic systems using the automatically derived topic-based language models, and obtained modest word error rate improvements comparable to those of previously published studies. We also present a careful analysis of the correlation between word error rate and the distribution of misrecognized words, including an information-gain metric.
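The abstract names entropy and purity as clustering quality metrics but does not give their formulas. The sketch below uses the common textbook definitions, which are an assumption here and may differ from the exact formulation in the full paper: purity is the fraction of documents that fall in the majority reference topic of their cluster, and entropy is the size-weighted average of the per-cluster topic entropy (in bits). The function name `cluster_purity_and_entropy` and its arguments are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def cluster_purity_and_entropy(cluster_ids, topic_labels):
    """Return (purity, entropy) for a clustering against reference topics.

    cluster_ids  -- cluster assignment of each document
    topic_labels -- reference topic of each document (same order)

    Assumed definitions (not taken from the paper):
      purity  = sum over clusters of (majority-topic count) / N
      entropy = sum over clusters of (n_k / N) * H(topics in cluster k),
                with H measured in bits.
    """
    if len(cluster_ids) != len(topic_labels):
        raise ValueError("cluster_ids and topic_labels must have equal length")

    # Count how many documents of each topic land in each cluster.
    topic_counts = defaultdict(Counter)
    for cluster, topic in zip(cluster_ids, topic_labels):
        topic_counts[cluster][topic] += 1

    total = len(cluster_ids)
    purity = 0.0
    entropy = 0.0
    for counts in topic_counts.values():
        size = sum(counts.values())
        # Contribution of this cluster's majority topic to purity.
        purity += max(counts.values()) / total
        # Shannon entropy of the topic distribution inside this cluster.
        cluster_entropy = -sum(
            (n / size) * math.log2(n / size) for n in counts.values()
        )
        entropy += (size / total) * cluster_entropy
    return purity, entropy


if __name__ == "__main__":
    # Toy example: six documents, two clusters, two reference topics.
    clusters = [0, 0, 0, 1, 1, 1]
    topics = ["a", "a", "b", "b", "b", "b"]
    purity, entropy = cluster_purity_and_entropy(clusters, topics)
    print(f"purity={purity:.3f} entropy={entropy:.3f}")
```

Under these definitions, higher purity and lower entropy both indicate clusters that align more closely with the reference topics. A perfect clustering gives a purity of 1 and an entropy of 0.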

Bibliographic reference. Hahn, Sangyun / Sethy, Abhinav / Kuo, Hong-Kwang Jeff / Ramabhadran, Bhuvana (2008): "A study of unsupervised clustering techniques for language modeling", in INTERSPEECH-2008, 1598-1601.