ISCA Archive Interspeech 2008

A study of unsupervised clustering techniques for language modeling

Sangyun Hahn, Abhinav Sethy, Hong-Kwang Jeff Kuo, Bhuvana Ramabhadran

There has been recent interest in clustering text data to build topic-specific language models for large vocabulary speech recognition. In this paper, we studied various unsupervised clustering algorithms on several corpora. First we compared the clustering methods with quality metrics such as entropy and purity. Of the techniques studied, two-phase bisecting K-means achieved good performance with relatively fast speed. Then we performed speech recognition experiments on English and Arabic systems using the automatically derived topic-based language models. We obtained modest word error rate improvements, comparable to previously published studies. A careful analysis of the correlation between word error rate and the distribution of misrecognized words, including an information-gain metric, is presented.
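The abstract names entropy and purity as the clustering quality metrics but does not give their formulas. The sketch below uses the common textbook definitions, which may differ in detail from those used in the paper: purity is the fraction of documents that fall in the majority reference topic of their cluster, and entropy is the size-weighted average of the per-cluster topic entropy. The function name `cluster_purity_and_entropy` and its arguments are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def cluster_purity_and_entropy(cluster_ids, topic_labels):
    """Return (purity, entropy) for a clustering against reference topics.

    Assumed definitions (standard, not confirmed by the paper):
      purity  = sum over clusters of (majority topic count) / N
      entropy = sum over clusters of (cluster size / N) * H(topics in cluster),
                with H measured in bits.
    """
    n = len(cluster_ids)

    # Count how many documents of each reference topic land in each cluster.
    clusters = {}
    for cluster, topic in zip(cluster_ids, topic_labels):
        clusters.setdefault(cluster, Counter())[topic] += 1

    purity = 0.0
    entropy = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        # Contribution of the dominant topic in this cluster.
        purity += max(counts.values()) / n
        # Shannon entropy of the topic distribution inside this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        entropy += (size / n) * h
    return purity, entropy


# Toy example: six documents, two clusters, two reference topics.
purity, entropy = cluster_purity_and_entropy(
    [0, 0, 0, 1, 1, 1],
    ["a", "a", "b", "b", "b", "b"],
)
```

Higher purity and lower entropy both indicate clusters that align more closely with the reference topics; a clustering that matches the topics exactly has purity 1 and entropy 0.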


doi: 10.21437/Interspeech.2008-266

Cite as: Hahn, S., Sethy, A., Kuo, H.-K.J., Ramabhadran, B. (2008) A study of unsupervised clustering techniques for language modeling. Proc. Interspeech 2008, 1598-1601, doi: 10.21437/Interspeech.2008-266

@inproceedings{hahn08b_interspeech,
  author={Sangyun Hahn and Abhinav Sethy and Hong-Kwang Jeff Kuo and Bhuvana Ramabhadran},
  title={{A study of unsupervised clustering techniques for language modeling}},
  year=2008,
  booktitle={Proc. Interspeech 2008},
  pages={1598--1601},
  doi={10.21437/Interspeech.2008-266}
}