11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Impact of Word Classing on Shrinkage-Based Language Models

Ruhi Sarikaya, Stanley F. Chen, Abhinav Sethy, Bhuvana Ramabhadran

IBM T.J. Watson Research Center, USA

This paper investigates the impact of word classing on a recently proposed shrinkage-based language model, Model M. Model M, a class-based n-gram model, has been shown to significantly outperform word-based n-gram models on a variety of domains. In past work, word classes for Model M were induced automatically from unlabeled text using the algorithm of Brown et al. We take a closer look at the classing and attempt to find out whether improved classing would also translate to improved performance. In particular, we explore the use of manually-assigned classes, part-of-speech (POS) tags, and dialog state information, considering both hard classing and soft classing. In experiments with a conversational dialog system (human--machine dialog) and a speech-to-speech translation system (human--human dialog), we find that better classing can improve Model M performance by up to 3% absolute in word-error rate.
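For context, a class-based n-gram in the style of Brown et al. factors the word probability through word classes. A standard bigram form of this decomposition (illustrative of the class-based idea; it is not Model M's exact parameterization) is:

$$
p(w_i \mid w_{i-1}) = p(w_i \mid c_i)\,p(c_i \mid c_{i-1})
$$

where $c_i$ denotes the class assigned to word $w_i$ under a hard classing. Under a soft classing, each word may belong to several classes, and the probability sums over the possible class assignments.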

Full Paper

Bibliographic reference.  Sarikaya, Ruhi / Chen, Stanley F. / Sethy, Abhinav / Ramabhadran, Bhuvana (2010): "Impact of word classing on shrinkage-based language models", In INTERSPEECH-2010, 1804-1807.