11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Incorporating Sparse Representation Phone Identification Features in Automatic Speech Recognition Using Exponential Families

Vaibhava Goel, Tara N. Sainath, Bhuvana Ramabhadran, Peder Olsen, David Nahamoo, Dimitri Kanevsky

IBM T.J. Watson Research Center, USA

Sparse representation phone identification features (SPIF) is a recently developed technique for estimating phone posterior probabilities conditioned on an acoustic feature vector. In this paper, we explore incorporating SPIF phone posterior probability estimates into a large vocabulary continuous speech recognition (LVCSR) task by including them as additional features of the exponential densities that model the HMM state emission likelihoods. We compare our proposed approach to a number of other well-known methods of combining feature streams or multiple LVCSR systems. Our experiments show that using exponential models to combine features results in a word error rate reduction of 0.5% absolute (18.7% down to 18.2%); this is comparable to the best error rate reduction obtained from system combination methods, but without having to build multiple systems or tune the system combination weights.
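
As a rough illustration of what "additional features of exponential densities" can mean, the sketch below writes a diagonal-covariance Gaussian state model in exponential-family form and adds a linear term over a second feature vector. It is not taken from the paper: the function names, the weight vector `lam`, and the choice to leave the extra term unnormalized are all assumptions made for this example, and the paper's actual parameterization and training procedure may differ.

```python
import math

def gaussian_natural_params(mean, var):
    """Natural parameters of a diagonal Gaussian: theta1 = mean/var, theta2 = 1/var."""
    theta1 = [m / v for m, v in zip(mean, var)]
    theta2 = [1.0 / v for v in var]
    return theta1, theta2

def log_partition(theta1, theta2):
    """Log normalizer of the diagonal Gaussian in natural-parameter form."""
    return sum(
        t1 * t1 / (2.0 * t2) + 0.5 * math.log(2.0 * math.pi / t2)
        for t1, t2 in zip(theta1, theta2)
    )

def state_score(x, extra, theta1, theta2, lam):
    """Log score of one HMM state (hypothetical form, for illustration only).

    x     : acoustic feature vector
    extra : additional feature vector (e.g. phone posterior estimates)
    lam   : per-state weights on the extra features (assumed, not from the paper)

    The acoustic part is an exact Gaussian log-likelihood. The extra term is a
    plain linear score and is NOT renormalized here, which is a simplification.
    """
    acoustic = sum(
        t1 * xi - 0.5 * t2 * xi * xi for t1, t2, xi in zip(theta1, theta2, x)
    ) - log_partition(theta1, theta2)
    additional = sum(l * e for l, e in zip(lam, extra))
    return acoustic + additional

def gaussian_log_pdf(x, mean, var):
    """Reference diagonal Gaussian log density, used to check the acoustic part."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - (xi - m) ** 2 / (2.0 * v)
        for xi, m, v in zip(x, mean, var)
    )
```

With `lam` set to all zeros, `state_score` reduces to the ordinary Gaussian log-likelihood, so the extra features act as a strict extension of the baseline state model rather than a separate system to be combined afterwards.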


Bibliographic reference. Goel, Vaibhava / Sainath, Tara N. / Ramabhadran, Bhuvana / Olsen, Peder / Nahamoo, David / Kanevsky, Dimitri (2010): "Incorporating sparse representation phone identification features in automatic speech recognition using exponential families", in INTERSPEECH-2010, 1345-1348.