9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Effective Acoustic Adaptation for a Distant-Talking Interactive TV System

Jing Huang (1), Mark Epstein (1), Marco Matassoni (2)

(1) IBM T.J. Watson Research Center, USA; (2) FBK-irst, Italy

We study how to adapt a close-talking baseline acoustic model to a distant-talking application: the interactive TV dialogue system developed in the Distant-talking Interfaces for Control of Interactive TV (DICIT) project. We show that, for effective adaptation from out-of-domain data, it is better to acquire that data in the same DICIT environment than to use contaminated data. By measuring grammar error rate (GER) and action classification error rate (AER) in addition to word error rate (WER), we identify the best way to adapt the baseline model using the available out-of-domain adaptation data (TIMIT) and a small amount of in-domain (DICIT) adaptation data. The best approach is cascading MAP adaptation. With less than 5 hours of out-of-domain data and 1 hour of in-domain data, cascading MAP improves WER, GER, and AER by 17%, 18%, and 16% relative, respectively, over the baseline model. The experimental results show that in-domain adaptation data is needed to improve GER and AER.
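The abstract does not spell out the adaptation recipe, so the following is only a minimal sketch of what cascading MAP adaptation could look like for a single one-dimensional Gaussian mean. It assumes the standard MAP mean update (a prior-weighted average of the prior mean and the adaptation frames), gives every frame an occupancy of 1, and applies the update twice: first with out-of-domain data, then with in-domain data. The function names, the prior weight `tau`, and the toy values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def map_adapt_mean(prior_mean: float, frames: Sequence[float], tau: float = 10.0) -> float:
    """MAP update of one Gaussian mean, assuming each frame has occupancy 1.

    tau is a hypothetical prior weight: larger values keep the result
    closer to prior_mean, smaller values trust the adaptation data more.
    """
    n = len(frames)
    if n == 0:
        # No adaptation data: the prior mean is returned unchanged.
        return prior_mean
    return (tau * prior_mean + sum(frames)) / (tau + n)


def cascaded_map(baseline_mean: float,
                 out_of_domain: Sequence[float],
                 in_domain: Sequence[float],
                 tau: float = 10.0) -> float:
    """Two MAP passes: the stage-1 result becomes the prior for stage 2."""
    stage1 = map_adapt_mean(baseline_mean, out_of_domain, tau)
    return map_adapt_mean(stage1, in_domain, tau)


if __name__ == "__main__":
    # Toy values only; real systems adapt many multivariate Gaussians.
    adapted = cascaded_map(0.0, [1.0] * 10, [2.0] * 10, tau=10.0)
    print(adapted)
```

In this sketch the second pass starts from a prior that has already moved toward the out-of-domain data, which is one way to read "cascading": the small in-domain set refines an intermediate model rather than the original baseline.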

Bibliographic reference.  Huang, Jing / Epstein, Mark / Matassoni, Marco (2008): "Effective acoustic adaptation for a distant-talking interactive TV system", In INTERSPEECH-2008, 1709-1712.