9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Towards Vocabulary-Independent Speech Indexing for Large-Scale Repositories

Jian Shao (1), Roger Peng Yu (1), Qingwei Zhao (2), Yonghong Yan (2), Frank Seide (1)

(1) Microsoft Research Asia, China; (2) Chinese Academy of Sciences, China

The out-of-vocabulary (OOV) problem remains a challenge for word-lattice-based speech indexing. Sub-word-based approaches address this problem effectively for small-scale tasks, but suffer from poor precision on large-scale databases due to the lack of strong language-model constraints. We propose a two-step method for searching for OOV queries in large-scale databases. First, result candidates are extracted from a sub-word-based system, ensuring high recall. The candidates are then refined by word-lattice rescoring, aiming at high precision. Experiments on a 160-hour lecture set show that the proposed approach achieves a relative improvement of 8.7% over the sub-word-based baseline, and of 19.7% when only single-word queries are considered.
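The two-step search described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, both thresholds, and the `rescoring_fn` callback are hypothetical names introduced here, and the callback merely stands in for the word-lattice rescoring, whose details the abstract does not give.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Hit:
    """A hypothetical search hit: where the query may occur, and how confidently."""
    doc_id: str
    start: float  # offset into the recording, in seconds
    score: float  # confidence in [0, 1]


def subword_candidates(hits: List[Hit], recall_threshold: float) -> List[Hit]:
    """Step 1: keep every sub-word hit above a deliberately low threshold (high recall)."""
    return [h for h in hits if h.score >= recall_threshold]


def rescore(candidates: List[Hit],
            rescoring_fn: Callable[[Hit], float],
            precision_threshold: float) -> List[Hit]:
    """Step 2: replace each candidate's score and keep only confident hits (high precision)."""
    rescored = [Hit(h.doc_id, h.start, rescoring_fn(h)) for h in candidates]
    kept = [h for h in rescored if h.score >= precision_threshold]
    return sorted(kept, key=lambda h: h.score, reverse=True)


def two_stage_search(hits: List[Hit],
                     rescoring_fn: Callable[[Hit], float],
                     recall_threshold: float = 0.1,      # assumed value
                     precision_threshold: float = 0.5    # assumed value
                     ) -> List[Hit]:
    """Run the permissive first pass, then the stricter rescoring pass."""
    candidates = subword_candidates(hits, recall_threshold)
    return rescore(candidates, rescoring_fn, precision_threshold)
```

Only candidates that survive the first pass are rescored, so the more expensive second pass runs on a small fraction of the database.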


Bibliographic reference. Shao, Jian / Yu, Roger Peng / Zhao, Qingwei / Yan, Yonghong / Seide, Frank (2008): "Towards vocabulary-independent speech indexing for large-scale repositories", In INTERSPEECH-2008, 2150-2153.