9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Addressing the Out-of-Vocabulary Problem for Large-Scale Chinese Spoken Term Detection

Sha Meng (1), Jian Shao (2), Roger Peng Yu (2), Jia Liu (1), Frank Seide (2)

(1) Tsinghua University, China; (2) Microsoft Research Asia, China

While the Out-Of-Vocabulary (OOV) problem remains a challenge for English spoken term detection tasks, it is underestimated for Chinese. This is because a Chinese OOV query term can still be matched as a sequence of Chinese characters, with each character itself being a word in the vocabulary. However, our experiments show that search accuracy differs significantly depending on whether a query is in the vocabulary: In-Vocabulary (INV) queries outperform OOV queries by more than 20%. We examine this problem in a word-lattice-based spoken term detection task. We propose a two-stage method that first locates candidates by partial phonetic matching and then refines the matching score with word lattice rescoring. Experiments show that the proposed method achieves a 24.1% relative improvement for OOV queries on a large-scale Chinese spoken term detection task.
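
The two-stage idea can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the matching criterion or the rescoring formula, so the sliding-window overlap measure, the `min_overlap` threshold, the linear interpolation `weight`, and the `lattice_score` callback are all assumptions made purely for illustration, as are the function names and the toy pinyin data.

```python
from typing import Callable, List, Tuple


def partial_match_candidates(
    query: List[str], syllables: List[str], min_overlap: float = 0.6
) -> List[Tuple[int, float]]:
    """Stage 1 (assumed form): slide a query-length window over the
    recognized syllable sequence and keep windows in which at least
    `min_overlap` of the positions agree with the query."""
    n = len(query)
    candidates = []
    for start in range(len(syllables) - n + 1):
        window = syllables[start:start + n]
        hits = sum(q == s for q, s in zip(query, window))
        overlap = hits / n
        if overlap >= min_overlap:
            candidates.append((start, overlap))
    return candidates


def rescore(
    candidates: List[Tuple[int, float]],
    lattice_score: Callable[[int], float],
    weight: float = 0.5,
) -> List[Tuple[int, float]]:
    """Stage 2 (assumed form): interpolate the phonetic overlap with a
    score looked up from the word lattice, then rank candidates."""
    rescored = [
        (start, weight * overlap + (1.0 - weight) * lattice_score(start))
        for start, overlap in candidates
    ]
    return sorted(rescored, key=lambda c: c[1], reverse=True)


# Toy data: the query appears exactly once and once with one syllable wrong.
query = ["qing", "hua", "da", "xue"]
recognized = ["wo", "zai", "qing", "hua", "da", "xue",
              "du", "shu", "jing", "hua", "da", "xue"]

# Hypothetical lattice scores keyed by start position (stand-in for a
# real lattice posterior lookup).
toy_lattice = {2: 0.9, 8: 0.2}

hits = partial_match_candidates(query, recognized)
ranked = rescore(hits, lambda start: toy_lattice.get(start, 0.0))
print(ranked)
```

In this toy run the first stage keeps both the exact occurrence and the near miss, and the second stage separates them using the lattice-derived score. A real system would replace the dictionary lookup with scores computed from the recognizer's word lattice.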

Bibliographic reference. Meng, Sha / Shao, Jian / Yu, Roger Peng / Liu, Jia / Seide, Frank (2008): "Addressing the out-of-vocabulary problem for large-scale Chinese spoken term detection", in INTERSPEECH-2008, 2146-2149.