EUROSPEECH 2003 - INTERSPEECH 2003
8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003

        

An Efficient Keyword Spotting Technique Using a Complementary Language for Filler Models Training

Panikos Heracleous (1), Tohru Shimizu (2)

(1) Nara Institute of Science and Technology, Japan
(2) KDDI R&D Laboratories Inc., Japan

The task of keyword spotting is to detect a set of keywords in continuous input speech. A keyword spotter must model not only the keywords but also the non-keyword intervals. For this purpose, filler (or garbage) models are used. To date, most keyword spotters have been based on hidden Markov models (HMMs). More specifically, a set of HMMs is used as garbage models. In this paper, a two-pass keyword spotting technique based on bilingual hidden Markov models is presented. In the first pass, our technique uses phonemic garbage models to represent the non-keyword intervals, and in the second pass the putative hits are verified using normalized scores. The main difference from similar approaches lies in the way the non-keyword intervals are modeled. In this work, the target language is Japanese, and English was chosen as the "garbage" language for training the phonemic garbage models. Experimental results on both clean and noisy telephone speech data showed higher performance than using a common set of acoustic models. Moreover, parameter tuning (e.g. word insertion penalty tuning) does not have a serious effect on the performance. For a vocabulary of 100 keywords and using clean telephone speech test data, we achieved a 92.04% recognition rate with only a 7.96% false alarm rate, without word insertion penalty tuning. Using noisy telephone speech test data, we achieved an 87.29% recognition rate with only a 12.71% false alarm rate.
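
The second-pass verification mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define how the scores are normalized, so everything below is an assumption made for illustration: the per-frame log-likelihood ratio between the keyword model and the garbage models, the threshold, and the names `PutativeHit`, `normalized_score`, and `verify_hits` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PutativeHit:
    """A keyword candidate produced by a (hypothetical) first decoding pass."""
    keyword: str
    start_frame: int
    end_frame: int          # exclusive
    keyword_loglik: float   # log-likelihood of the segment under the keyword model
    filler_loglik: float    # log-likelihood of the same segment under the garbage models


def normalized_score(hit: PutativeHit) -> float:
    """Per-frame log-likelihood ratio (an assumed normalization, not the paper's)."""
    n_frames = hit.end_frame - hit.start_frame
    if n_frames <= 0:
        raise ValueError("a putative hit must span at least one frame")
    return (hit.keyword_loglik - hit.filler_loglik) / n_frames


def verify_hits(hits: List[PutativeHit], threshold: float) -> List[PutativeHit]:
    """Keep only the putative hits whose normalized score reaches the threshold."""
    return [h for h in hits if normalized_score(h) >= threshold]


# Toy values chosen by hand; they are not experimental data.
candidates = [
    PutativeHit("tokyo", 0, 50, keyword_loglik=-200.0, filler_loglik=-250.0),
    PutativeHit("osaka", 10, 30, keyword_loglik=-120.0, filler_loglik=-110.0),
]
accepted = verify_hits(candidates, threshold=0.0)
```

Dividing by the segment length keeps long and short candidates comparable, which is one common reason for normalizing scores before thresholding; the actual scheme used by the authors may differ.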

Bibliographic reference.  Heracleous, Panikos / Shimizu, Tohru (2003): "An efficient keyword spotting technique using a complementary language for filler models training", In EUROSPEECH-2003, 921-924.