Normally, we represent speech as a long sequence of frames and model the keyword with a relatively small set of parameters, commonly with a hidden Markov model (HMM). However, since the input speech is much longer than the keyword, suppose instead that we represent the speech as a relatively sparse set of impulses (roughly one per phoneme) and model the keyword as a filter bank, where each filter's impulse response relates to the likelihood of a phone at a given position within a word. Evaluating keyword detections can then be seen as a convolution of an impulse train with an array of filters. This view enables huge speedups; runtime no longer depends on the frame rate and is instead linear in the number of events (impulses). We apply this intuition to redesign the runtime engine behind the point process model for keyword spotting. We demonstrate impressive real-time speedups (500,000 times faster than real-time) with minimal loss in search accuracy.
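The event-driven view described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `keyword_scores`, the discrete time bins, and the toy filter weights are all assumptions made for illustration. Each phone event adds its filter's weights to the candidate keyword start times it supports, so the work grows with the number of events rather than the number of frames.

```python
from collections import defaultdict


def keyword_scores(events, filters):
    """Accumulate a detection score for each candidate keyword start time.

    events:  list of (phone, time_bin) impulses, roughly one per phone.
    filters: dict mapping phone -> {offset_bin: weight}, where offset_bin is
             the phone's position relative to the keyword start.
             (Hypothetical representation, chosen for this sketch.)
    """
    scores = defaultdict(float)
    for phone, t in events:
        # Phones that do not occur in the keyword have no filter and are skipped.
        for offset, weight in filters.get(phone, {}).items():
            # An event at time t with offset k supports a keyword starting at t - k.
            scores[t - offset] += weight
    return dict(scores)


# Toy filter bank for a three-phone keyword (weights are invented).
filters = {
    "k": {0: 2.0, 1: 1.0},
    "ae": {2: 1.0, 3: 2.0, 4: 1.0},
    "t": {5: 1.0, 6: 2.0},
}

# Sparse impulse train: (phone, time_bin) pairs.
events = [("s", 3), ("k", 10), ("ae", 13), ("t", 16), ("k", 40)]

scores = keyword_scores(events, filters)
best_start = max(scores, key=scores.get)
print(best_start, scores[best_start])
```

In this toy input the three events near bins 10 to 16 reinforce one another at a single start time, while the isolated "k" at bin 40 yields only a weak score. The loop touches each event once and each filter tap once per event, which is the sense in which the cost is linear in the number of events.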
Index Terms: keyword spotting, point process model
Bibliographic reference. Kintzley, Keith / Jansen, Aren / Church, Kenneth / Hermansky, Hynek (2012): "Inverting the point process model for fast phonetic keyword search", In INTERSPEECH-2012, 2438-2441.