We have recently developed a novel type of structure-based speech recognizer, which uses a parameterized, non-recursive "hidden" trajectory model (HTM) of vocal tract resonances (VTRs), or formants, to capture the dynamic structure of long-range speech coarticulation and reduction. The underlying model of this recognizer carries out bi-directional FIR filtering on piecewise-constant sequences of VTR targets. In this paper, we elaborate on two key aspects of the model. First, the phonetic context controls the direction of movement, and thus the formation, of the VTR trajectories. This provides "structured" context dependency for speech acoustics without the context-dependent parameters that HMMs require. Second, the VTR targets, which are the key context-independent parameters of the model, vary across speakers. We describe an effective target-value normalization algorithm that can be applied to both training speakers and unknown test speakers. We report experimental results demonstrating the effectiveness of the normalization algorithm in the context of structure-based speech recognition. We also provide a computational analysis of the HTM-based speech decoder.
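As a rough illustration of the bi-directional FIR filtering of piecewise-constant VTR targets mentioned above, the following minimal sketch smooths a target sequence with a symmetric, non-causal filter. It is not the authors' implementation: the exponentially decaying tap shape, the parameter names `gamma` and `half_width`, the edge handling, and the target values and durations are all assumptions chosen only for illustration.

```python
def expand_targets(segments):
    """Build a piecewise-constant target sequence.

    segments: list of (target_hz, n_frames) pairs (hypothetical values).
    """
    seq = []
    for target, n_frames in segments:
        seq.extend([target] * n_frames)
    return seq


def bidirectional_fir(targets, gamma=0.6, half_width=7):
    """Smooth targets with a symmetric (two-sided) FIR filter.

    Assumed tap shape: gamma**|k| for k in [-half_width, half_width],
    normalized so the taps sum to 1.
    """
    offsets = range(-half_width, half_width + 1)
    taps = [gamma ** abs(k) for k in offsets]
    total = sum(taps)
    taps = [t / total for t in taps]

    n = len(targets)
    out = []
    for t in range(n):
        acc = 0.0
        for tap, k in zip(taps, offsets):
            # Clamp at the edges so the first/last target is held constant.
            idx = min(max(t + k, 0), n - 1)
            acc += tap * targets[idx]
        out.append(acc)
    return out


# A short middle segment flanked by two longer segments with the same target.
segments = [(500.0, 10), (1500.0, 4), (500.0, 10)]
targets = expand_targets(segments)
trajectory = bidirectional_fir(targets)
```

Because the filter looks both backward and forward in time, each frame of the smoothed trajectory is pulled toward the targets of both the preceding and the following segments. In this sketch the short middle segment never reaches its 1500 Hz target, which mimics the kind of context-driven undershoot (reduction) that the model is meant to capture without context-dependent parameters.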
Bibliographic reference. Yu, Dong / Deng, Li / Acero, Alex (2007): "Handling phonetic context and speaker variation in a structure-based speech recognizer", In INTERSPEECH-2007, 906-909.