One of the most challenging problems in speech recognition is dealing with the inevitable acoustic variations caused by non-linguistic factors. Recently, an invariant structural representation of speech was proposed, in which the non-linguistic variations are effectively removed through modeling the dynamic and contrastive aspects of speech signals. This paper describes our recent progress on this problem. Theoretically, we prove that maximum-likelihood-based decomposition leads to the same structural representation for a sequence and its transformed version. Practically, we introduce a method of discriminant analysis of eigen-structure to address two limitations of structural representations, namely high dimensionality and overly strong invariance. In the first experiment, we evaluate the proposed method on recognizing connected Japanese vowels. The proposed method achieves a recognition rate of 99.0%, which is higher than those of the previous structure-based recognition methods [2, 3, 4] and of word HMMs. In the second experiment, we examine the robustness of structural representations to vocal tract length (VTL) differences. The experimental results indicate that structural representations are much more robust to VTL changes than HMMs. Moreover, the proposed method is about 60 times faster than the previous structure-based methods.
Bibliographic reference. Qiao, Yu / Minematsu, Nobuaki / Hirose, Keikichi (2009): "On invariant structural representation for speech recognition: theoretical validation and experimental improvement", In INTERSPEECH-2009, 3055-3058.
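The invariance claimed in the abstract rests on the fact that distribution-to-distribution distances of the f-divergence family (e.g. the Bhattacharyya distance) are unchanged by invertible transforms of the feature space, so a matrix of such distances between speech events forms a transform-invariant "structure". A minimal sketch of this property, using univariate Gaussians and an affine "speaker transform" (the event parameters and the transform coefficients below are hypothetical illustration values, not from the paper):

```python
import math

def bhattacharyya_1d(m1, s1, m2, s2):
    # Bhattacharyya distance between two univariate Gaussians N(m, s^2)
    v1, v2 = s1 * s1, s2 * s2
    return 0.25 * (m1 - m2) ** 2 / (v1 + v2) + 0.5 * math.log((v1 + v2) / (2.0 * s1 * s2))

# Three hypothetical "speech events", each modeled as a Gaussian (mean, std)
events = [(1.0, 0.5), (3.0, 0.8), (-0.5, 1.2)]

def distance_matrix(evts):
    # The full distance matrix is the "structural representation"
    return [[bhattacharyya_1d(m1, s1, m2, s2) for (m2, s2) in evts]
            for (m1, s1) in evts]

d_orig = distance_matrix(events)

# Apply an invertible affine transform x -> a*x + b, mimicking a
# non-linguistic distortion such as a speaker difference
a, b = 1.7, -2.3
warped = [(a * m + b, abs(a) * s) for (m, s) in events]
d_warp = distance_matrix(warped)

# The distance matrix, i.e. the structure, is unchanged by the transform
for row_o, row_w in zip(d_orig, d_warp):
    for x, y in zip(row_o, row_w):
        assert abs(x - y) < 1e-12
print("structure is invariant under the affine transform")
```

Because each Gaussian's mean and standard deviation are mapped consistently by the transform, every pairwise distance cancels the transform's parameters, which is why the matrix of distances, rather than the absolute acoustic values, can serve as a speaker-invariant representation.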