According to articulatory phonology, the gestural score is an invariant representation of speech. Although the timing scheme of the gestural activations, i.e., their onsets and offsets, may vary, the ensemble of these activations tends to remain unchanged and thus identifies the speech content. In this work, we propose a pronunciation modeling method that uses a finite state machine (FSM) to represent the invariance of a gestural score. Given the "canonical" gestural score of a word with a known activation timing scheme, the plausible activation onsets and offsets are recursively generated and encoded as a weighted FSM. Speech recognition is achieved by matching the recovered gestural activations to the FSM-encoded gestural scores of different speech contents. We carry out pilot word classification experiments using synthesized data from one speaker. The proposed pronunciation model achieves over 90% accuracy on a vocabulary of 139 words with no training observations, outperforming direct use of the "canonical" gestural scores.
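The idea of encoding timing variants of a canonical gestural score as weighted paths, then classifying by matching, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the gesture names, the frame-based timings, the `max_shift` and `shift_cost` parameters, and the use of brute-force enumeration in place of the recursive generation the abstract describes. The "FSM" is stored as a dictionary from accepted state paths to their lowest cost, which is equivalent to a small weighted acyclic acceptor.

```python
from itertools import product

# Hypothetical toy lexicon: word -> {gesture: (onset, offset)} in frames.
LEXICON = {
    "word_a": {"lip_closure": (0, 4), "velum_open": (2, 6), "tongue_tip": (5, 9)},
    "word_b": {"tongue_body": (0, 5), "glottis_open": (3, 8)},
}


def state_sequence(score):
    """Collapse a timed score into its ordered sequence of active-gesture sets."""
    start = min(on for on, _ in score.values())
    end = max(off for _, off in score.values())
    seq = []
    for t in range(start, end):
        active = frozenset(g for g, (on, off) in score.items() if on <= t < off)
        if not seq or seq[-1] != active:  # durations are abstracted away
            seq.append(active)
    return tuple(seq)


def build_fsm(canonical, max_shift=1, shift_cost=1.0):
    """Enumerate shifted timing schemes; keep the lowest cost per state path."""
    gestures = sorted(canonical)
    shifts = range(-max_shift, max_shift + 1)
    paths = {}
    for deltas in product(shifts, repeat=2 * len(gestures)):
        score, cost = {}, 0.0
        for i, g in enumerate(gestures):
            on = canonical[g][0] + deltas[2 * i]
            off = canonical[g][1] + deltas[2 * i + 1]
            if on >= off:  # skip degenerate activations
                break
            score[g] = (on, off)
            cost += shift_cost * (abs(deltas[2 * i]) + abs(deltas[2 * i + 1]))
        else:
            path = state_sequence(score)
            paths[path] = min(cost, paths.get(path, float("inf")))
    return paths


def classify(observed, fsms):
    """Return (word, cost) for the cheapest FSM accepting the observed score, or None."""
    obs = state_sequence(observed)
    best = None
    for word, paths in fsms.items():
        if obs in paths and (best is None or paths[obs] < best[1]):
            best = (word, paths[obs])
    return best


FSMS = {word: build_fsm(score) for word, score in LEXICON.items()}
```

In this sketch, an observation whose onsets or offsets drift without changing the order of activation events maps to the same state path as the canonical score, which is one simple way to picture the invariance the abstract refers to. Variants that reorder events are still accepted, at a higher cost.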
Bibliographic reference. Hu, Chi / Zhuang, Xiaodan / Hasegawa-Johnson, Mark (2010): "FSM-based pronunciation modeling using articulatory phonological code", In INTERSPEECH-2010, 2274-2277.