We propose a framework that leverages articulatory phonology for speech recognition. “Gestural pattern vectors” (GPVs) encode the instantaneous gestural activations that exist across all tract variables at each time instant. Given a speech observation, recognizing the sequence of GPVs recovers the ensemble of gestural activations, i.e., the gestural score. For each word in the vocabulary, we use a task dynamic model of inter-articulator speech coordination to generate the “canonical” gestural score. Speech recognition is achieved by matching the recovered ensemble of gestural activations against these canonical scores. In particular, we estimate the likelihood of the recognized GPV sequence under word-dependent GPV sequence models trained on the “canonical” gestural scores. These likelihoods, weighted by the confidence scores of the recognized GPVs, are used in a Bayesian speech recognizer.
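As a rough illustration of the GPV idea described above, the following sketch treats a gestural score as a binary activation matrix (one row per tract variable, one column per time frame) and reads each column as one GPV. The function name, the binary representation, and the integer codebook are assumptions made for illustration only; the abstract does not specify how GPVs are encoded.

```python
from typing import Dict, List, Tuple


def gestural_score_to_gpv_sequence(
    score: List[List[int]],
) -> Tuple[List[int], Dict[Tuple[int, ...], int]]:
    """Map a toy gestural score to a sequence of GPV symbol ids.

    Assumption: score[i][t] is 1 if a gesture is active on tract
    variable i at frame t, and 0 otherwise.
    """
    n_frames = len(score[0])
    codebook: Dict[Tuple[int, ...], int] = {}
    sequence: List[int] = []
    for t in range(n_frames):
        # The GPV at frame t is the activation pattern across all tract variables.
        gpv = tuple(row[t] for row in score)
        if gpv not in codebook:
            # Assign ids in order of first appearance.
            codebook[gpv] = len(codebook)
        sequence.append(codebook[gpv])
    return sequence, codebook


# Two hypothetical tract variables over four frames.
toy_score = [
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
gpv_sequence, gpv_codebook = gestural_score_to_gpv_sequence(toy_score)
```

Under this toy encoding, recovering the GPV sequence is equivalent to recovering the activation matrix, which is the sense in which the GPV sequence determines the gestural score.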
Pilot gestural score recovery and word classification experiments are carried out using synthesized data from one speaker. The observation distribution of each GPV is modeled by a tandem model combining an artificial neural network with Gaussian mixtures. Bigram GPV sequence models are used to distinguish the gestural scores of different words. Given the tract variable time functions, about 80% of the instantaneous gestural activations are correctly recovered. Word recognition accuracy is over 85% for a vocabulary of 139 words with no training observations. These results suggest that the proposed framework might be a viable alternative to the classic sequence-of-phones model.
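The word classification step can be sketched as follows: a bigram model is trained for each word from its canonical GPV sequence, and a recognized GPV sequence is assigned to the word whose model gives the highest score. This is a minimal sketch, not the authors' implementation. The add-alpha smoothing, the uniform word prior, and the choice to multiply each log transition probability by the confidence of the current frame are all assumptions, since the abstract does not state how the confidence weighting is applied.

```python
import math
from typing import Callable, Dict, List, Tuple

LogProb = Callable[[int, int], float]


def train_bigram(sequence: List[int], n_symbols: int, alpha: float = 1.0) -> LogProb:
    """Train an add-alpha smoothed bigram model on one canonical GPV sequence."""
    counts: Dict[int, Dict[int, float]] = {}
    for prev, cur in zip(sequence[:-1], sequence[1:]):
        counts.setdefault(prev, {})
        counts[prev][cur] = counts[prev].get(cur, 0.0) + 1.0

    def log_prob(prev: int, cur: int) -> float:
        row = counts.get(prev, {})
        total = sum(row.values())
        return math.log((row.get(cur, 0.0) + alpha) / (total + alpha * n_symbols))

    return log_prob


def classify(
    recognized: List[int],
    confidences: List[float],
    word_models: Dict[str, LogProb],
) -> Tuple[str, Dict[str, float]]:
    """Pick the word maximizing the confidence-weighted bigram log-likelihood.

    Assumption: a uniform prior over words, so the Bayesian decision
    reduces to comparing weighted log-likelihoods.
    """
    scores: Dict[str, float] = {}
    for word, log_prob in word_models.items():
        scores[word] = sum(
            conf * log_prob(prev, cur)
            for prev, cur, conf in zip(recognized[:-1], recognized[1:], confidences[1:])
        )
    best = max(scores, key=lambda w: scores[w])
    return best, scores


# Hypothetical canonical GPV sequences for two words over three GPV symbols.
canonical = {"a": [0, 0, 1, 1, 2, 2], "b": [2, 2, 1, 1, 0, 0]}
models = {word: train_bigram(seq, n_symbols=3) for word, seq in canonical.items()}

recognized_gpvs = [0, 0, 1, 2, 2]
frame_confidences = [1.0, 1.0, 1.0, 1.0, 1.0]
best_word, word_scores = classify(recognized_gpvs, frame_confidences, models)
```

Because the word models are trained only on canonical sequences, this setup needs no recorded observations of the words themselves, which is consistent with the "no training observations" condition reported above.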
Bibliographic reference. Zhuang, Xiaodan / Nam, Hosung / Hasegawa-Johnson, Mark / Goldstein, Louis / Saltzman, Elliot (2009): "Articulatory phonological code for word classification", In INTERSPEECH-2009, 2763-2766.