INTERSPEECH 2012

This paper describes an approach to efficiently derive, and discriminatively train, a weighted finite state transducer (WFST) representation for an articulatory feature-based model of pronunciation. The model was originally implemented as a dynamic Bayesian network (DBN). The work is motivated by a desire to (1) incorporate such a pronunciation model in WFST-based recognizers, and (2) learn discriminative models that are more general than the DBNs. The approach is quite general, though here we show how it applies to a specific model. We use the conditional independence assumptions imposed by the DBN to efficiently convert it into a sequence of WFSTs (factor FSTs) which, when composed, yield the same model as the DBN. We then introduce a linear model of the arc weights of the factor FSTs and discriminatively learn its weights using the averaged perceptron algorithm. We demonstrate the approach using a lexical access task in which we recognize a word given its surface realization. Our experimental results using a phonetically transcribed subset of the Switchboard corpus show that the discriminatively learned model performs significantly better than the original DBN.
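The averaged perceptron training step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature map, the two-word lexicon, and the toy surface forms are all hypothetical, and exhaustive scoring over the lexicon stands in for decoding through the composed factor FSTs. It only shows the update rule (add features of the correct word, subtract features of the wrongly predicted word) and the averaging of weights over all training steps.

```python
from collections import defaultdict

def features(surface, word):
    # Hypothetical feature map: counts of (word, surface phone) pairs.
    # In the paper the features would be tied to arcs of the factor FSTs.
    feats = defaultdict(float)
    for phone in surface:
        feats[(word, phone)] += 1.0
    return feats

def score(weights, feats):
    # Linear model: dot product of weights and feature counts.
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def predict(weights, surface, lexicon):
    # Exhaustive search over a tiny lexicon (stand-in for WFST decoding).
    # Ties are broken in favour of the earliest word in the lexicon.
    return max(lexicon, key=lambda w: score(weights, features(surface, w)))

def train_averaged_perceptron(data, lexicon, epochs=5):
    weights = defaultdict(float)   # current weights
    totals = defaultdict(float)    # running sum of weights over all steps
    steps = 0
    for _ in range(epochs):
        for surface, gold in data:
            guess = predict(weights, surface, lexicon)
            if guess != gold:
                # Perceptron update: reward gold features, penalise guess.
                for k, v in features(surface, gold).items():
                    weights[k] += v
                for k, v in features(surface, guess).items():
                    weights[k] -= v
            for k, v in weights.items():
                totals[k] += v
            steps += 1
    # Averaged weights: mean of the weight vector over every step.
    return {k: v / steps for k, v in totals.items()}

# Toy lexical access data: (surface phone sequence, intended word).
lexicon = ["and", "the"]
data = [
    (("ae", "n", "d"), "and"),
    (("dh", "ah"), "the"),
    (("ae", "n"), "and"),
    (("dh", "iy"), "the"),
    (("n",), "and"),
]
averaged = train_averaged_perceptron(data, lexicon, epochs=5)
print(predict(averaged, ("dh", "ah"), lexicon))
```

Averaging the weights, rather than returning the final weight vector, is the standard way to reduce the perceptron's sensitivity to the order of the last few updates.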
Index Terms: articulatory features, discriminative training, finite state transducers, dynamic Bayesian networks
Bibliographic reference. Jyothi, Preethi / Fosler-Lussier, Eric / Livescu, Karen (2012): "Discriminatively learning factorized finite state pronunciation models from dynamic Bayesian networks", In INTERSPEECH-2012, 1063-1066.