This paper describes an approach to efficiently derive, and discriminatively train, a weighted finite state transducer (WFST) representation for an articulatory feature-based model of pronunciation. The model was originally implemented as a dynamic Bayesian network (DBN). The work is motivated by a desire (1) to incorporate such a pronunciation model in WFST-based recognizers, and (2) to learn discriminative models that are more general than the DBNs. The approach is quite general, though here we show how it applies to one specific model. We use the conditional independence assumptions imposed by the DBN to efficiently convert it into a sequence of WFSTs (factor FSTs) which, when composed, yield the same model as the DBN. We then introduce a linear model of the arc weights of the factor FSTs and discriminatively learn its weights using the averaged perceptron algorithm. We demonstrate the approach on a lexical access task, in which we recognize a word given its surface realization. Our experimental results on a phonetically transcribed subset of the Switchboard corpus show that the discriminatively learned model performs significantly better than the original DBN.
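The averaged perceptron update mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature function `phi`, the two-word candidate list, and the toy surface pronunciations are all hypothetical, and candidates are scored by direct enumeration rather than by composing factor FSTs. In the paper's setting the features would instead correspond to arcs of the factor FSTs.

```python
# Minimal averaged perceptron for a toy lexical access task.
# Assumption: phi, the candidate words, and the data are illustrative only.

def phi(surface, word):
    """Hypothetical features: counts of (surface phone, word) pairs."""
    feats = {}
    for phone in surface:
        key = (phone, word)
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats

def score(weights, feats):
    """Linear score: dot product of the weights with the feature counts."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def predict(weights, surface, candidates):
    """Return the highest-scoring candidate word (ties go to the first)."""
    return max(candidates, key=lambda w: score(weights, phi(surface, w)))

def train_averaged_perceptron(data, candidates, epochs=5):
    weights = {}   # current weight vector
    totals = {}    # running sum of the weight vector after every example
    steps = 0
    for _ in range(epochs):
        for surface, gold in data:
            guess = predict(weights, surface, candidates)
            if guess != gold:
                # Promote features of the correct word, demote the guess.
                for f, v in phi(surface, gold).items():
                    weights[f] = weights.get(f, 0.0) + v
                for f, v in phi(surface, guess).items():
                    weights[f] = weights.get(f, 0.0) - v
            for f, v in weights.items():
                totals[f] = totals.get(f, 0.0) + v
            steps += 1
    # Averaging the weights over all steps reduces sensitivity to late updates.
    return {f: t / steps for f, t in totals.items()}

# Toy data: surface phone sequences paired with the intended word.
candidates = ["probably", "problem"]
data = [
    (("p", "r", "aa", "b", "l", "iy"), "probably"),
    (("p", "r", "aa", "l", "iy"), "probably"),
    (("p", "r", "aa", "b", "l", "ax", "m"), "problem"),
    (("p", "r", "aa", "b", "m"), "problem"),
]
avg_weights = train_averaged_perceptron(data, candidates)
predictions = [predict(avg_weights, s, candidates) for s, _ in data]
```

After training, `predictions` holds the word chosen for each toy surface form, which can be compared against the gold labels in `data`.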
Index Terms: articulatory features, discriminative training, finite state transducers, dynamic Bayesian networks
Bibliographic reference. Jyothi, Preethi / Fosler-Lussier, Eric / Livescu, Karen (2012): "Discriminatively learning factorized finite state pronunciation models from dynamic Bayesian networks", In INTERSPEECH-2012, 1063-1066.