Symposium on Machine Learning in Speech and Language Processing (MLSLP)
Portland, Oregon, USA
Using the central observation that margin-based weighted classification error (modeled using Minimum Phone Error (MPE)) corresponds to the derivative with respect to the margin term of margin-based hinge loss (modeled using "boosted" Maximum Mutual Information (bMMI)), the differenced Maximum Mutual Information (dMMI) approach subsumes and extends margin-based MPE and bMMI within a broader framework in which the objective function is an integral of MPE loss over a range of margin values. Applying the Fundamental Theorem of Calculus, this integral is easily evaluated using finite differences of bMMI functionals. Practical lattice-based training using the new criterion can then be carried out using differences of bMMI gradients. Experimental results for training of Gaussian Mixture Model (GMM) based hidden Markov models (HMMs) using dMMI on Large Vocabulary Continuous Speech Recognition tasks show that dMMI with the right margin interval can recover the expected MPE or bMMI performance [McDermott et al. ICASSP 2010]; results for dMMI-trained feature transformations suggest that suitably chosen margin intervals can improve over the corresponding MPE- or bMMI-trained transformations [Delcroix et al. ICASSP 2012]. One consequence of these findings is that dMMI can be used to implement a close approximation of MPE without recourse to the modified Baum-Welch algorithm [Povey 2002], using a simple difference of bMMI functionals instead; conversely, dMMI can be used to verify that a given MPE implementation is correct, as it must match dMMI results for a narrow margin interval centered on the origin. Finally, dMMI can be used as the basis for a Bayesian framework where the margin-modified cost function is integrated over a general margin prior, approximated as a sum of dMMI functionals [McDermott, Acoustical Society of Japan, Spring 2010]. The Error-indexed Forward-Backward algorithm [McDermott et al. Interspeech 2008], which aggregates occupancies of lattice word strings by equal error count, can be used to visualize the way in which MPE and bMMI are special cases of the more general dMMI.
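The relationship between the bMMI, MPE and dMMI quantities described in the abstract can be illustrated with a small numerical sketch. This is not the lattice-based implementation from the cited papers: it assumes a toy list of four competing hypotheses in place of a lattice, made-up log scores and error counts, and a sign and normalization convention chosen for illustration. All function and variable names are hypothetical. Only the margin-dependent part of the bMMI functional (the log of the margin-weighted denominator sum) is modeled, since the reference term does not depend on the margin and cancels in the difference.

```python
import math

def log_denominator(scores, errors, sigma):
    """Margin-dependent part of a bMMI-style functional (assumed form):
    log sum_S exp(score_S + sigma * error_S), computed stably."""
    terms = [s + sigma * e for s, e in zip(scores, errors)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def margin_mpe_loss(scores, errors, sigma):
    """Expected error count under the margin-shifted posterior.
    This is the derivative of log_denominator with respect to sigma."""
    terms = [s + sigma * e for s, e in zip(scores, errors)]
    m = max(terms)
    weights = [math.exp(t - m) for t in terms]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)

def dmmi_loss(scores, errors, sigma1, sigma2):
    """Finite difference of the bMMI-style functional over [sigma1, sigma2].
    By the Fundamental Theorem of Calculus this equals the average of
    margin_mpe_loss over that margin interval."""
    upper = log_denominator(scores, errors, sigma2)
    lower = log_denominator(scores, errors, sigma1)
    return (upper - lower) / (sigma2 - sigma1)

# Toy data (assumed): log scores and error counts for four hypotheses;
# the first entry plays the role of the reference (zero errors).
scores = [0.0, -1.2, -0.5, -2.0]
errors = [0, 1, 2, 3]

# A narrow interval centered on the origin should approximate MPE.
mpe = margin_mpe_loss(scores, errors, 0.0)
narrow = dmmi_loss(scores, errors, -1e-4, 1e-4)

# A wider interval should match a numerical average of the MPE loss
# over the same range (midpoint rule with n sub-intervals).
sigma1, sigma2, n = -1.0, 0.0, 1000
step = (sigma2 - sigma1) / n
averaged = sum(
    margin_mpe_loss(scores, errors, sigma1 + (i + 0.5) * step)
    for i in range(n)
) / n
wide = dmmi_loss(scores, errors, sigma1, sigma2)

print(f"MPE loss at zero margin:        {mpe:.6f}")
print(f"dMMI on [-1e-4, 1e-4]:          {narrow:.6f}")
print(f"average MPE loss on [-1, 0]:    {averaged:.6f}")
print(f"dMMI on [-1, 0]:                {wide:.6f}")
```

In this sketch the first two printed values should agree closely, as should the last two. The first pair is the kind of consistency check the abstract suggests for validating an MPE implementation against dMMI on a narrow margin interval.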
Bibliographic reference. McDermott, Erik (2012): "An integrated framework for 'margin' based sequential discriminative training over lattices based on differenced maximum mutual information (dMMI)", In MLSLP-2012 (abstract).