This paper describes a new discriminative HMM parameter estimation technique. It supplements the usual ML optimization function with the emission (accept) likelihood of the aligned state (phone) and the rejection likelihoods from the rest of the states (phones). Intuitively, this new optimization function takes into the account as to how well the other states are rejecting the current frame that has been aligned with a given state. This simple scheme, termed as Maximum Accept and Reject (MARS), implicitly brings in the discriminative information and hence performs better than the ML trained models. As is well known, maximum mutual information (MMI)[3, 4] training needs a language model (lattice), encoding all possible sentences[7, 9], that could occur in the test conditions. MMI training uses this language model (lattice) to identify the confusable segments of speech in the form of the so-called "denominator" state occupation statistics . However, this implicitly ties the MMI trained acoustic model to a particular task-domain. MARS training does not face this constraint as it finds the confusable states at the frame level and hence does not use a language model (lattice) during training.
Bibliographic reference. Tyagi, Vivek (2008): "Maximum accept and reject (MARS) training of HMM-GMM speech recognition systems", In INTERSPEECH-2008, 956-959.