A novel framework based on conditional emission densities for hidden Markov models (HMMs) is proposed in this contribution to integrate speech enhancement systems with automatic speech recognition systems. In the training phase, the observed feature vectors, corrupted by background noise and reverberation, are used together with the interference estimates provided by the speech enhancement system to train joint densities of the observations and the interference estimates. In the decoding phase, the joint densities are transformed to conditional densities of the observed features given the interference estimates. Thus, front-end processing can be exploited for obtaining interference estimates, and the estimation errors can be modeled effectively in a data-driven way. Connected digit recognition experiments in a simulated reverberant environment show the potential of the proposed approach: HMMs with the proposed conditional densities outperform various configurations of conventional HMMs in the logarithmic mel-spectral domain. This is a first step towards using conditional densities for creating synergies between front end and back end.
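The step from joint to conditional densities can be illustrated with a minimal sketch. It assumes a single Gaussian joint density per HMM state over the stacked vector of observed features and interference estimates; the abstract does not state the density family, so this choice, as well as the names `fit_joint_gaussian`, `condition_gaussian`, and `gaussian_logpdf`, is hypothetical and not taken from the paper.

```python
import numpy as np


def fit_joint_gaussian(x, b):
    """Training-phase sketch: fit one Gaussian to stacked [x, b] frames.

    x: (frames, dim_x) observed features; b: (frames, dim_b) interference estimates.
    """
    z = np.hstack([x, b])
    return z.mean(axis=0), np.cov(z, rowvar=False)


def condition_gaussian(mean, cov, dim_x, b):
    """Decoding-phase sketch: mean and covariance of p(x | b) from the joint Gaussian."""
    mu_x, mu_b = mean[:dim_x], mean[dim_x:]
    s_xx = cov[:dim_x, :dim_x]
    s_xb = cov[:dim_x, dim_x:]
    s_bb = cov[dim_x:, dim_x:]
    # gain = s_xb @ inv(s_bb), computed with a linear solve instead of an explicit inverse
    gain = np.linalg.solve(s_bb, s_xb.T).T
    cond_mean = mu_x + gain @ (b - mu_b)
    cond_cov = s_xx - gain @ s_xb.T
    return cond_mean, cond_cov


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian, usable as an emission score."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in data: the interference estimate is correlated with the observation.
    b_train = rng.normal(size=(500, 2))
    x_train = 0.8 * b_train + 0.3 * rng.normal(size=(500, 2))
    mean, cov = fit_joint_gaussian(x_train, b_train)
    cond_mean, cond_cov = condition_gaussian(mean, cov, dim_x=2, b=b_train[0])
    print(gaussian_logpdf(x_train[0], cond_mean, cond_cov))
```

In this sketch the conditional covariance is never larger than the marginal covariance of the observations, which is one way to see how an informative interference estimate can sharpen the emission density during decoding.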
Bibliographic reference. Sehr, Armin / Yoshioka, Takuya / Delcroix, Marc / Kinoshita, Keisuke / Nakatani, Tomohiro / Maas, Roland / Kellermann, Walter (2013): "Conditional emission densities for combining speech enhancement and recognition systems", In INTERSPEECH-2013, 3502-3506.