ISCA Archive Interspeech 2013

Conditional emission densities for combining speech enhancement and recognition systems

Armin Sehr, Takuya Yoshioka, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Roland Maas, Walter Kellermann

A novel framework based on conditional emission densities for hidden Markov models (HMMs) is proposed in this contribution to integrate speech enhancement systems with automatic speech recognition systems. In the training phase, the observed feature vectors, corrupted by background noise and reverberation, together with estimates for the interference as provided by the speech enhancement system, are used for training joint densities of the observations and the interference estimates. In the decoding phase, the joint densities are transformed to conditional densities of the observed features given the interference estimates. Thus, front-end processing can be exploited for obtaining interference estimates, and the estimation errors can be modeled very effectively in a data-driven way. Connected digit recognition experiments in a simulated reverberant environment show the potential of the proposed approach: HMMs with the proposed conditional densities outperform various configurations of conventional HMMs in the logarithmic mel-spectral domain. This is a first step towards using conditional densities for creating synergies between front end and back end.
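
The transformation from a joint density to a conditional density can be illustrated with a minimal sketch, assuming a single Gaussian per HMM state. The abstract does not state the density family, so the Gaussian form, the function name `condition_gaussian`, and the variable layout (observation `x` stacked on top of interference estimate `y`) are assumptions made for illustration only, not the authors' implementation. Under that assumption, a joint density trained over `[x, y]` yields the conditional density of `x` given `y` through the standard Gaussian conditioning formulas:

```python
import numpy as np


def condition_gaussian(mean, cov, dim_x, y):
    """Return mean and covariance of p(x | y) for a joint Gaussian over [x, y].

    Assumption for illustration: the first dim_x entries of `mean` belong to
    the observed features x, the remaining entries to the interference
    estimate y.
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    y = np.asarray(y, dtype=float)

    # Partition the joint parameters into x and y blocks.
    mu_x, mu_y = mean[:dim_x], mean[dim_x:]
    s_xx = cov[:dim_x, :dim_x]
    s_xy = cov[:dim_x, dim_x:]
    s_yy = cov[dim_x:, dim_x:]

    # gain = S_xy S_yy^{-1}, computed with a linear solve instead of an
    # explicit inverse (S_yy is symmetric).
    gain = np.linalg.solve(s_yy, s_xy.T).T

    cond_mean = mu_x + gain @ (y - mu_y)
    cond_cov = s_xx - gain @ s_xy.T
    return cond_mean, cond_cov


# Toy example: one-dimensional observation and one-dimensional estimate.
toy_mean = [0.0, 0.0]
toy_cov = [[2.0, 1.0],
           [1.0, 1.0]]
cond_mean, cond_cov = condition_gaussian(toy_mean, toy_cov, dim_x=1, y=[1.0])
```

In this toy case the conditional mean shifts toward the value implied by the interference estimate and the conditional variance shrinks relative to the marginal variance of `x`, which is the sense in which the estimate informs the emission density at decoding time.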


doi: 10.21437/Interspeech.2013-265

Cite as: Sehr, A., Yoshioka, T., Delcroix, M., Kinoshita, K., Nakatani, T., Maas, R., Kellermann, W. (2013) Conditional emission densities for combining speech enhancement and recognition systems. Proc. Interspeech 2013, 3502-3506, doi: 10.21437/Interspeech.2013-265

@inproceedings{sehr13_interspeech,
  author={Armin Sehr and Takuya Yoshioka and Marc Delcroix and Keisuke Kinoshita and Tomohiro Nakatani and Roland Maas and Walter Kellermann},
  title={{Conditional emission densities for combining speech enhancement and recognition systems}},
  year=2013,
  booktitle={Proc. Interspeech 2013},
  pages={3502--3506},
  doi={10.21437/Interspeech.2013-265}
}