Optimization of Speech Enhancement Front-End with Speech Recognition-Level Criterion

Takuya Higuchi, Takuya Yoshioka, Tomohiro Nakatani

This paper concerns the use of speech enhancement to improve automatic speech recognition (ASR) performance in noisy environments. Speech enhancement systems are usually designed separately from a back-end recognizer by optimizing the front-end parameters with signal-level criteria. Such a disjoint processing approach is not always useful for ASR. Indeed, time-frequency masking, which is widely used in the speech enhancement community, sometimes degrades ASR performance because of the artifacts created by masking. This paper proposes a speech recognition-oriented front-end approach that optimizes the front-end parameters with an ASR-level criterion, where we use a complex Gaussian mixture model (CGMM) for mask estimation. First, the process of CGMM-based time-frequency masking is reformulated as a computation network. By connecting this CGMM network to the input layer of the acoustic model, the CGMM parameters can be optimized for each test utterance by backpropagation using an unsupervised acoustic model adaptation scheme. Experimental results show that the proposed method achieves a 7.7% relative improvement in word error rate on the CHiME-3 evaluation set.
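As a rough illustration of what CGMM-based mask estimation involves, the following sketch computes posterior masks for a single frequency bin. It assumes a common zero-mean formulation in which each class (for example, noisy speech and noise) has a spatial covariance matrix scaled by a time-varying variance; the abstract does not spell out the model, so this parameterization and the names `cgmm_posteriors`, `Y`, `R`, `phi`, and `alpha` are assumptions made for illustration. It covers only the posterior computation, not the network reformulation or the backpropagation-based optimization that the paper proposes.

```python
import numpy as np


def cgmm_posteriors(Y, R, phi, alpha):
    """Posterior masks of a zero-mean complex Gaussian mixture (one frequency bin).

    Assumed model (not taken from the abstract): y_t ~ N_c(0, phi[t, k] * R[k]).

    Y:     (T, M) complex observations, T frames and M microphones
    R:     (K, M, M) Hermitian positive-definite spatial covariance per class
    phi:   (T, K) positive time-varying variances
    alpha: (K,) mixture weights summing to 1
    Returns a (T, K) array whose rows sum to 1.
    """
    T, M = Y.shape
    K = R.shape[0]
    log_post = np.empty((T, K))
    for k in range(K):
        R_inv = np.linalg.inv(R[k])
        # log|det R_k|; equals log det R_k for a Hermitian positive-definite matrix
        _, logdet = np.linalg.slogdet(R[k])
        # Quadratic form y_t^H R_k^{-1} y_t for every frame
        quad = np.einsum("tm,mn,tn->t", Y.conj(), R_inv, Y).real
        # Log of alpha_k * N_c(y_t; 0, phi_tk * R_k)
        log_post[:, k] = (
            np.log(alpha[k])
            - M * np.log(np.pi * phi[:, k])
            - logdet
            - quad / phi[:, k]
        )
    # Normalize in the log domain for numerical stability
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

Applied to every frequency bin, the column for the speech class would serve as a time-frequency mask. Every step here is a differentiable function of `R`, `phi`, and `alpha`, which is the property that makes it possible to treat such a computation as a network and tune its parameters by gradient-based methods.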

DOI: 10.21437/Interspeech.2016-681

Cite as

Higuchi, T., Yoshioka, T., Nakatani, T. (2016) Optimization of Speech Enhancement Front-End with Speech Recognition-Level Criterion. Proc. Interspeech 2016, 3808-3812.

author={Takuya Higuchi and Takuya Yoshioka and Tomohiro Nakatani},
title={Optimization of Speech Enhancement Front-End with Speech Recognition-Level Criterion},
booktitle={Interspeech 2016},