A speech signal captured by a distant microphone is generally contaminated by background noise, which severely degrades the audible quality and intelligibility of the observed signal. To resolve this issue, speech enhancement has been intensively studied. In this paper, we consider a text-informed speech enhancement, where the enhancement process is guided by the corresponding text information, i.e., a correct transcription of the target utterance. The proposed deep neural network (DNN)-based framework is motivated by the recent success in the text-to-speech (TTS) research in employing DNN as well as high audible-quality output signal of the corpus-based speech enhancement which borrows knowledge from the TTS research field. Taking advantage of the nature of DNN that allows us to utilize disparate features in an inference stage, the proposed method infers the clean speech features by jointly using the observed signal and widely-used TTS features derived from the corresponding text. In this paper, we first introduce the background and the details of the proposed method. Then, we show how the text information can be naturally integrated into speech enhancement by utilizing DNN and improve the enhancement performance.
Bibliographic reference. Kinoshita, Keisuke / Delcroix, Marc / Ogawa, Atsunori / Nakatani, Tomohiro (2015): "Text-informed speech enhancement with deep neural networks", In INTERSPEECH-2015, 1760-1764.