Distant speech recognition (DSR) remains an open challenge, even for state-of-the-art deep neural network (DNN) models. Previous work has attempted to improve DNNs under a constant speaker-microphone distance (SMD). In real applications, however, the SMD can be quite dynamic, varying even within a single utterance. This paper investigates how to alleviate the impact of dynamic SMD on DNN models. Our solution is to incorporate frame-level SMD information into DNN training. Generation of the SMD information relies on a universal extractor trained on a meeting corpus, and we study the utility of different architectures for instantiating this extractor. On our target acoustic modeling task, two approaches are proposed to build distance-aware DNN models using the SMD information: simple concatenation and distance adaptive training (DAT). Our experiments show that in the simplest case, incorporating the SMD descriptors reduces the word error rate of DNNs by 5.6% relative. Further optimizing SMD extraction and integration yields additional gains.
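The "simple concatenation" approach can be sketched as follows; this is an illustrative outline only, with assumed feature dimensions (40-dim acoustic features, 2-dim SMD descriptors) and random placeholder data standing in for the paper's actual front-end and extractor outputs:

```python
import numpy as np

# Assumed shapes for illustration: 40-dim acoustic features and a
# 2-dim SMD descriptor per frame (not the paper's actual dimensions).
num_frames, feat_dim, smd_dim = 300, 40, 2
acoustic_feats = np.random.randn(num_frames, feat_dim)   # frame-level acoustic features
smd_descriptors = np.random.randn(num_frames, smd_dim)   # frame-level SMD descriptors

# Simple concatenation: append each frame's SMD descriptor to its
# acoustic feature vector, giving the DNN a distance-aware input.
dnn_input = np.concatenate([acoustic_feats, smd_descriptors], axis=1)
print(dnn_input.shape)  # (300, 42)
```

The augmented frames are then fed to the acoustic-model DNN in place of the original features; DAT, by contrast, uses the SMD information to adapt the network itself rather than only its input.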
Bibliographic reference. Miao, Yajie / Metze, Florian (2015): "Distance-aware DNNs for robust speech recognition", In INTERSPEECH-2015, 761-765.