ISCA Workshop on
This paper examines the problem of distant microphone speech recognition in noisy indoor home environments. The noise background can be roughly characterised as a slowly varying noise floor in which is embedded a mixture of energetic but unpredictable acoustic events. Our solution to the problem combines two complementary techniques. First, a soft missing data mask is formed that estimates the degree to which energetic acoustic events are masked by the noise floor. This step relies on a simple adaptive noise model. Second, a fragment decoding system attempts to interpret the energetic regions that are not accounted for by the noise floor model. This component uses models of the target speech to decide whether fragments (time-frequency regions dominated by a single sound source) should be included in the target speech stream. The combined approach performs modestly better than speech fragment decoding without an adaptive noise floor. Our experiments also show that speech fragment decoding performs far better than soft missing data decoding in variable noise, achieving 73% keyword recognition accuracy at -6 dB SNR on the Grid corpus task and substantially outperforming multicondition training.
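The first step, a soft mask driven by an adaptive noise floor, can be illustrated with a minimal sketch. This is not the authors' implementation: the sigmoid mapping, the recursive floor update, the function name `soft_mask` and all parameter values (`alpha`, `slope`, `offset_db`, `init_frames`) are assumptions chosen only to show the general idea of scoring each time-frequency point by how far it rises above a slowly tracked noise floor.

```python
import math

def soft_mask(spec, alpha=0.95, slope=0.5, offset_db=3.0, init_frames=2):
    """Illustrative soft mask from an adaptive noise floor (hypothetical sketch).

    spec: list of frames, each a list of per-channel energies in dB.
    Returns values in (0, 1); values near 1 mark points that stand well
    above the current noise floor estimate.
    """
    n_chan = len(spec[0])
    head = spec[:init_frames]
    # Assumption: the first few frames contain only background noise.
    floor = [sum(f[c] for f in head) / len(head) for c in range(n_chan)]
    mask = []
    for frame in spec:
        row = []
        for c, energy in enumerate(frame):
            excess = energy - floor[c]  # dB above the noise floor estimate
            # Assumed mapping: sigmoid of the excess, centred at offset_db.
            m = 1.0 / (1.0 + math.exp(-slope * (excess - offset_db)))
            row.append(m)
            # Adapt the floor slowly, and only where the point looks like noise,
            # so that energetic events do not drag the floor upwards.
            if m < 0.5:
                floor[c] = alpha * floor[c] + (1.0 - alpha) * energy
        mask.append(row)
    return mask

# Toy input: two channels at a 10 dB floor, with one 40 dB event in channel 1.
frames = [[10.0, 10.0], [10.0, 10.0], [10.0, 40.0], [10.0, 10.0]]
mask = soft_mask(frames)
```

In this toy input the mask value for the 40 dB event is close to 1, while points at the floor level receive low values. In the approach described above, the regions that rise above the noise floor would then be passed to the fragment decoder, which is not sketched here.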
Index Terms: Noise robust speech recognition; Fragment decoding; Missing data; Reverberation
Bibliographic reference. Ma, Ning / Barker, Jon / Christensen, Heidi / Green, Phil (2010): "Distant microphone speech recognition in a noisy indoor environment: combining soft missing data and speech fragment decoding", In SAPA-2010, 19-24.