Interspeech'2005 - Eurospeech
The 'missing data' approach for robust speech recognition uses masks indicating which regions of an acoustic mixture provide reliable evidence of the target to be recognised. Binaural cues for spatial location were used to determine missing data masks for signals consisting of utterances from three concurrent male speakers in reverberant conditions, by deriving probability distributions from estimates of interaural time and level differences (ITD and ILD) for the mixed signals. In such a system, a decision must be made about whether the acoustic features used for decoding are selected from the left or right ear, or a combination of the two. Here, features were selected from the "better ear" (as determined by a simple heuristic) within whole time frames, or within individual time-frequency elements. A combination of left and right ear features gave better recognition performance than using either ear alone, and the best results were obtained when selecting features within individual time-frequency elements.
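The difference between the two selection granularities can be illustrated with a minimal sketch. The abstract does not state which heuristic defines the "better ear", so the rule used here (prefer the ear with the higher feature energy) is purely an assumption, as are the function names and the toy feature values. The missing data mask itself is not modelled.

```python
def select_per_frame(left, right):
    """For each time frame, take the whole frame from the ear whose
    summed energy is larger (assumed heuristic, not from the paper)."""
    return [list(lf) if sum(lf) >= sum(rf) else list(rf)
            for lf, rf in zip(left, right)]


def select_per_element(left, right):
    """For each time-frequency element, take the larger of the two ear
    values (same assumed heuristic, applied at a finer granularity)."""
    return [[max(l, r) for l, r in zip(lf, rf)]
            for lf, rf in zip(left, right)]


# Toy spectral features: 2 time frames x 3 frequency channels per ear.
left = [[1.0, 5.0, 2.0], [4.0, 1.0, 1.0]]
right = [[3.0, 2.0, 2.0], [1.0, 2.0, 6.0]]

per_frame = select_per_frame(left, right)
per_element = select_per_element(left, right)
```

With per-frame selection, every channel in a frame comes from the same ear; with per-element selection, a single frame may mix channels from both ears, which is the finer-grained variant the abstract reports as performing best.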
Bibliographic reference. Harding, Sue / Barker, Jon / Brown, Guy J. (2005): "Binaural feature selection for missing data speech recognition", In INTERSPEECH-2005, 1269-1272.