This paper proposes a novel speech-fragment-based approach for processing binaural data to improve the estimation of speech source locations in reverberant, multi-speaker recordings. The technique employs two stages. First, a robust multi-pitch tracking algorithm is used to locate local spectro-temporal ‘speech fragments’: regions where the energy in the mixture is dominated by a single speech source. Second, robust localisation estimates are formed by integrating interaural time difference cues over each speech fragment. The technique is applied to the analysis of more than five hours of two-party meetings that have been constructed from a mixture of binaural mannequin recordings. It is shown that estimating location at the speech fragment level produces better results than conventional location-estimate smoothing techniques, leading to an increase in relative frame accuracy rate of more than 35%.
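The second stage can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an interaural cross-correlation function has already been computed for every time-frequency cell and that a fragment is supplied as a boolean mask over those cells. The function names `fragment_itd` and `framewise_itd`, the array layout, and the choice of simple summation as the integration rule are all hypothetical.

```python
import numpy as np


def fragment_itd(cross_corr, fragment_mask, lags):
    """Estimate one ITD for a fragment by pooling evidence over its cells.

    cross_corr    : float array, shape (frames, channels, n_lags); assumed
                    per-cell interaural cross-correlation (hypothetical layout).
    fragment_mask : bool array, shape (frames, channels); True where the
                    cell belongs to the fragment.
    lags          : float array, shape (n_lags,); lag values in seconds.
    """
    if not fragment_mask.any():
        raise ValueError("fragment_mask selects no time-frequency cells")
    # Boolean indexing keeps only the fragment's cells: (n_cells, n_lags).
    pooled = cross_corr[fragment_mask].sum(axis=0)
    # The lag with the most pooled evidence is the fragment-level estimate.
    return lags[np.argmax(pooled)]


def framewise_itd(cross_corr, lags):
    """Baseline for comparison: one ITD per frame, summed over all channels."""
    summary = cross_corr.sum(axis=1)  # (frames, n_lags)
    return lags[np.argmax(summary, axis=1)]
```

Pooling over a whole fragment means every cell attributed to the same source contributes to a single estimate, whereas the frame-wise baseline mixes evidence from all sources active in a frame.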
Bibliographic reference. Christensen, Heidi / Ma, Ning / Wrigley, Stuart N. / Barker, Jon (2007): "Integrating pitch and localisation cues at a speech fragment level", In INTERSPEECH-2007, 2769-2772.