Talker localization is indispensable in video conferencing. Statistical audio-visual (AV) talker localizers, which fuse AV features according to prior statistical properties, are ideal. However, these statistical properties must be estimated before AV feature fusion can take place. To overcome this problem, this paper proposes a novel, robust, omnidirectional AV talker localizer that dynamically fuses AV features based on validity and reliability criteria, eliminating the need for prior statistical properties. AV features are extracted from the captured AV signals by estimating the direction of arriving speech with an equilateral triangular microphone array and by detecting human positions with an omnidirectional video camera. The validity criterion, called the audio- or visual-localization counter, validates both features. The reliability criterion, called the evaluator of directional-speech arriving, acts as the weight for dynamic AV feature fusion. Talker localization experiments in an actual office room confirmed that the proposed AV localizer based on dynamic feature fusion outperforms conventional localizers that use either audio or visual features alone.
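As a rough illustration of the idea of validity-gated, reliability-weighted fusion, the following minimal sketch combines one audio and one visual direction estimate. It is not the authors' formulation: the function name `fuse_directions`, the boolean validity flags, the single weight `audio_weight` in [0, 1], and the use of a weighted circular mean are all assumptions made for this example, since the abstract does not specify how the counter or the evaluator is computed.

```python
import math
from typing import Optional


def fuse_directions(audio_deg: Optional[float],
                    visual_deg: Optional[float],
                    audio_valid: bool,
                    visual_valid: bool,
                    audio_weight: float) -> Optional[float]:
    """Fuse two direction estimates (degrees) into one.

    Hypothetical stand-ins for the abstract's criteria:
    - audio_valid / visual_valid play the role of the validity criterion.
    - audio_weight plays the role of the reliability criterion; the visual
      estimate receives the complementary weight (1 - audio_weight).
    """
    use_audio = audio_valid and audio_deg is not None
    use_visual = visual_valid and visual_deg is not None

    # Validity gating: fall back to whichever single feature is usable.
    if not use_audio and not use_visual:
        return None
    if not use_visual:
        return audio_deg % 360.0
    if not use_audio:
        return visual_deg % 360.0

    # Reliability weighting: weighted circular mean, so that estimates on
    # either side of 0/360 degrees average correctly.
    w = min(max(audio_weight, 0.0), 1.0)
    a, v = math.radians(audio_deg), math.radians(visual_deg)
    x = w * math.cos(a) + (1.0 - w) * math.cos(v)
    y = w * math.sin(a) + (1.0 - w) * math.sin(v)
    if math.hypot(x, y) < 1e-12:
        return None  # estimates cancel out; no meaningful direction
    angle = math.degrees(math.atan2(y, x)) % 360.0
    return 0.0 if angle >= 360.0 else angle
```

With both features valid, the weight shifts the result toward the more reliable estimate; with only one valid feature, that estimate is returned unchanged, which mirrors the single-modality localizers used as the baseline.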
Bibliographic reference. Denda, Yuki / Nishiura, Takanobu / Yamashita, Yoichi (2007): "Omnidirectional audio-visual talker localizer with dynamic feature fusion based on validity and reliability criteria", In INTERSPEECH-2007, 726-729.