Auditory-Visual Speech Processing (AVSP) 2013

Annecy, France
August 29 - September 1, 2013

Speaker Separation using Visually-Derived Binary Masks

Faheem Khan, Ben Milner

School of Computing Sciences, University of East Anglia, Norwich, UK

This paper is concerned with the problem of single-channel speaker separation and exploits visual speech information to aid the separation process. Audio from a mixture of speakers is received from a single microphone and, to supplement this, video of each speaker in the mixture is also captured. The visual features are used to create a time-frequency binary mask that identifies regions where the target speaker dominates. These regions are retained and form the estimate of the target speaker's speech. Experimental results compare the visually-derived binary masks with ideal binary masks and show a useful level of accuracy. The effectiveness of the visually-derived binary mask for speaker separation is then evaluated through measures of speech quality and speech intelligibility, which show substantial gains over the original mixture.
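The masking scheme the abstract describes can be sketched in a few lines: transform the mixture to the time-frequency domain, keep only the cells where the target speaker dominates, and resynthesise. The sketch below uses the ideal binary mask (the paper's reference baseline, computed from the true sources) rather than the visually-derived estimate, and substitutes two synthetic tones for real speech; frame and hop sizes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def stft(x, frame=256, hop=128):
    # Hann-windowed framewise FFT (rows = frames, cols = frequency bins)
    w = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    return np.stack([np.fft.rfft(w * x[i*hop:i*hop+frame]) for i in range(n)])

def istft(X, frame=256, hop=128):
    # Weighted overlap-add reconstruction with window-power normalisation
    w = np.hanning(frame)
    out = np.zeros(frame + (X.shape[0] - 1) * hop)
    norm = np.zeros_like(out)
    for i in range(X.shape[0]):
        out[i*hop:i*hop+frame] += w * np.fft.irfft(X[i], frame)
        norm[i*hop:i*hop+frame] += w ** 2
    return out / np.maximum(norm, 1e-8)

# Two synthetic "speakers": spectrally disjoint tones stand in for speech
fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)
masker = 0.8 * np.sin(2 * np.pi * 1850 * t)
mixture = target + masker

# Ideal binary mask: 1 in time-frequency cells where the target dominates
T, M, X = stft(target), stft(masker), stft(mixture)
ibm = (np.abs(T) > np.abs(M)).astype(float)

# Retain the mask-selected cells of the mixture and resynthesise
estimate = istft(ibm * X)
```

In the paper itself the mask is not computed from the true sources but estimated from each speaker's visual features; the application of the mask to the mixture spectrogram, however, follows this same retain-or-discard pattern.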

Index Terms: Speaker separation, binary masks, visual features, audio-visual correlation


Bibliographic reference: Khan, Faheem / Milner, Ben (2013): "Speaker separation using visually-derived binary masks", in AVSP-2013, 215-220.