ISCA Archive AVSP 2013

Speaker separation using visually-derived binary masks

Faheem Khan, Ben Milner

This paper is concerned with the problem of single-channel speaker separation and exploits visual speech information to aid the separation process. Audio from a mixture of speakers is received from a single microphone and, to supplement this, video of each speaker in the mixture is also captured. The visual features are used to create a time-frequency binary mask that identifies regions where the target speaker dominates. These regions are retained and form the estimate of the target speaker’s speech. Experimental results compare the visually-derived binary masks with ideal binary masks and show a useful level of accuracy. The effectiveness of the visually-derived binary mask for speaker separation is then evaluated through estimates of speech quality and speech intelligibility, showing substantial gains over the original mixture.
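The core time-frequency binary masking idea described above can be sketched as follows. This is a minimal illustration only, using synthetic sinusoidal "speakers", a bare-bones STFT, and the ideal binary mask (computed from the known sources); the paper's contribution is estimating such a mask from visual features instead, which is not shown here.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    # Simple windowed-frame STFT (illustrative; not the paper's front end)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# Two synthetic "speakers": sinusoids at different frequencies
fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)
interferer = np.sin(2 * np.pi * 1320 * t)
mixture = target + interferer          # single-channel mixture

S_target = stft(target)
S_interf = stft(interferer)
S_mix = stft(mixture)

# Ideal binary mask: 1 in time-frequency cells where the target dominates
ibm = (np.abs(S_target) > np.abs(S_interf)).astype(float)

# Retain only target-dominated cells of the mixture spectrogram;
# this masked spectrogram is the estimate of the target speaker's speech
S_est = ibm * S_mix
```

Inverting `S_est` with an overlap-add inverse STFT would then give the separated time-domain signal.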

Index Terms: Speaker separation, binary masks, visual features, audio-visual correlation

Cite as: Khan, F., Milner, B. (2013) Speaker separation using visually-derived binary masks. Proc. Auditory-Visual Speech Processing, 215-220

@inproceedings{khan13_avsp,
  author={Faheem Khan and Ben Milner},
  title={{Speaker separation using visually-derived binary masks}},
  year=2013,
  booktitle={Proc. Auditory-Visual Speech Processing},
  pages={215--220}
}