Speaker-Corrupted Embeddings for Online Speaker Diarization

Omid Ghahabi, Volker Fischer

Speaker diarization is more challenging in the presence of background noise or music, frequent speaker changes, and crosstalk. In an online scenario, the decision must be made at the current time, given only the current short segment and the speakers detected in the past, which makes the task even harder. In this work, a robust online speaker diarization algorithm is proposed in which speech segments are represented by low-dimensional vectors referred to as speaker-corrupted embeddings. The proposed speaker embedding network is a deep neural network that takes speaker-corrupted supervectors as input, uses variable ReLU (VReLU) as its activation function, and is trained to discriminate the background speakers. Speaker corruption is performed by adding supervectors built from 20 speech frames of other speakers to the supervectors of a given speaker. It is shown that speaker corruption, VReLU, and input dropout increase the generalization power of the proposed network. To increase robustness, the proposed embeddings are concatenated with LDA-transformed supervectors. Experimental results on the Albayzin 2018 evaluation set show competitive accuracy, greater robustness, and much lower computational cost compared to typical offline algorithms.
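
The speaker-corruption step described above can be illustrated with a minimal sketch. It assumes that supervectors are already available as equal-length lists of floats, that corruption is a plain element-wise sum, and that the corrupted example keeps the label of the given speaker; the abstract does not state these details, and the function names `corrupt_supervector` and `make_corrupted_example` are hypothetical.

```python
import random


def corrupt_supervector(target_sv, other_sv):
    """Add another speaker's supervector to the target speaker's supervector.

    Assumption: corruption is an element-wise sum of equal-length vectors.
    """
    if len(target_sv) != len(other_sv):
        raise ValueError("supervectors must have the same dimension")
    return [t + o for t, o in zip(target_sv, other_sv)]


def make_corrupted_example(target_sv, target_label, other_svs, rng):
    """Build one training example from a target supervector.

    other_svs stands in for supervectors built from short (e.g. 20-frame)
    segments of other speakers. Assumption: the label stays that of the
    target speaker.
    """
    other_sv = rng.choice(other_svs)
    return corrupt_supervector(target_sv, other_sv), target_label


# Toy values only; real supervectors are much higher dimensional.
rng = random.Random(0)
target = [1.0, 2.0, -0.5]
others = [[0.5, 0.25, 0.0], [-0.25, 0.0, 0.5]]
corrupted, label = make_corrupted_example(target, "spk_A", others, rng)
```

In this sketch the corrupted vector would then be fed to the embedding network in place of the clean supervector; the network itself, VReLU, and the LDA concatenation are outside its scope.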

DOI: 10.21437/Interspeech.2019-2756

Cite as: Ghahabi, O., Fischer, V. (2019) Speaker-Corrupted Embeddings for Online Speaker Diarization. Proc. Interspeech 2019, 386-390, DOI: 10.21437/Interspeech.2019-2756.

  author={Omid Ghahabi and Volker Fischer},
  title={{Speaker-Corrupted Embeddings for Online Speaker Diarization}},
  booktitle={Proc. Interspeech 2019},