Triplet Network with Attention for Speaker Diarization

Huan Song, Megan Willi, Jayaraman J. Thiagarajan, Visar Berisha, Andreas Spanias


In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.
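
For readers unfamiliar with the triplet loss mentioned in the abstract, the following is a minimal sketch of its generic hinge form, not the authors' implementation. The abstract does not specify the distance function, the margin, or how triplets are sampled, so the Euclidean distance, the default margin of 1.0, and the function names `euclidean` and `triplet_loss` are all assumptions made for illustration. In the paper's setting the three vectors would be embeddings produced by the attention model; here they are plain lists of floats.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic hinge-style triplet loss.

    anchor and positive are embeddings assumed to come from the same speaker,
    negative from a different speaker. The loss is zero once the negative is
    at least `margin` farther from the anchor than the positive is.
    The margin value is an illustrative assumption, not taken from the paper.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings, chosen only to make the arithmetic easy.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
    print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
```

Minimizing this quantity over many triplets pulls same-speaker segments together and pushes different-speaker segments apart in the embedding space, which is what allows a downstream clustering step to separate speakers.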


DOI: 10.21437/Interspeech.2018-2305

Cite as: Song, H., Willi, M., Thiagarajan, J.J., Berisha, V., Spanias, A. (2018) Triplet Network with Attention for Speaker Diarization. Proc. Interspeech 2018, 3608-3612, DOI: 10.21437/Interspeech.2018-2305.


@inproceedings{Song2018,
  author={Huan Song and Megan Willi and Jayaraman J. Thiagarajan and Visar Berisha and Andreas Spanias},
  title={Triplet Network with Attention for Speaker Diarization},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3608--3612},
  doi={10.21437/Interspeech.2018-2305},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2305}
}