Since its introduction in 2019, the end-to-end neural diarization (EEND) line of work has addressed speaker diarization as a frame-wise multi-label classification problem with permutation-invariant training. Although EEND shows great promise, a few recent works have taken a step back and studied the possible combination of (local) supervised EEND diarization with (global) unsupervised clustering. Yet, these hybrid contributions did not question the original multi-label formulation. We propose to switch from multi-label classification (where any two speakers can be active at the same time) to powerset multi-class classification (where dedicated classes are assigned to pairs of overlapping speakers). Through extensive experiments on 9 different benchmarks, we show that this formulation leads to significantly better performance (mostly on overlapping speech) and robustness to domain mismatch, while eliminating the detection threshold hyperparameter that is critical to the multi-label formulation.
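To make the powerset formulation concrete, here is a minimal PyTorch sketch (not the authors' implementation) that assumes K = 3 local speakers and at most 2 simultaneous speakers, giving 1 + 3 + 3 = 7 classes (empty set, singletons, pairs). The helper names multilabel_to_powerset and powerset_to_multilabel are hypothetical; note how decoding reduces to an argmax, which is why no detection threshold is needed.

import itertools
import torch
import torch.nn.functional as F

# Assumed configuration: K local speakers, at most 2 active at once.
K, MAX_SIMULTANEOUS = 3, 2

# Enumerate powerset classes: empty set, then singletons, then pairs.
# For K = 3 this yields 1 + 3 + 3 = 7 classes.
POWERSET = [
    combo
    for size in range(MAX_SIMULTANEOUS + 1)
    for combo in itertools.combinations(range(K), size)
]

# (num_classes, K) binary matrix mapping each class to its active speakers.
CLASS_TO_LABELS = torch.zeros(len(POWERSET), K)
for idx, speakers in enumerate(POWERSET):
    CLASS_TO_LABELS[idx, list(speakers)] = 1.0


def multilabel_to_powerset(labels: torch.Tensor) -> torch.Tensor:
    """Map (frames, K) binary activity labels to (frames,) class indices."""
    # A frame belongs to the class whose speaker set matches its active speakers.
    matches = (labels.unsqueeze(1) == CLASS_TO_LABELS.unsqueeze(0)).all(dim=-1)
    return matches.float().argmax(dim=1)


def powerset_to_multilabel(logits: torch.Tensor) -> torch.Tensor:
    """Decode (frames, num_classes) logits back to (frames, K) binary labels.

    A plain argmax replaces the per-speaker detection threshold required by
    the multi-label (sigmoid) formulation.
    """
    return CLASS_TO_LABELS[logits.argmax(dim=-1)]


# Training step sketch: standard multi-class cross entropy on powerset classes.
# (Permutation-invariant training would minimize this loss over speaker
# permutations; omitted here for brevity.)
frames = 100
logits = torch.randn(frames, len(POWERSET))        # dummy model output
labels = torch.randint(0, 2, (frames, K)).float()  # dummy reference activities
labels[labels.sum(dim=1) > MAX_SIMULTANEOUS] = 0   # keep at most 2 active
loss = F.cross_entropy(logits, multilabel_to_powerset(labels))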
Cite as: Plaquet, A., Bredin, H. (2023) Powerset multi-class cross entropy loss for neural speaker diarization. Proc. INTERSPEECH 2023, 3222-3226, doi: 10.21437/Interspeech.2023-205
@inproceedings{plaquet23_interspeech,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={3222--3226},
  doi={10.21437/Interspeech.2023-205}
}