Though radio and TV broadcast are highly structured documents, state-of-the-art speaker identification algorithms do not take advantage of this information to improve prediction performance: speech turns are usually identified independently from each other, using unstructured multi-class classification approaches. In this work, we propose to address speaker identification as a sequence labeling task and use two structured prediction techniques to account for the inherent temporal structure of interactions between speakers: the first one relies on Conditional Random Field and can take into account local relations between two consecutive speech turns; the second one, based on the Searn framework, sacrifices exact inference for the sake of the expressiveness of the model and is able to incorporate rich structure information during prediction. Experiments performed on The Big Bang Theory TV series show that structured prediction techniques outperform the standard unstructured approach.
Bibliographic reference. Knyazeva, Elena / Wisniewski, Guillaume / Bredin, Hervé / Yvon, François (2015): "Structured prediction for speaker identification in TV series", In INTERSPEECH-2015, 195-199.