Text-Available Speaker Recognition System for Forensic Applications

Chengzhu Yu, Chunlei Zhang, Finnian Kelly, Abhijeet Sangwan, John H.L. Hansen


This paper examines a text-available speaker recognition approach targeting scenarios where transcripts of the test utterances are either available or obtainable through manual transcription. Forensic speaker recognition is one such application, where human supervision can be expected. In our study, we extend an existing Deep Neural Network (DNN) i-vector-based speaker recognition system to effectively incorporate the text information associated with test utterances. We first show experimentally that speaker recognition performance drops significantly if the DNN output posteriors are directly replaced with their target senones, obtained from forced alignment. This performance drop can be attributed to the fact that forced alignment selects only the single most probable senone as its output, which is not desirable in the current speaker recognition framework. To resolve this problem, we propose a posterior mapping approach in which the relationship between forced-aligned senones and their corresponding DNN posteriors is modeled. By replacing the DNN output posteriors with senone-mapped posteriors, a robust text-available speaker recognition system can be obtained in mismatched environments. Experiments using the proposed approach are performed on the Aurora-4 dataset.
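The contrast between hard senone replacement and posterior mapping can be illustrated with a small sketch. The abstract does not say how the mapping is modeled, so this example assumes one simple possibility: the mapped posterior for a senone is the average DNN posterior vector over training frames that forced alignment assigned to that senone. The function names, the toy data, and the one-hot fallback for unseen senones are all hypothetical and are not taken from the paper.

```python
import math
import random


def one_hot_posteriors(aligned, num_senones):
    """Hard replacement: each frame puts all mass on its forced-aligned senone."""
    return [[1.0 if k == s else 0.0 for k in range(num_senones)] for s in aligned]


def estimate_posterior_map(dnn_post, aligned, num_senones):
    """Assumed mapping (not from the paper): mean DNN posterior per aligned senone."""
    sums = [[0.0] * num_senones for _ in range(num_senones)]
    counts = [0] * num_senones
    for post, s in zip(dnn_post, aligned):
        counts[s] += 1
        for k, p in enumerate(post):
            sums[s][k] += p
    mapping = []
    for s in range(num_senones):
        if counts[s] > 0:
            mapping.append([v / counts[s] for v in sums[s]])
        else:
            # Hypothetical fallback: a senone never seen in training stays one-hot.
            mapping.append([1.0 if k == s else 0.0 for k in range(num_senones)])
    return mapping


def mapped_posteriors(mapping, aligned):
    """Soft replacement: each frame receives the mapped posterior of its senone."""
    return [list(mapping[s]) for s in aligned]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data standing in for DNN outputs and forced alignments (purely synthetic).
random.seed(0)
num_senones = 4
aligned = [random.randrange(num_senones) for _ in range(200)]
dnn_post = [
    softmax([random.gauss(0.0, 1.0) + (3.0 if k == s else 0.0)
             for k in range(num_senones)])
    for s in aligned
]

mapping = estimate_posterior_map(dnn_post, aligned, num_senones)
hard = one_hot_posteriors(aligned, num_senones)   # single senone per frame
soft = mapped_posteriors(mapping, aligned)        # distribution per frame
```

In an i-vector pipeline, either `hard` or `soft` would stand in for the frame-level posteriors used to accumulate sufficient statistics. Unlike the one-hot rows, the mapped rows still spread probability mass across several senones.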


DOI: 10.21437/Interspeech.2016-1520

Cite as

Yu, C., Zhang, C., Kelly, F., Sangwan, A., Hansen, J.H.L. (2016) Text-Available Speaker Recognition System for Forensic Applications. Proc. Interspeech 2016, 1844-1847.

Bibtex
@inproceedings{Yu+2016,
author={Chengzhu Yu and Chunlei Zhang and Finnian Kelly and Abhijeet Sangwan and John H.L. Hansen},
title={Text-Available Speaker Recognition System for Forensic Applications},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1520},
url={http://dx.doi.org/10.21437/Interspeech.2016-1520},
pages={1844--1847}
}