This paper investigates the use of frame alignment given by a deep neural network (DNN) for the text-constrained speaker verification task, where the lexical content of the test utterances is limited to a finite vocabulary. The DNN uses information carried by the target frame and its contextual frames to assign the target frame probabilistically to one of the phonetic states. The frame alignment is therefore more precise and less ambiguous than that generated by a Gaussian mixture model (GMM). Using the DNN alignment, we show that an i-vector can be decomposed into segments of local variability vectors, each corresponding to a monophone, where each local vector models session variability given the phonetic context. Based on the local vectors, content matching between the utterances under comparison can be accomplished in the PLDA scoring. Experiments conducted on the RSR2015 database show that the proposed phone-centric local variability vector achieves better performance than the i-vector.
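As a rough illustration of how a probabilistic frame alignment can be organized by phone, the sketch below accumulates zeroth- and first-order statistics per phonetic state from frame posteriors and then groups the states by monophone. It is a minimal sketch under stated assumptions, not the formulation used in the paper: the function names `state_stats` and `group_by_phone`, the `state_to_phone` mapping, and the random input data are all hypothetical, and the extraction of local variability vectors and the PLDA scoring are not shown.

```python
import numpy as np


def state_stats(features, posteriors):
    """Accumulate soft-alignment statistics per phonetic state.

    features:   (T, D) array of acoustic feature frames
    posteriors: (T, S) array of per-frame state posteriors (assumed DNN outputs)
    Returns zeroth-order stats N of shape (S,) and first-order stats F of shape (S, D).
    """
    N = posteriors.sum(axis=0)   # soft frame count per state
    F = posteriors.T @ features  # posterior-weighted sum of frames per state
    return N, F


def group_by_phone(N, F, state_to_phone):
    """Split the per-state statistics into one segment per monophone.

    state_to_phone: (S,) integer array giving the phone index of each state
    (a hypothetical mapping; the real one depends on the DNN's state inventory).
    """
    state_to_phone = np.asarray(state_to_phone)
    groups = {}
    for phone in np.unique(state_to_phone):
        idx = np.where(state_to_phone == phone)[0]
        groups[int(phone)] = (N[idx], F[idx])
    return groups


# Illustrative data only: 10 frames, 4-dim features, 6 states, 3 phones.
rng = np.random.default_rng(0)
features = rng.normal(size=(10, 4))
posteriors = rng.dirichlet(np.ones(6), size=10)  # each row sums to 1
state_to_phone = np.array([0, 0, 1, 1, 2, 2])

N, F = state_stats(features, posteriors)
segments = group_by_phone(N, F, state_to_phone)
```

Each entry of `segments` holds the statistics of the states belonging to one phone, which is the kind of phone-wise partition that a phone-specific local vector could be estimated from.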
Bibliographic reference. Chen, Liping / Lee, Kong Aik / Ma, Bin / Guo, Wu / Li, Haizhou / Dai, Li-Rong (2015): "Phone-centric local variability vector for text-constrained speaker verification", In INTERSPEECH-2015, 229-233.