16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Phone-Centric Local Variability Vector for Text-Constrained Speaker Verification

Liping Chen (1), Kong Aik Lee (2), Bin Ma (2), Wu Guo (1), Haizhou Li (2), Li-Rong Dai (1)

(1) USTC, China
(2) A*STAR, Singapore

This paper investigates the use of frame alignment given by a deep neural network (DNN) for the text-constrained speaker verification task, where the lexical content of the test utterances is limited to a finite vocabulary. The DNN uses the information carried by a target frame and its contextual frames to assign the target frame probabilistically to one of the phonetic states. The resulting frame alignment is therefore more precise and less ambiguous than that generated by a Gaussian mixture model (GMM). Using the DNN alignment, we show that an i-vector can be decomposed into segments of local variability vectors, each corresponding to a monophone, where each local vector models session variability given the phonetic context. Based on the local vectors, content matching between the utterances under comparison can be accomplished within probabilistic linear discriminant analysis (PLDA) scoring. Experiments conducted on the RSR2015 database show that the proposed phone-centric local variability vector outperforms the i-vector.
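The idea of grouping DNN frame alignments by monophone can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the function name `phone_centric_stats`, the state-to-phone mapping, and the toy data are all hypothetical. The sketch assumes that frame-level posteriors over phonetic states are already available and only accumulates the standard zero- and first-order statistics per state, partitioned by monophone. Extracting a local variability vector from each partition would be a further step that is not shown here.

```python
import numpy as np

def phone_centric_stats(posteriors, features, state_to_phone, n_phones):
    """Accumulate per-state statistics and group them by monophone.

    posteriors:     (T, S) frame-level posteriors over S phonetic states
                    (assumed to come from a DNN; rows sum to 1)
    features:       (T, D) acoustic feature vectors
    state_to_phone: (S,) integer monophone index of each state (hypothetical mapping)
    n_phones:       number of monophones
    """
    zero = posteriors.sum(axis=0)      # (S,) soft frame count per state
    first = posteriors.T @ features    # (S, D) posterior-weighted feature sums
    stats = {}
    for p in range(n_phones):
        idx = np.flatnonzero(state_to_phone == p)  # states belonging to phone p
        stats[p] = (zero[idx], first[idx])
    return stats

# Toy example: 4 frames, 3 states, 2-dimensional features, 2 monophones.
posteriors = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0]])
features = np.arange(8, dtype=float).reshape(4, 2)
state_to_phone = np.array([0, 0, 1])  # states 0 and 1 belong to phone 0

stats = phone_centric_stats(posteriors, features, state_to_phone, n_phones=2)
```

Each entry of `stats` holds only the statistics of the states tied to one monophone, so two utterances can be compared phone by phone rather than through a single utterance-level vector.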


Bibliographic reference.  Chen, Liping / Lee, Kong Aik / Ma, Bin / Guo, Wu / Li, Haizhou / Dai, Li-Rong (2015): "Phone-centric local variability vector for text-constrained speaker verification", In INTERSPEECH-2015, 229-233.