We describe a new approach to automatic speaker recognition based in explicit modeling of temporal contours in linguistic units (TCLU). Inspired in successful work in forensic speaker identification, we extend the approach to design a fully automatic system, with a high potential for combination with spectral systems. Using SRI's Decipher phone, word and syllabic labels, we have tested up to 468 unit-based subsystems from 6 groups of lexically-determined units, namely phones, diphones, triphones, center phone in triphones, syllables and words, subsystems being combined at the score level. Evaluating with NIST SRE04 English-only 1s1s, their hierarchical fusion gives an EER of 4.20% (minDCF=0.018) from automatic formant tracking of conversational telephone speech. Combining extremely well with a Joint Factor Analysis system (from JFA EER of 4.25% to 2.47%, minDCF from 0.020 to 0.012), extensions as more robust prosodic or spectral features are likely to further improve this approach.
Bibliographic reference. Gonzalez-Rodriguez, Joaquin (2011): "Speaker recognition using temporal contours in linguistic units: the case of formant and formant-bandwidth trajectories", In INTERSPEECH-2011, 133-136.