We propose a simple and effective strategy to cope with dataset shifts in text-dependent speaker recognition based on Joint Factor Analysis (JFA). We have previously shown how to compensate for lexical variation in text-dependent JFA by adapting the Universal Background Model (UBM) to individual passphrases. A similar type of adaptation can be used to port a JFA model trained on out-of-domain data to a given text-dependent task domain. On the RSR2015 test set we found that this type of adaptation gave essentially the same results as in-domain JFA training. To explore this idea more fully, we experimented with several types of JFA model on the CSLU speaker recognition dataset. Taking a suitably configured JFA model trained on NIST data and adapting it in the proposed way results in a 22% reduction in error rates compared with the GMM/UBM benchmark. Error rates are still much higher than those that can be achieved on the RSR2015 test set with the same strategy but cheating experiments suggest that if large amounts of in-domain training data are available, then JFA modelling is capable in principle of achieving very low error rates even on hard tasks such as CSLU.
Bibliographic reference. Kenny, Patrick / Stafylakis, Themos / Alam, M. J. / Ouellet, Pierre / Kockmann, Marcel (2014): "In-domain versus out-of-domain training for text-dependent JFA", In INTERSPEECH-2014, 1332-1336.