Out-of-Set i-Vector Selection for Open-set Language Identification

Hamid Behravan, Tomi Kinnunen, Ville Hautamäki


Current language identification (LID) systems are based on an i-vector classifier followed by a multi-class recognition back-end. Identification accuracy degrades considerably when LID systems face open-set data. In this study, we propose an approach to detecting out-of-set (OOS) data in open-set language identification. In our approach, each unlabeled i-vector in the development set is given a per-class outlier score computed using the non-parametric Kolmogorov-Smirnov (KS) test. The OOS data detected in the unlabeled development set are then used to train an additional model that represents OOS languages in the back-end. In discriminating in-set from OOS languages, the proposed approach achieves a 16% relative decrease in equal error rate (EER) compared with classical OOS detection methods. With a support vector machine (SVM) as the language back-end classifier, integrating the proposed method into the LID back-end yields a 15% relative decrease in identification cost compared with using the entire development set as OOS candidates.
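
The abstract does not spell out how the per-class outlier score is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the score for a class is the two-sample KS statistic between (a) the distances from the candidate i-vector to that class's i-vectors and (b) the pairwise distances within the class. The cosine distance, the min-over-classes rule in `oos_score`, the toy 5-dimensional "i-vectors", and all function names are assumptions made for illustration.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def cosine_distance(u, v):
    # Assumed distance measure; the abstract does not name one.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def per_class_outlier_scores(x, classes):
    """Return {label: KS statistic} for candidate vector x (hypothetical helper)."""
    scores = {}
    for label, vectors in classes.items():
        # Reference distribution: pairwise distances inside the class.
        within = [cosine_distance(vectors[i], vectors[j])
                  for i in range(len(vectors))
                  for j in range(i + 1, len(vectors))]
        # Candidate distribution: distances from x to every class member.
        to_x = [cosine_distance(x, v) for v in vectors]
        scores[label] = ks_statistic(to_x, within)
    return scores


def oos_score(x, classes):
    # Assumption: x is an OOS candidate only if it looks like an outlier
    # with respect to every in-set class, hence the minimum over classes.
    return min(per_class_outlier_scores(x, classes).values())


def noisy(center, rng, std=0.5):
    return [c + rng.gauss(0.0, std) for c in center]


# Toy data standing in for i-vectors: two in-set "languages".
rng = random.Random(0)
centers = {"lang_a": [5.0, 0, 0, 0, 0], "lang_b": [0, 5.0, 0, 0, 0]}
classes = {k: [noisy(c, rng) for _ in range(30)] for k, c in centers.items()}
near = noisy(centers["lang_a"], rng)    # resembles lang_a
far = noisy([0, 0, 5.0, 0, 0], rng)     # resembles neither class
print(oos_score(near, classes), oos_score(far, classes))
```

In this sketch, unlabeled vectors whose score exceeds a chosen threshold would be pooled as training data for the additional OOS model; the threshold itself would have to be tuned on held-out data.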


DOI: 10.21437/Odyssey.2016-44

Cite as

Behravan, H., Kinnunen, T., Hautamäki, V. (2016) Out-of-Set i-Vector Selection for Open-set Language Identification. Proc. Odyssey 2016, 303-310.

Bibtex
@inproceedings{Behravan+2016,
author={Hamid Behravan and Tomi Kinnunen and Ville Hautamäki},
title={Out-of-Set i-Vector Selection for Open-set Language Identification},
year=2016,
booktitle={Odyssey 2016},
doi={10.21437/Odyssey.2016-44},
url={http://dx.doi.org/10.21437/Odyssey.2016-44},
pages={303--310}
}