Odyssey 2010: The Speaker and Language Recognition Workshop
Brno, Czech Republic
This paper addresses the problem of estimating the language distribution of unlabeled data. We present a new algorithm that treats the outputs of an automated language identification classifier as likelihoods and iteratively applies Bayes' rule to reclassify the data, using successively improving distribution estimates as "priors". Experimental results using the MIT LL submission to the NIST LRE07 evaluation show significant improvements in the estimation of non-uniform distributions compared to a baseline counting approach. In addition, we show how to incorporate these estimated distributions into the classification task. Further experiments on the LRE07 corpus show large gains for both the detection/verification and identification tasks when only a small set of languages is actually present in the test set.
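The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes each classifier output row holds strictly positive per-language likelihoods, starts from a uniform prior, and re-estimates the prior as the average of the soft posteriors (an EM-style update chosen here for illustration; the paper's exact update may differ). The function names `count_baseline`, `estimate_distribution`, and `classify`, as well as the toy scores, are hypothetical.

```python
from typing import List, Sequence


def count_baseline(likelihoods: Sequence[Sequence[float]]) -> List[float]:
    """Baseline: label each item by its largest likelihood, then count labels."""
    num_classes = len(likelihoods[0])
    counts = [0] * num_classes
    for row in likelihoods:
        counts[max(range(num_classes), key=lambda k: row[k])] += 1
    return [c / len(likelihoods) for c in counts]


def posteriors(row: Sequence[float], prior: Sequence[float]) -> List[float]:
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    joint = [p * l for p, l in zip(prior, row)]
    norm = sum(joint)  # positive as long as the likelihoods are positive
    return [j / norm for j in joint]


def estimate_distribution(
    likelihoods: Sequence[Sequence[float]], num_iters: int = 50
) -> List[float]:
    """Iteratively reclassify the data, feeding each estimate back as the prior.

    Assumption: soft posteriors are averaged (EM-style), not hard decisions.
    """
    num_classes = len(likelihoods[0])
    prior = [1.0 / num_classes] * num_classes  # start from a uniform prior
    for _ in range(num_iters):
        totals = [0.0] * num_classes
        for row in likelihoods:
            for k, p in enumerate(posteriors(row, prior)):
                totals[k] += p
        prior = [t / len(likelihoods) for t in totals]
    return prior


def classify(row: Sequence[float], prior: Sequence[float]) -> int:
    """Identification using the estimated distribution as the prior."""
    post = posteriors(row, prior)
    return max(range(len(post)), key=lambda k: post[k])


# Toy data (made up): 3 candidate languages, only the first 2 actually present.
scores = [[0.9, 0.05, 0.05]] * 8 + [[0.05, 0.9, 0.05]] * 2
estimated = estimate_distribution(scores)
baseline = count_baseline(scores)
```

In this toy setting, a test item with ambiguous likelihoods such as `[0.4, 0.5, 0.1]` is assigned to the second language under a uniform prior but to the first language once the estimated distribution is used as the prior, which is the kind of effect the abstract describes for test sets containing only a few languages.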
Bibliographic reference. McCree, Alan (2010): "Estimating and Exploiting Language Distributions of Unlabeled Data", In Odyssey-2010, paper 036.