Odyssey 2010: The Speaker and Language Recognition Workshop

Brno, Czech Republic
28 June - 1 July 2010

Estimating and Exploiting Language Distributions of Unlabeled Data

Alan McCree (1)

(1) MIT Lincoln Laboratory

This paper addresses the problem of estimating language distributions from unlabeled data. We present a new algorithm that treats the outputs of an automatic language identification classifier as likelihoods and iteratively applies Bayes' rule to reclassify the data, using successively improved distribution estimates as "priors". Experimental results using the MIT LL submission to the NIST LRE07 evaluation show significant improvements in the estimation of non-uniform distributions compared to a baseline counting approach. In addition, we show how to incorporate these estimated distributions into the classification task. Further experiments on the LRE07 corpus show large gains for both the detection/verification and identification tasks when only a small set of languages is actually present in the test set.
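
The iterative idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the uniform starting prior, the fixed iteration count, and the choice to update the prior by averaging soft posteriors are all assumptions made for illustration. The input is assumed to be one row of strictly positive per-language likelihoods for each unlabeled utterance.

```python
from typing import List, Sequence

def count_baseline(likelihoods: Sequence[Sequence[float]]) -> List[float]:
    """Counting baseline: label each utterance with its top-scoring
    language and return the fraction of utterances per language."""
    n_lang = len(likelihoods[0])
    counts = [0] * n_lang
    for row in likelihoods:
        best = max(range(n_lang), key=lambda k: row[k])
        counts[best] += 1
    return [c / len(likelihoods) for c in counts]

def posteriors(row: Sequence[float], prior: Sequence[float]) -> List[float]:
    """Bayes' rule for one utterance: posterior is proportional to
    prior times likelihood, normalized over languages."""
    joint = [p * l for p, l in zip(prior, row)]
    norm = sum(joint)  # assumed > 0 (strictly positive likelihoods)
    return [j / norm for j in joint]

def estimate_distribution(likelihoods: Sequence[Sequence[float]],
                          n_iter: int = 50) -> List[float]:
    """Hypothetical iterative estimator: start from a uniform prior,
    reclassify every utterance with Bayes' rule, then use the average
    posterior as the next prior (an assumed update rule)."""
    n_lang = len(likelihoods[0])
    prior = [1.0 / n_lang] * n_lang
    for _ in range(n_iter):
        totals = [0.0] * n_lang
        for row in likelihoods:
            for k, post in enumerate(posteriors(row, prior)):
                totals[k] += post
        prior = [t / len(likelihoods) for t in totals]
    return prior

def identify(row: Sequence[float], prior: Sequence[float]) -> int:
    """Identification with an estimated prior: pick the language with
    the largest posterior instead of the largest raw likelihood."""
    post = posteriors(row, prior)
    return max(range(len(post)), key=lambda k: post[k])
```

Under these assumptions, an utterance whose raw likelihoods weakly favor a language that is otherwise absent from the test set can be reassigned once the estimated prior for that language shrinks, which is the kind of effect the abstract describes for test sets containing only a few languages.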


Bibliographic reference. McCree, Alan (2010): "Estimating and Exploiting Language Distributions of Unlabeled Data", in Odyssey-2010, paper 036.