Odyssey 2008: The Speaker and Language Recognition Workshop

Stellenbosch, South Africa
January 21-24, 2008

Building Language Detectors using Small Amounts of Training Data

David A. van Leeuwen (1), Niko Brümmer (2)

(1) TNO Human Factors, Soesterberg, the Netherlands
(2) Spescom Datavoice, Stellenbosch, South Africa

In this paper we present language detectors built using relatively small amounts of training data. For languages with little training data, we rely on the modelling power of a Linear Discriminant Analysis (LDA) back-end. We present experiments on NIST 2005 Language Recognition Evaluation data, where we use a jackknifing technique to remove well-trained language knowledge from the LDA back-end, using only sparse trials for training the LDA. We investigate three systems, which show different levels of loss of language detection capability. We validate the technique on an independent collection of 21 languages, where we show that with less than one hour of training data we obtain an error rate for ‘new’ languages that is only slightly over twice the error rate for languages for which the full 60 hours of CallFriend data is available.
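
The jackknifing idea can be illustrated with a minimal sketch of the data selection step only. The abstract does not give the exact procedure, so everything here is an assumption: the function name `jackknife_backend_sets`, the trial layout (a language label paired with a score vector), the language names, and the choice of random subsampling are all hypothetical. For each language in turn, the sketch builds a back-end training set in which that language is reduced to a few sparse trials while all other languages keep their full set.

```python
import random
from collections import defaultdict


def jackknife_backend_sets(trials, sparse_count, seed=0):
    """Build one back-end training set per held-out language.

    trials: list of (language, score_vector) pairs (hypothetical layout).
    sparse_count: number of trials kept for the held-out language.
    Returns a dict mapping each held-out language to its training list.
    """
    rng = random.Random(seed)

    # Group the trials by language label.
    by_language = defaultdict(list)
    for language, scores in trials:
        by_language[language].append((language, scores))

    sets = {}
    for held_out in by_language:
        subset = []
        for language, items in by_language.items():
            if language == held_out:
                # Treat this language as 'new': keep only a sparse sample.
                k = min(sparse_count, len(items))
                subset.extend(rng.sample(items, k))
            else:
                # All other languages keep their full back-end data.
                subset.extend(items)
        sets[held_out] = subset
    return sets


# Toy data: three languages with ten one-dimensional score vectors each.
toy_trials = [
    (lang, [float(i)])
    for lang in ("english", "hindi", "tamil")
    for i in range(10)
]
backend_sets = jackknife_backend_sets(toy_trials, sparse_count=2)
```

Each resulting set could then be used to train a separate back-end, so that detection performance for the held-out language reflects what is achievable with only sparse training trials.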


Bibliographic reference.  Leeuwen, David A. van / Brümmer, Niko (2008): "Building language detectors using small amounts of training data", In Odyssey-2008, paper 015.