Odyssey 2012 - The Speaker and Language Recognition Workshop
The baseline approach in building phonotactic language recognition systems is to characterize each language by a single phonotactic model generated from all the available languagespecific training data. When several data sources are available for a given target language, system performance can be improved using language source-dependent phonotactic models. In this case, the common practice is to fuse language source information (i.e., the phonotactic scores for each language/ source) early (at the input) to the backend. This paper proposes to postpone the fusion to the end (at the output) of the backend. In this case, the language recognition score can be estimated from well-calibrated language source scores.
Experiments were conducted using the NIST LRE 2007 and the NIST LRE 2009 evaluation data sets with the 30s condition. On the NIST LRE 2007 eval data, a Cavg of 0.9% is obtained for the closed-set task and 2.5% for the open-set task. Compared to the common practice of early fusion, these results represent relative improvements of 18% and 11%, for the closed-set and open-set tasks, respectively. Initial tests on the NIST LRE 2009 eval data gave no improvement on the closedset task. Moreover, the Cllr measure indicates that language recognition scores estimated by the proposed approach are better calibrated than the common practice (early fusion).
Bibliographic reference. BenZeghiba, Mohamed Faouzi / Gauvain, Jean-Luc / Lamel, Lori (2012): "Fusing language information from diverse data sources for phonotactic language recognition", In Odyssey-2012, 346-352.