Odyssey 2010: The Speaker and Language Recognition Workshop
Brno, Czech Republic
This paper confirms the large benefit of Factor Analysis over Maximum A-Posteriori adaptation for language recognition (up to 87% relative gain). We investigate ways to cope with a particularity of NIST's LRE 2009: its data contain both Conversational Telephone Speech (CTS) and telephone-bandwidth segments of radio broadcasts (Voice of America, VOA). We analyze GMM systems that use all data pooled together, systems whose eigensession matrices are estimated on a per-condition basis, and systems that use a concatenation of these matrices. Results are presented on all LRE 2009 test segments, as well as on the CTS-only and VOA-only test utterances. Because performance on all 23 languages is not straightforward to compare, owing to language-channel combinations that are missing from the training and testing data, all systems are also evaluated on the subset of 8 common languages. To address the question of whether a fusion of two channel-specific systems may be more beneficial than pooling all data, we study an oracle-based system selector. On the 8-language subset, a pure CTS system achieves a minimal average cost (min-C_avg) of 2.7% and a pure VOA system 1.9% on their respective test conditions. The fusion of these two systems reaches 2.0% min-C_avg. Our main observation is that the way the session compensation matrix is estimated has little influence, as long as its language-channel combinations cover those used for training the language models. Far more crucial is the kind of data used for model estimation.
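The idea of an oracle-based system selector can be illustrated with a minimal sketch. It assumes that the true channel of each test utterance is known and that each channel-specific system returns one score per language. The function name `oracle_select`, the data layout and the score values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def oracle_select(channel: str, scores_by_system: Dict[str, List[float]]) -> List[float]:
    """Return the language scores of the system that matches the known channel.

    Assumption: `channel` is the ground-truth label ("CTS" or "VOA"),
    which is what makes this selector an oracle.
    """
    if channel not in scores_by_system:
        raise ValueError(f"no system available for channel {channel!r}")
    return scores_by_system[channel]


# Hypothetical scores for two utterances and two target languages.
utterances = [
    {"channel": "CTS", "scores": {"CTS": [0.9, 0.1], "VOA": [0.4, 0.6]}},
    {"channel": "VOA", "scores": {"CTS": [0.3, 0.7], "VOA": [0.2, 0.8]}},
]

# For each utterance, keep only the scores of the matching system.
selected = [oracle_select(u["channel"], u["scores"]) for u in utterances]
```

Because the selector relies on the true channel label, it indicates what routing each utterance to its matching system could achieve, not what a deployed system without that label would reach.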
Bibliographic reference. Verdet, Florian / Matrouf, Driss / Bonastre, Jean-François / Hennebert, Jean (2010): "Coping with Two Different Transmission Channels in Language Recognition", In Odyssey-2010, paper 039.