In this work, we propose an end-to-end approach to the language identification (LID) problem based on Convolutional Deep Neural Networks (CDNNs). The use of CDNNs is motivated mainly by their demonstrated ability to model speech signals and by their relatively low cost, in terms of number of free parameters, compared with other deep architectures. We evaluate different configurations on a subset of 8 languages from the NIST Language Recognition Evaluation 2009 Voice of America (VOA) dataset, focusing on short test durations (segments of up to 3 seconds of speech). The proposed CDNN-based systems achieve performance comparable to our baseline i-vector system while drastically reducing the number of parameters to tune (at least 100 times fewer). We then combine these CDNN-based systems and the i-vector baseline with a simple score-level fusion. This combination outperforms our best standalone system (up to 11% relative improvement in terms of equal error rate, EER).
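As an illustration of the last two steps, the following is a minimal sketch of score-level fusion and of how a relative EER improvement is computed. The abstract does not state the fusion rule, so a weighted average of per-system scores is assumed here; the function names (`fuse_scores`, `equal_error_rate`, `relative_improvement`) are hypothetical and not taken from the paper, and the EER is approximated with a simple threshold sweep over one set of target and non-target scores.

```python
import numpy as np


def fuse_scores(system_scores, weights=None):
    """Fuse per-system score arrays with a (weighted) average.

    Assumption: the paper's "simple fusion" is modeled as an average;
    the actual rule used by the authors is not given in the abstract.
    """
    stacked = np.stack([np.asarray(s, dtype=float) for s in system_scores])
    return np.average(stacked, axis=0, weights=weights)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: fraction of target trials scoring below the threshold.
    miss = np.array([(target < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials at or above the threshold.
    false_alarm = np.array([(nontarget >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest to each other.
    idx = np.argmin(np.abs(miss - false_alarm))
    return (miss[idx] + false_alarm[idx]) / 2.0


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction with respect to the baseline."""
    return (baseline_eer - new_eer) / baseline_eer
```

Under these assumptions, the scores of each CDNN system and of the i-vector baseline would be passed to `fuse_scores`, the fused scores split into target and non-target trials for `equal_error_rate`, and the result compared with the best standalone EER through `relative_improvement`. In a multi-language setting the EER is commonly computed per language and then averaged, which this sketch leaves out.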
Bibliographic reference. Lozano-Diez, Alicia / Zazo-Candil, Ruben / Gonzalez-Dominguez, Javier / Toledano, Doroteo T. / Gonzalez-Rodriguez, Joaquin (2015): "An end-to-end approach to language identification in short utterances using convolutional neural networks", In INTERSPEECH-2015, 403-407.