We propose a framework to automatically construct a collection of high-resolution (HR) language-universal units for spoken language identification (LID). Building on the popular phone recognition language modeling (PRLM) approach to LID, we first establish a set of universal attribute recognizers (UARs) to replace phone recognizers (PRs), using manner and place of articulation as attribute units. Context-dependent (CD) attribute models are then built to achieve high-performance attribute transcription. To alleviate data sparsity in n-gram language modeling (LM) of these CD units, we propose a clustering algorithm that reduces the number of attribute units used in the LM. Tested on the 30-second task of the 2009 National Institute of Standards and Technology (NIST) Language Recognition Evaluation, using the same English Switchboard-I training data for acoustic modeling, our proposed approach achieves an equal error rate (EER) of 2.34%, representing a relative EER reduction of over 20% from the 2.88% obtained with conventional PRLM techniques. To the best of our knowledge, this is the first time a single UAR-based LID system has significantly outperformed a single PR-based system with the same set of training data from a single language.
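The PRLM-style decision rule that the abstract builds on can be illustrated with a minimal sketch: a front-end recognizer turns speech into a token sequence, one n-gram LM per target language scores that sequence, and the highest-scoring language is chosen. Everything below is an assumption made for illustration and is not taken from the paper: the toy manner-of-articulation tokens, the hypothetical languages "A" and "B", the bigram order, the add-one smoothing, and the function names `train_bigram_lm`, `score` and `identify`. The CD attribute models and the clustering algorithm are not reproduced here.

```python
import math
from collections import Counter


def train_bigram_lm(sequences, vocab):
    """Return an add-one-smoothed bigram log-probability function (illustrative)."""
    bigrams = Counter()
    contexts = Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)  # "<s>" marks the sequence start
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    v = len(vocab)

    def logprob(prev, cur):
        # Add-one smoothing keeps unseen bigrams at a non-zero probability.
        return math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))

    return logprob


def score(seq, logprob):
    """Log-likelihood of a token sequence under one language's bigram LM."""
    padded = ["<s>"] + list(seq)
    return sum(logprob(p, c) for p, c in zip(padded, padded[1:]))


def identify(seq, models):
    """Pick the language whose LM gives the sequence the highest score."""
    return max(models, key=lambda lang: score(seq, models[lang]))


# Toy attribute inventory and training transcriptions (all hypothetical).
VOCAB = ["vowel", "stop", "nasal", "fricative"]
TRAIN = {
    "A": [["stop", "vowel", "stop", "vowel"],
          ["stop", "vowel", "nasal", "vowel"]],
    "B": [["fricative", "fricative", "vowel", "nasal"],
          ["fricative", "nasal", "fricative", "vowel"]],
}
MODELS = {lang: train_bigram_lm(seqs, VOCAB) for lang, seqs in TRAIN.items()}

if __name__ == "__main__":
    print(identify(["stop", "vowel", "stop", "vowel"], MODELS))
    print(identify(["fricative", "vowel", "nasal"], MODELS))
```

In this toy setting the token inventory has only four units, so bigram counts are dense. With context-dependent units the inventory grows quickly and many n-grams are never observed, which is the data-sparsity problem the clustering step in the abstract is meant to reduce.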
Bibliographic reference. Wang, Yannan / Du, Jun / Dai, Li-Rong / Lee, Chin-Hui (2015): "High-resolution acoustic modeling and compact language modeling of language-universal speech attributes for spoken language identification", In INTERSPEECH-2015, 992-996.