Exploring Transformer-based Language Recognition using Phonotactic Information

David Romero, Luis Fernando D'Haro, Christian Salamea

This paper describes an encoder-only approach based on the Transformer architecture applied to the language recognition (LRE) task using phonotactic information. Because a single global set of phonemes is used to recognize all languages, the proposed system must cope with overlapping and frequently co-occurring phone sequences that are similar across languages. To mitigate this issue, we propose a single Transformer-based encoder trained for classification, where the attention mechanism and its ability to handle long phoneme sequences help to find discriminative sequences of phonotactic units that contribute to correctly identifying the language in short, mid-length and long audio segments. The proposed approach provides significant improvements over phonotactic-based RNN and GloVe-based i-vector architectures, with relative improvements of 5.5% and 38.5%, respectively. Our experiments were carried out using phoneme sequences obtained with the Allosaurus phoneme recognizer applied to the Kalaka-3 database. This dataset is challenging because most of the languages to be identified are similar Iberian languages (e.g. Spanish, Galician, Catalan). We report results using the Cavg metric defined for the NIST evaluations.
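
The abstract reports results in terms of Cavg without spelling the metric out. As a rough illustration only, the sketch below computes the closed-set average detection cost in the form commonly used for NIST language recognition evaluations, assuming C_miss = C_FA = 1 and P_target = 0.5. The function name `cavg`, the trial format (true language plus the set of languages the system accepted) and the toy data are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Set, Tuple


def cavg(trials: List[Tuple[str, Set[str]]], languages: List[str],
         p_target: float = 0.5, c_miss: float = 1.0, c_fa: float = 1.0) -> float:
    """Average detection cost over all target languages (closed-set form).

    trials: hypothetical format, one (true_language, accepted_languages) pair
    per test segment. Every language must have at least one segment.
    """
    n = len(languages)
    # Prior of each non-target language, shared equally among the others
    p_non = (1.0 - p_target) / (n - 1)
    total = 0.0
    for target in languages:
        target_trials = [acc for true, acc in trials if true == target]
        # Miss rate: segments of the target language that were rejected
        p_miss = sum(target not in acc for acc in target_trials) / len(target_trials)
        cost = c_miss * p_target * p_miss
        for other in languages:
            if other == target:
                continue
            other_trials = [acc for true, acc in trials if true == other]
            # False alarm rate: segments of `other` accepted as `target`
            p_fa = sum(target in acc for acc in other_trials) / len(other_trials)
            cost += c_fa * p_non * p_fa
        total += cost
    return total / n


# Toy example: the Galician segment is wrongly accepted as Spanish only.
toy_languages = ["es", "gl", "ca"]
toy_trials = [("es", {"es"}), ("gl", {"es"}), ("ca", {"ca"})]
toy_cost = cavg(toy_trials, toy_languages)
```

Lower values are better, and a system that makes no errors scores 0. Real evaluations usually report this cost separately for each segment duration.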

doi: 10.21437/IberSPEECH.2021-53

Romero, D., D'Haro, L.F., Salamea, C. (2021) Exploring Transformer-based Language Recognition using Phonotactic Information. Proc. IberSPEECH 2021, 250-254, doi: 10.21437/IberSPEECH.2021-53.