8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Language Identification Based on n-Gram Frequency Ranking

R. Cordoba, L. F. D'Haro, F. Fernandez-Martinez, J. Macias-Guarasa, J. Ferreiros

Universidad Politécnica de Madrid, Spain

We present a novel approach to language identification based on a text categorization technique, namely n-gram frequency ranking. We use a parallel phone recognizer, the same as in PPRLM, but instead of the language model we create a ranking of the most frequent n-grams, keeping only a fraction of them. We then compute the distance between the input sentence ranking and each language ranking, based on the difference in relative positions of each n-gram. The objective of this ranking is to reliably model a longer span than PPRLM, namely 5-grams instead of trigrams, because the ranking needs less training data for a reliable estimation. We demonstrate that this approach outperforms PPRLM (6% relative improvement) thanks to the inclusion of 4-grams and 5-grams in the classifier. We present two alternatives: ranking with absolute values for the number of occurrences and ranking with discriminative values (11% relative improvement).
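
The ranking-and-distance idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single phone stream rather than several parallel recognizers, uses the absolute-count ranking only (not the discriminative variant), and the function names, the `top_k` cutoff, and the fixed penalty for unseen n-grams are all hypothetical choices made for illustration.

```python
from collections import Counter


def ngram_ranking(phones, max_n=5, top_k=300):
    """Map each of the top_k most frequent n-grams (n = 1..max_n) to its rank.

    `phones` is a list of phone labels; `top_k` is an assumed cutoff that
    stands in for "keeping only a fraction" of the n-grams.
    """
    counts = Counter(
        tuple(phones[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(phones) - n + 1)
    )
    # Sort by descending count; ties are broken by the n-gram itself so the
    # ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ordered[:top_k])}


def ranking_distance(sentence_rank, language_rank, penalty=None):
    """Average difference in rank position between two rankings.

    N-grams absent from the language ranking receive a fixed penalty
    (assumed here to be the size of the language ranking).
    """
    if not sentence_rank:
        return 0.0
    if penalty is None:
        penalty = len(language_rank)
    total = sum(
        abs(rank - language_rank[gram]) if gram in language_rank else penalty
        for gram, rank in sentence_rank.items()
    )
    return total / len(sentence_rank)


def identify(phones, language_ranks, max_n=5):
    """Return the language whose ranking is closest to the input's ranking."""
    sentence_rank = ngram_ranking(phones, max_n=max_n)
    return min(
        language_ranks,
        key=lambda lang: ranking_distance(sentence_rank, language_ranks[lang]),
    )


# Toy usage with made-up phone strings standing in for recognizer output.
language_ranks = {
    "A": ngram_ranking("a b a b a b a b".split()),
    "B": ngram_ranking("c d c d c d c d".split()),
}
best = identify("a b a b".split(), language_ranks)
```

In this toy setup the input shares all of its n-grams with language "A" and none with "B", so "A" yields the smaller distance. With several parallel recognizers, one plausible extension would be to compute one distance per recognizer and combine them, but how the scores are combined is not stated in the abstract.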

Bibliographic reference.  Cordoba, R. / D'Haro, L. F. / Fernandez-Martinez, F. / Macias-Guarasa, J. / Ferreiros, J. (2007): "Language identification based on n-gram frequency ranking", In INTERSPEECH-2007, 354-357.