ISCA Archive Eurospeech 1999

Multigrams for language identification

Stefan Harbeck, Uwe Ohler

We present two new approaches to language identification. Both are based on so-called multigrams, an information-theoretic representation of observations. In the first approach, we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment a new observation into larger units (e.g., something like words) and to calculate a probability for the best segmentation. In the second approach, we build a fenon recognizer using the segments of the best segmentation of the training material as "words" in the recognition vocabulary. On the OGI test corpus and on the NIST'95 evaluation corpus, the second approach yielded significant improvements over the unsupervised codebook approach when discriminating between English and German utterances.
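The segmentation step mentioned in the abstract can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: it assumes a multigram model reduced to a table of log-probabilities for variable-length units, with units treated as independent, and the names `best_segmentation`, `unit_logprob`, `max_len` and the toy probabilities are all hypothetical.

```python
import math


def best_segmentation(seq, unit_logprob, max_len):
    """Viterbi search for the most probable split of seq into known units.

    unit_logprob maps tuples of symbols to log-probabilities (assumed model).
    Returns (list of units, log-probability), or (None, -inf) if no split exists.
    """
    n = len(seq)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of seq[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last unit
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            unit = tuple(seq[start:end])
            if unit not in unit_logprob or best[start] == -math.inf:
                continue
            score = best[start] + unit_logprob[unit]
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        return None, -math.inf
    # Follow the back-pointers to recover the units in order.
    units = []
    i = n
    while i > 0:
        units.append(tuple(seq[back[i]:i]))
        i = back[i]
    units.reverse()
    return units, best[n]


# Toy model with invented probabilities over units of length 1 and 2.
probs = {("a",): 0.2, ("b",): 0.2, ("c",): 0.1, ("a", "b"): 0.4, ("b", "c"): 0.1}
toy_model = {unit: math.log(p) for unit, p in probs.items()}

segments, logp = best_segmentation(list("abc"), toy_model, max_len=2)
print(segments, logp)
```

For language identification, one such model per language could score the same phoneme or codebook sequence, with the language whose model gives the highest best-segmentation score being chosen. That decision rule is an assumption made for illustration.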


doi: 10.21437/Eurospeech.1999-97

Cite as: Harbeck, S., Ohler, U. (1999) Multigrams for language identification. Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999), 375-378, doi: 10.21437/Eurospeech.1999-97

@inproceedings{harbeck99_eurospeech,
  author={Stefan Harbeck and Uwe Ohler},
  title={{Multigrams for language identification}},
  year=1999,
  booktitle={Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999)},
  pages={375--378},
  doi={10.21437/Eurospeech.1999-97}
}