Fourth International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2014)
St. Petersburg, Russia
With globalization, more and more words from other languages enter a language without being assimilated to its phonetic system. To build up lexical resources economically with automatic or semi-automatic methods, it is important to detect such words and treat them separately. Due to the strong increase of Anglicisms, especially in the IT domain, we developed features for their automatic detection and collected and annotated a German IT corpus to evaluate them. Furthermore, we applied our methods to Afrikaans words from the NCHLT corpus and to German words from the news domain. Combining features based on grapheme perplexity, grapheme-to-phoneme confidence, and Google hit counts, as well as spell-checker dictionary and Wiktionary lookups, reaches an F-score of 75.44%. Producing pronunciations for the words in our German IT corpus with our methods resulted in a phoneme error rate of 1.6% against reference pronunciations, while applying exclusively German grapheme-to-phoneme rules to all words yielded 5.0%.
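The grapheme-perplexity feature mentioned above can be illustrated with a minimal sketch: a character bigram model trained on native-language words assigns high per-character perplexity to words with unusual grapheme sequences, flagging likely foreign words. The toy word list, the add-alpha smoothing, and the assumed vocabulary size below are illustrative assumptions, not the paper's actual setup.

```python
import math
from collections import defaultdict

def train_char_bigrams(words):
    """Count character bigrams (with word-boundary markers) over a word list."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        chars = ["<s>"] + list(w.lower()) + ["</s>"]
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    return counts

def perplexity(word, counts, alpha=1.0, vocab_size=30):
    """Per-character perplexity under the bigram model, add-alpha smoothed.

    alpha and vocab_size are illustrative smoothing assumptions."""
    chars = ["<s>"] + list(word.lower()) + ["</s>"]
    log_prob = 0.0
    for a, b in zip(chars, chars[1:]):
        total = sum(counts[a].values())
        p = (counts[a][b] + alpha) / (total + alpha * vocab_size)
        log_prob += math.log(p)
    n = len(chars) - 1
    return math.exp(-log_prob / n)

# Toy German training words (illustrative only, not the paper's data)
german = ["haus", "schule", "sprache", "wasser", "strasse", "schlüssel"]
model = train_char_bigrams(german)

# A word with grapheme sequences unusual for German (e.g. an Anglicism)
# scores a higher perplexity than a German-looking word.
print(perplexity("schaus", model) < perplexity("download", model))  # True
```

In the detection setting, such a perplexity score would be thresholded or combined with the other features (grapheme-to-phoneme confidence, web hit counts, dictionary lookups) in a classifier.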
Index Terms: Foreign entity detection, lexical resources, pronunciation modeling, Anglicisms
Bibliographic reference. Leidig, Sebastian / Schlippe, Tim / Schultz, Tanja (2014): "Automatic detection of anglicisms for the pronunciation dictionary generation: a case study on our German IT corpus", In SLTU-2014, 207-214.