Fourth International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2014)

St. Petersburg, Russia
May 14-16, 2014

Features for Factored Language Models for Code-Switching Speech

Heike Adel (1,2), Katrin Kirchhoff (2), Dominic Telaar (1), Ngoc Thang Vu (1), Tim Schlippe (1), Tanja Schultz (1)

(1) Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany
(2) Department of Electrical Engineering, University of Washington (UW), USA

This paper investigates features that can be used to predict Code-Switching speech. For this task, factored language models are applied and integrated into a state-of-the-art decoder. Several possible factors, such as words, part-of-speech tags, Brown word clusters, open-class words, and open-class word clusters, are explored. We find that Brown word clusters, part-of-speech tags, and open-class words are most effective at reducing the perplexity of factored language models on the Mandarin-English Code-Switching corpus SEAME. In decoding experiments, the model containing Brown word clusters and part-of-speech tags and the model that additionally includes open-class word clusters yield the best mixed error rate results. In summary, the factored language models reduce the perplexity on the SEAME evaluation set by up to 10.8% relative and the mixed error rate by up to 3.4% relative.
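
For readers unfamiliar with the technique: a factored language model represents each token as a bundle of factors (e.g., word, part-of-speech tag, Brown cluster) and conditions on combinations of these factors, backing off by dropping individual factors when a context is unseen. Below is a minimal, hypothetical Python sketch of this idea for a bigram context with a single backoff path (drop the preceding word, keep its POS tag). It is an illustration only, with no smoothing or discounting, and it does not reproduce the paper's actual models, factor sets, or backoff graphs; all names in it are invented for the example.

    from collections import defaultdict

    # Sketch of a factored bigram LM (hypothetical; illustration only).
    # Each token is a factor bundle (word, POS tag); the model backs off
    # by dropping the word factor, then falls back to a uniform floor:
    #   P(w | word_-1, pos_-1)  ->  P(w | pos_-1)  ->  1 / |V|
    class FactoredBigramLM:
        def __init__(self, vocab_size):
            self.vocab_size = vocab_size
            # (prev word, prev POS) -> counts of the following word
            self.full = defaultdict(lambda: defaultdict(int))
            # prev POS only -> counts of the following word
            self.pos_only = defaultdict(lambda: defaultdict(int))

        def train(self, sentences):
            # sentences: lists of (word, pos_tag) pairs
            for sent in sentences:
                for (w1, p1), (w2, _) in zip(sent, sent[1:]):
                    self.full[(w1, p1)][w2] += 1
                    self.pos_only[p1][w2] += 1

        def prob(self, prev_word, prev_pos, word):
            # Try the full factor context first. Note: no discounting,
            # so probability mass is not properly reserved for the
            # backoff levels (sketch only).
            counts = self.full.get((prev_word, prev_pos))
            if counts and word in counts:
                return counts[word] / sum(counts.values())
            # Generalized backoff: drop the word factor, keep the POS.
            counts = self.pos_only.get(prev_pos)
            if counts and word in counts:
                return counts[word] / sum(counts.values())
            return 1.0 / self.vocab_size  # uniform floor

    # Toy usage: an unseen previous word backs off to its POS tag.
    sents = [[("I", "PRP"), ("eat", "VBP"), ("rice", "NN")],
             [("you", "PRP"), ("eat", "VBP"), ("noodles", "NN")]]
    lm = FactoredBigramLM(vocab_size=6)
    lm.train(sents)
    print(lm.prob("I", "PRP", "eat"))   # seen full context
    print(lm.prob("we", "PRP", "eat"))  # unseen word, POS backoff fires

The same factor-dropping idea generalizes to longer histories and to the cluster-based factors studied in the paper, where the choice of backoff path becomes part of the model design.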

Index Terms: language modeling, factored language models, Code-Switching speech

Bibliographic reference. Adel, Heike / Kirchhoff, Katrin / Telaar, Dominic / Vu, Ngoc Thang / Schlippe, Tim / Schultz, Tanja (2014): "Features for factored language models for code-switching speech", in SLTU-2014, 32-38.