ISCA Workshop on Multilingual Speech and Language Processing (MULTILING 2006)

Center for Language and Speech Technology, Stellenbosch University, Stellenbosch, South Africa
April 9-11, 2006

Character Stream Parsing of Mixed-lingual Text

Harald Romsdorfer, Beat Pfister

Speech Processing Group, Computer Engineering and Networks Laboratory, ETH Zurich, Switzerland

In multilingual countries, text-to-speech synthesis systems often have to deal with sentences containing inclusions from several other languages in the form of phrases, words, or even parts of words. Such sentences can only be processed correctly by a system that incorporates a mixed-lingual morphological and syntactic analyzer. A prerequisite for such an analyzer is the correct identification of word and sentence boundaries. Traditional text analysis handles both problems with simple heuristic methods in a text preprocessing step. These methods, however, are not reliable enough for analyzing mixed-lingual sentences.

This paper presents a new approach to word and sentence boundary identification for mixed-lingual sentences that is based on parsing character streams. The approach can also be used for word identification in languages without a designated word boundary symbol, such as Chinese or Japanese. To date, this mixed-lingual text analysis supports any mixture of English, French, German, Italian, and Spanish.
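
To make the idea of finding word boundaries directly in a character stream more concrete, the following is a minimal sketch and not the authors' analyzer, which the abstract describes as a morphological and syntactic parser. It assumes a tiny hypothetical lexicon (`LEXICON`) that maps surface forms to language tags, and a hypothetical function `segment` that uses dynamic programming to split an unsegmented character stream into lexicon words, preferring the segmentation with the fewest words. No whitespace or punctuation heuristics are used.

```python
# Hypothetical toy lexicon (surface form -> language tag); not taken from the paper.
LEXICON = {
    "der": "de", "manager": "en", "hat": "de",
    "das": "de", "meeting": "en", "abgesagt": "de",
}


def segment(stream, lexicon=LEXICON):
    """Split `stream` into (word, language) pairs using only the lexicon.

    Returns the segmentation with the fewest words, or None if the
    stream cannot be covered completely by lexicon entries.
    """
    n = len(stream)
    max_len = max(map(len, lexicon))
    # best[i] holds the best segmentation of stream[:i], or None if unreachable.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue
            piece = stream[start:end]
            if piece in lexicon:
                candidate = best[start] + [(piece, lexicon[piece])]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


if __name__ == "__main__":
    # A German sentence with English inclusions, written without spaces.
    for word, lang in segment("dermanagerhatdasmeetingabgesagt"):
        print(word, lang)
```

A real mixed-lingual analyzer would replace the flat lexicon lookup with grammar rules over morphemes and words, so that boundaries inside words (for example, a foreign stem with a native suffix) and sentence boundaries fall out of the same parse.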


Bibliographic reference. Romsdorfer, Harald / Pfister, Beat (2006): "Character stream parsing of mixed-lingual text", in MULTILING-2006, paper 021.