Transliteration converts words in a source language (e.g., English) into phonetically equivalent words in a target language (e.g., Vietnamese). This conversion needs to take into account the phonology of the target language, that is, the rules determining how phonemes can be organized. For example, a transliterated word in Vietnamese that begins with a consonant cluster is phonologically invalid. While statistical transliteration approaches have been widely adopted, most do not explicitly model the target language's phonology and thus produce invalid outputs. The problem is compounded for low-resource languages, where training data is scarce. In this work, we present a phonology-augmented statistical framework suitable for languages with minimal linguistic resources. We propose the concept of pseudo-syllables as structures representing how segments of a foreign word are arranged according to the target language's phonology. We use Vietnamese, a tonal language with monosyllabic structure, as an example. We show that the proposed system outperforms the statistical baseline by up to 70.3% relative when training examples are limited (94 word pairs). We also investigate the trade-off between training corpus size and transliteration performance of different methods on two distinct corpora.
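The consonant-cluster constraint mentioned above can be illustrated with a minimal sketch. This is not the pseudo-syllable model of the paper; it is a hypothetical spelling-level check, and the names `VALID_ONSETS`, `leading_consonants`, and `has_valid_onset` are assumptions introduced here for illustration. It covers only syllable onsets in standard Vietnamese orthography and ignores codas, vowel combinations, and tone.

```python
import unicodedata

# Hypothetical illustration, not the paper's method.
# Vowel letters once tone and quality diacritics are stripped.
VOWELS = set("aeiouy")

# Consonant-letter sequences that may open a Vietnamese syllable in
# standard orthography. Digraphs such as "tr" or "ngh" spell a single
# onset, not a cluster. "gi" and "qu" are covered by "g" and "q",
# because "i" and "u" are treated as vowel letters here.
VALID_ONSETS = {
    "", "b", "c", "ch", "d", "đ", "g", "gh", "h", "k", "kh", "l", "m",
    "n", "ng", "ngh", "nh", "p", "ph", "q", "r", "s", "t", "th", "tr",
    "v", "x",
}


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tones, circumflex, horn, breve)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def leading_consonants(syllable: str) -> str:
    """Return the run of consonant letters before the first vowel letter."""
    plain = strip_diacritics(syllable.lower())
    i = 0
    while i < len(plain) and plain[i] not in VOWELS:
        i += 1
    return plain[:i]


def has_valid_onset(syllable: str) -> bool:
    """True if the syllable starts with a permitted onset spelling."""
    return leading_consonants(syllable) in VALID_ONSETS


print(has_valid_onset("stop"))  # False: "st" is a consonant cluster
print(has_valid_onset("tốp"))   # True: single onset "t"
```

A candidate output such as "stop" is rejected by this check because "st" is a cluster, while "tốp" passes. A full validity model would also have to constrain the nucleus, coda, and tone of each syllable.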
Bibliographic reference. Ngo, Hoang Gia / Chen, Nancy F. / Nguyen, Binh Minh / Ma, Bin / Li, Haizhou (2015): "Phonology-augmented statistical transliteration for low-resource languages", In INTERSPEECH-2015, 3670-3674.