15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

A Minimal-Resource Transliteration Framework for Vietnamese

Hoang Gia Ngo (1), Nancy F. Chen (2), Sunil Sivadas (2), Bin Ma (2), Haizhou Li (2)

(1) National University of Singapore, Singapore
(2) A*STAR, Singapore

Transliteration converts words in a source language (e.g., English) into phonetically equivalent words in a target language (e.g., Vietnamese). It is therefore used to handle out-of-vocabulary (OOV) words adopted from foreign languages in automatic speech recognition and keyword search systems. While statistical transliteration approaches have been widely adopted, they may not always be suitable for under-resourced languages, where training data is scarce. In this work, we present a rule-based Vietnamese transliteration framework suitable for spoken language applications with minimal linguistic resources. We show that the proposed system outperforms statistical baselines by up to 81.70% relative when training examples are limited (94 word pairs). In addition, we investigate the trade-off between training corpus size and transliteration performance of different methods on two distinct corpora. We also show that the proposed model outperforms statistical baselines by up to 36.76% relative in keyword search tasks.

Bibliographic reference. Ngo, Hoang Gia / Chen, Nancy F. / Sivadas, Sunil / Ma, Bin / Li, Haizhou (2014): "A minimal-resource transliteration framework for Vietnamese", In INTERSPEECH-2014, 1410-1414.