Korean is an agglutinative and highly inflected language with severe phonological phenomena and coarticulation effects, which makes developing a large-vocabulary continuous speech recognition (LVCSR) system difficult. Choosing a Korean orthographic word-phrase (eojeol) as the basic recognition unit leads to high out-of-vocabulary (OOV) rates, whereas choosing an orthographic syllable (eumjeol) unit results in high acoustic confusability. To overcome these difficulties, we propose structuring the speech recognition task as a serial architecture composed of two independent parts. The first part performs standard hidden Markov model (HMM)-based recognition of phonemic syllable units of the actual pronunciation (surface forms). In this way, one phonemic syllable corresponds to only one possible pronunciation. Thus, the lexicon and the OOV rate can be kept small while high acoustic confusability is avoided. At this stage, the Korean orthography of the written transcription is not yet considered. In the second part, the system transforms the phonemic syllable surface forms into the desired orthographic recognition units, e.g., eumjeol or eojeol. To solve this task, a noisy-channel model is used, in which the sequence of phonemic syllables is treated as a “noisy” string and the goal is to recover the “clean” string of Korean orthography. The entire process requires no linguistic knowledge, only annotated texts. The experiments were conducted on a Korean dictation database, on which the best system achieved 91.21% eumjeol accuracy and 71.30% eojeol accuracy.
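The second stage can be illustrated with a minimal sketch of noisy-channel decoding: given a phonemic syllable sequence S, pick the orthographic sequence O that maximizes P(O) · P(S | O). The abstract does not specify the models used, so everything below is an assumption made for illustration: a bigram language model over eumjeol, a per-syllable channel model with a one-to-one alignment between phonemic and orthographic syllables, add-alpha smoothing, Viterbi search, and a hypothetical five-word training set. The function names `train`, `log_prob`, and `decode` are likewise hypothetical.

```python
import math
from collections import defaultdict

def train(pairs):
    """Count channel and bigram events from aligned (orthographic, phonemic) pairs.

    Assumption: each string is a sequence of precomposed Hangul syllables and
    the two strings in a pair align one-to-one, syllable by syllable.
    """
    channel = defaultdict(lambda: defaultdict(int))  # channel[o][s]: o pronounced as s
    bigram = defaultdict(lambda: defaultdict(int))   # bigram[prev][o]: o follows prev
    for ortho, phon in pairs:
        prev = "<s>"
        for o, s in zip(ortho, phon):
            channel[o][s] += 1
            bigram[prev][o] += 1
            prev = o
    return channel, bigram

def log_prob(table, given, event, vocab_size, alpha=0.1):
    """Add-alpha smoothed log P(event | given)."""
    row = table.get(given, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + alpha) / (total + alpha * vocab_size))

def decode(phon, channel, bigram):
    """Viterbi search for argmax over O of P(O) * P(S | O)."""
    ortho_vocab = sorted(channel)
    phon_vocab = {s for row in channel.values() for s in row}
    v_o, v_s = len(ortho_vocab), len(phon_vocab) + 1  # +1 reserves mass for unseen syllables

    def candidates(s):
        # Orthographic syllables observed with pronunciation s; fall back to s itself.
        found = [o for o in ortho_vocab if s in channel[o]]
        return found or [s]

    best = {"<s>": (0.0, [])}  # last orthographic syllable -> (log score, path)
    for s in phon:
        new_best = {}
        for o in candidates(s):
            emit = log_prob(channel, o, s, v_s)
            score, path = max(
                ((p_score + log_prob(bigram, prev, o, v_o), p_path)
                 for prev, (p_score, p_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[o] = (score + emit, path + [o])
        best = new_best
    return "".join(max(best.values(), key=lambda item: item[0])[1])

# Hypothetical toy data: (orthography, surface pronunciation) pairs.
pairs = [("국물", "궁물"), ("국가", "국까"), ("물가", "물까"),
         ("학년", "항년"), ("궁전", "궁전")]
channel, bigram = train(pairs)
print(decode("궁물", channel, bigram))
print(decode("궁전", channel, bigram))
```

In this toy setup the phonemic syllable 궁 is ambiguous between orthographic 국 and 궁, and the language model resolves it from the following syllable. A real system would need far more annotated text and possibly a many-to-many alignment.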
Bibliographic reference. Sakti, Sakriani / Isotani, Ryosuke / Kawai, Hisashi / Nakamura, Satoshi (2010): "Utilizing a noisy-channel approach for Korean LVCSR", In INTERSPEECH-2010, 1513-1516.