Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Taiwanese Corpus Collection Via Continuous Speech Recognition Tool

Yuang-Chin Chiang (2), Zhi-Siang Yang (1), Ren-Yuan Lyu (1)

(1) Dept. of Electrical Engineering, Chang Gung University, Taoyuan, Taiwan
(2) Inst. of Statistics, Tsing Hua University, Hsin-chu, Taiwan

Corpora, in their different forms for different purposes, are the basis of modern natural language processing technology. Taiwanese (MinNan), like other members of the Sino-Tibetan family, has been marginalized for many reasons. One consequence of this marginalization is that no standard written script exists, and thus collecting corpora for these languages has been extremely difficult. Even after (almost arbitrarily) selecting the hanlor written script (a mixture of hanzi and roman characters), we still face the problem that only a few people are capable of phonetically transcribing a given Taiwanese text. On the other hand, reading a Taiwanese text is easier because many commonly used hanzi exist. We record a person's reading of a Taiwanese text and use a continuous speech recognizer for Taiwanese to transcribe the text automatically, ending up with two kinds of corpora: one in text and one in speech. The accuracy of the automatic phonetic transcription is about 96.05%, measured by syllable count. For marginalized languages, this automatic transcription can be very useful for corpus collection if a proper error-spotting scheme is implemented.
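
As an illustration of what an accuracy "in syllable count" could mean, the following minimal sketch scores an automatic transcription against a manually verified reference. It assumes that accuracy is defined as one minus the syllable-level edit distance divided by the number of reference syllables; the abstract does not state the exact formula, so this definition, the function name `syllable_accuracy`, and the romanized syllables in the example are all hypothetical.

```python
def syllable_accuracy(reference, hypothesis):
    """Return 1 - (edit distance / number of reference syllables).

    Both arguments are lists of syllable strings. This definition is an
    assumption for illustration, not the scoring used in the paper.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one syllable")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hypothesis[:j].
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            substitution = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                 # syllable missing from hypothesis
                curr[j - 1] + 1,             # extra syllable in hypothesis
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return 1.0 - prev[m] / n


# Made-up syllables, for illustration only (not data from the paper).
reference = "gua2 si7 tai5 uan5 lang5".split()
hypothesis = "gua2 si7 tai5 uan7 lang5".split()
print(syllable_accuracy(reference, hypothesis))
```

In this toy case one of five syllables differs (a tone mismatch), so the score falls below 1. The same alignment could also serve as a crude starting point for error spotting, since the positions where the two sequences disagree are the ones a human checker would need to review.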

