Corpora, in their different forms for different purposes, are the basis of modern natural language processing technology. Taiwanese (MinNan), like other members of the Sino-Tibetan language family, has been marginalized for many reasons. One consequence of this marginalization is that no standard written script exists, so collecting corpora for these languages has been extremely difficult. Even after (almost arbitrarily) selecting the hanlor written script (a mixture of hanzi and roman characters), we still face the problem that only a few people are capable of phonetically transcribing a given Taiwanese text. Reading a Taiwanese text aloud, on the other hand, is easier because many commonly used hanzi exist. We record a person reading a Taiwanese text and use a continuous speech recognizer for Taiwanese to transcribe the text automatically, ending up with two kinds of corpora: one of text and one of speech. The accuracy of the automatic phonetic transcription is about 96.05%, measured by syllable count. For marginalized languages, this automatic transcription can be very useful for corpus collection if a proper error-spotting scheme is implemented.
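The abstract does not say how the syllable-level accuracy is computed. As a minimal sketch, assuming it is the usual edit-distance measure (one minus substitutions, insertions and deletions divided by the number of reference syllables), it could be calculated as below. The function names and the romanized syllables in the usage note are hypothetical illustrations, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two syllable sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def syllable_accuracy(ref, hyp):
    """Assumed metric: 1 - (edit distance / number of reference syllables)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Hypothetical reference and recognizer output, one item per syllable.
reference = ["tai5", "oan5", "oe7", "bun5"]
recognized = ["tai5", "oan5", "ue7", "bun5"]
accuracy = syllable_accuracy(reference, recognized)
```

Here one of the four syllables differs, so the sketch reports an accuracy of 0.75. An error-spotting scheme could start from the same alignment by flagging the positions where the two sequences disagree.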
Cite as: Chiang, Y.-C., Yang, Z.-S., Lyu, R.-Y. (2000) Taiwanese corpus collection via continuous speech recognition tool. Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000), vol. 2, 1031-1034, doi: 10.21437/ICSLP.2000-448
@inproceedings{chiang00_icslp, author={Yuang-Chin Chiang and Zhi-Siang Yang and Ren-Yuan Lyu}, title={{Taiwanese corpus collection via continuous speech recognition tool}}, year=2000, booktitle={Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000)}, pages={vol. 2, 1031-1034}, doi={10.21437/ICSLP.2000-448} }