8th International Conference on Spoken Language Processing

Jeju Island, Korea
October 4-8, 2004

Construct a Multi-Lingual Speech Corpus in Taiwan with Extracting Phonetically Balanced Articles

Min-siong Liang (1), Dau-cheng Lyu (1), Yuang-chin Chiang (2), Renyuan Lyu (1)

(1) Chang Gung University, Taiwan
(2) National Tsing Hua University, Taiwan

In this paper, we describe the initial stage of constructing a multilingual speech corpus in Taiwan by selecting phonetically balanced scripts. The corpus is intended to cover the three most frequently used languages in Taiwan: Taiwanese (Min-nan), Hakka, and Mandarin Chinese. The first step toward this objective is to construct a multilingual phonetic alphabet, the Formosa Phonetic Alphabet (ForPA). The multilingual lexicons (Formosa Lexicons) are also important components in building the corpus. The corpus, a speech database of 2,300 speakers, has recently been completed and is ready for release. It contains about 200 hours of speech and 404,000 utterances.

Bibliographic reference. Liang, Min-siong / Lyu, Dau-cheng / Chiang, Yuang-chin / Lyu, Renyuan (2004): "Construct a multi-lingual speech corpus in Taiwan with extracting phonetically balanced articles", in INTERSPEECH-2004, 2737-2740.