11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

SEAME: A Mandarin-English Code-Switching Speech Corpus in South-East Asia

Dau-Cheng Lyu (1), Tien-Ping Tan (2), Eng Siong Chng (1), Haizhou Li (3)

(1) Nanyang Technological University, Singapore
(2) Universiti Sains Malaysia, Malaysia
(3) A*STAR, Singapore

In Singapore and Malaysia, people often mix Mandarin and English within a single sentence, which we call an intra-sentential code-switching sentence. In this paper, we report the development of SEAME, a Mandarin-English code-switching spontaneous speech corpus. As part of a multilingual speech recognition project, the corpus is designed to support the study of how Mandarin-English code-switching occurs in spoken language in South-East Asia, and to provide insights into the development of large vocabulary continuous speech recognition (LVCSR) systems that cover code-switching speech. The corpus consists of intra-sentential code-switching utterances recorded in both interview and conversational settings. The paper describes the corpus design and an analysis of the collected corpus.

