In Singapore and Malaysia, people often mix Mandarin and English within a single sentence, a phenomenon known as intra-sentential code-switching. In this paper, we report the development of SEAME, a Mandarin-English code-switching spontaneous speech corpus. Developed as part of a multilingual speech recognition project, the corpus is designed to support the study of how Mandarin-English code-switching occurs in spoken language in South-East Asia, and to provide insights for developing large vocabulary continuous speech recognition (LVCSR) systems that can handle code-switching speech. The corpus consists of intra-sentential code-switching utterances recorded in both interview and conversational settings. The paper describes the corpus design and presents an analysis of the collected corpus.
Bibliographic reference. Lyu, Dau-Cheng / Tan, Tien-Ping / Chng, Eng Siong / Li, Haizhou (2010): "SEAME: a Mandarin-English code-switching speech corpus in South-East Asia", In INTERSPEECH-2010, 1986-1989.