Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Development of a Cantonese-English Code-Mixing Speech Corpus

Joyce Y. C. Chan, P. C. Ching, Tan Lee

Chinese University of Hong Kong, China

This paper describes the design and compilation of CUMIX, a Cantonese-English code-mixing speech corpus. Code-mixing is a common phenomenon in many bilingual societies; it usually involves at least two different languages within one utterance. In Hong Kong, people commonly mix English words and phrases into Cantonese in their daily conversation. Although many monolingual corpora of Cantonese and of English exist, no code-mixing speech database for these two languages is available. The corpus is intended to support the study of the effect of Cantonese accents on English, the design of effective language boundary detection algorithms for code-mixing utterances [1], and the evaluation of code-mixing speech recognizers.

Bibliographic reference. Chan, Joyce Y. C. / Ching, P. C. / Lee, Tan (2005): "Development of a Cantonese-English code-mixing speech corpus", in INTERSPEECH-2005, 1533-1536.