ISCA Archive Interspeech 2005

Development of a Cantonese-English code-mixing speech corpus

Joyce Y. C. Chan, P. C. Ching, Tan Lee

This paper describes the design and compilation of the CUMIX Cantonese-English code-mixing speech corpus. Code-mixing is a common phenomenon in many bilingual societies and usually involves at least two languages within one utterance. In Hong Kong, people often mix English words and phrases into Cantonese in daily conversation. Although there are many monolingual corpora of Cantonese and English, no code-mixing speech database for these two languages is available. The corpus is intended to support the study of the effect of Cantonese accents in English, the design of effective language boundary detection algorithms for code-mixing utterances [1], and the evaluation of code-mixing speech recognizers.

doi: 10.21437/Interspeech.2005-450

Cite as: Chan, J.Y.C., Ching, P.C., Lee, T. (2005) Development of a Cantonese-English code-mixing speech corpus. Proc. Interspeech 2005, 1533-1536, doi: 10.21437/Interspeech.2005-450

  author={Joyce Y. C. Chan and P. C. Ching and Tan Lee},
  title={{Development of a Cantonese-English code-mixing speech corpus}},
  booktitle={Proc. Interspeech 2005},