We present a database of code-switched conversational human–machine dialog in English–Hindi and English–Spanish. We leveraged HALEF, an open-source, standards-compliant, cloud-based dialog system, to capture audio and video of bilingual crowd workers as they interacted with the system. We designed conversational items with intra-sentential code-switched machine prompts and examined their efficacy in eliciting code-switched speech in a total of over 700 dialogs. We analyze various characteristics of the code-switched corpus and discuss considerations that should be taken into account when collecting and processing such data. Such a database can support a wide range of applications, including automated processing, recognition, and understanding of code-switched speech, as well as language learning applications for new language learners.
Cite as: Ramanarayanan, V., Suendermann-Oeft, D. (2017) Jee haan, I’d like both, por favor: Elicitation of a Code-Switched Corpus of Hindi–English and Spanish–English Human–Machine Dialog. Proc. Interspeech 2017, 47-51, doi: 10.21437/Interspeech.2017-1198
@inproceedings{ramanarayanan17_interspeech,
  author={Vikram Ramanarayanan and David Suendermann-Oeft},
  title={{Jee haan, I’d like both, por favor: Elicitation of a Code-Switched Corpus of Hindi–English and Spanish–English Human–Machine Dialog}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={47--51},
  doi={10.21437/Interspeech.2017-1198}
}