Jee haan, I’d like both, por favor: Elicitation of a Code-Switched Corpus of Hindi–English and Spanish–English Human–Machine Dialog

Vikram Ramanarayanan, David Suendermann-Oeft


We present a database of code-switched conversational human–machine dialog in English–Hindi and English–Spanish. We leveraged HALEF, an open-source, standards-compliant, cloud-based dialog system, to capture audio and video of bilingual crowd workers as they interacted with the system. We designed conversational items with intra-sentential code-switched machine prompts and examined their efficacy in eliciting code-switched speech in a total of over 700 dialogs. We analyze various characteristics of the code-switched corpus and discuss some considerations that should be taken into account while collecting and processing such data. Such a database can be leveraged for a wide range of potential applications, including automated processing, recognition, and understanding of code-switched speech, as well as language learning applications for new language learners.


DOI: 10.21437/Interspeech.2017-1198

Cite as: Ramanarayanan, V., Suendermann-Oeft, D. (2017) Jee haan, I’d like both, por favor: Elicitation of a Code-Switched Corpus of Hindi–English and Spanish–English Human–Machine Dialog. Proc. Interspeech 2017, 47-51, DOI: 10.21437/Interspeech.2017-1198.


@inproceedings{Ramanarayanan2017,
  author={Vikram Ramanarayanan and David Suendermann-Oeft},
  title={Jee haan, I’d like both, por favor: Elicitation of a Code-Switched Corpus of Hindi–English and Spanish–English Human–Machine Dialog},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={47--51},
  doi={10.21437/Interspeech.2017-1198},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1198}
}