We explore a new approach to collecting and transcribing speech data by using online educational games. One such game, Voice Race, elicited over 55,000 utterances over a 22-day period, representing 18.7 hours of speech. Voice Race was designed so that the transcripts for a significant subset of utterances can be automatically inferred using the contextual constraints of the game. Game context can also be used to simplify transcription to a multiple-choice task, which can be performed by non-experts. We found that one third of the speech collected with Voice Race could be automatically transcribed with over 98% accuracy, and that an additional 49% could be labeled cheaply by Amazon Mechanical Turk workers. We demonstrate the utility of the self-labeled speech in an acoustic model adaptation task, which resulted in a reduction in the Voice Race utterance error rate. The collected utterances cover a wide variety of vocabulary and should be useful across a range of research.
Cite as: McGraw, I., Gruenstein, A., Sutherland, A. (2009) A self-labeling speech corpus: collecting spoken words with an online educational game. Proc. Interspeech 2009, 3031-3034, doi: 10.21437/Interspeech.2009-561
@inproceedings{mcgraw09_interspeech,
  author={Ian McGraw and Alexander Gruenstein and Andrew Sutherland},
  title={{A self-labeling speech corpus: collecting spoken words with an online educational game}},
  year=2009,
  booktitle={Proc. Interspeech 2009},
  pages={3031--3034},
  doi={10.21437/Interspeech.2009-561}
}