Over the past few decades, research in automatic speech recognition and automatic speaker recognition has been greatly facilitated by the sharing of large annotated speech databases, such as those distributed by the Linguistic Data Consortium (LDC). Open sources, particularly web sites such as YouTube, contain vast and varied speech recordings in many languages. However, these "open sources" of speech data remain largely untapped as resources for speech research. This paper describes a project to collect, organize, and annotate a large collection of such speech data. The data consist of approximately 30 hours of speech in each of three languages: English, Mandarin Chinese, and Russian. Each of 900 recordings has been orthographically transcribed at the sentence/phrase level by human listeners. Some of the issues that arise in working with this low-quality, varied, noisy speech data in three languages are also described.
Bibliographic reference: Zahorian, Stephen A. / Wu, Jiang / Karnjanadecha, Montri / SekharVootkuri, Chandra / Wong, Brian / Hwang, Andrew / Tokhtamyshev, Eldar (2011): "Open source multi-language audio database for spoken language processing applications", in INTERSPEECH-2011, 1493-1496.