12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Training a Language Model Using Webdata for Large Vocabulary Japanese Spontaneous Speech Recognition

Ryo Masumura, Seongjun Hahm, Akinori Ito

Tohoku University, Japan

This paper describes a language modeling method for spontaneous speech recognition that uses large-scale spoken language data retrieved from the Web. We downloaded 15 million Web pages covering a comprehensive range of topics. Spoken language-like texts were then selected from the downloaded Web data using a naive Bayes classifier, and typical linguistic phenomena of spontaneous speech, such as fillers and pauses, were added using simulation models. A language model trained on the generated data performed as well as one trained on a large-scale spontaneous speech corpus (the Corpus of Spontaneous Japanese, CSJ). Combining the generated data with CSJ improved word accuracy.
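
The selection and augmentation steps named in the abstract can be illustrated with a small sketch. The abstract does not specify the classifier's features or the form of the simulation models, so everything below is an assumption made for illustration: word-unigram features with add-one smoothing, a fixed per-word filler probability, toy English tokens in place of Japanese text, and hypothetical function names (`train_nb`, `is_spoken_like`, `insert_fillers`). It is not the authors' implementation.

```python
import math
import random
from collections import Counter


def train_nb(spoken, written):
    """Collect per-class word counts, the joint vocabulary, and class priors."""
    counts = {"spoken": Counter(), "written": Counter()}
    for sent in spoken:
        counts["spoken"].update(sent)
    for sent in written:
        counts["written"].update(sent)
    vocab = set(counts["spoken"]) | set(counts["written"])
    n_sents = len(spoken) + len(written)
    priors = {"spoken": len(spoken) / n_sents, "written": len(written) / n_sents}
    return counts, vocab, priors


def log_score(sent, label, model):
    """Log prior plus add-one-smoothed unigram log likelihoods for one class."""
    counts, vocab, priors = model
    total = sum(counts[label].values())
    score = math.log(priors[label])
    for word in sent:
        if word in vocab:  # assumption: out-of-vocabulary words are ignored
            score += math.log((counts[label][word] + 1) / (total + len(vocab)))
    return score


def is_spoken_like(sent, model):
    """Keep a sentence if the spoken class scores higher than the written class."""
    return log_score(sent, "spoken", model) > log_score(sent, "written", model)


def insert_fillers(sent, fillers, prob, rng):
    """Insert a randomly chosen filler before each word with probability prob."""
    out = []
    for word in sent:
        if rng.random() < prob:
            out.append(rng.choice(fillers))
        out.append(word)
    return out


# Toy training data (hypothetical); a real setup would use tokenized Japanese text.
spoken_train = [["well", "i", "think", "so"], ["yeah", "it", "is", "nice"]]
written_train = [["the", "results", "are", "shown"],
                 ["this", "paper", "describes", "the", "method"]]
model = train_nb(spoken_train, written_train)

web_sentences = [["i", "think", "it", "is", "nice"],
                 ["the", "paper", "describes", "results"]]
selected = [s for s in web_sentences if is_spoken_like(s, model)]
augmented = [insert_fillers(s, ["eto", "ano"], 0.2, random.Random(0))
             for s in selected]
```

In this sketch, `selected` keeps only the sentences that the classifier judges closer to the spoken training data, and `augmented` is the filler-enriched text that would then be passed to an n-gram language model trainer. Pause insertion could be handled the same way with a pause token, again as an assumption.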


Bibliographic reference. Masumura, Ryo / Hahm, Seongjun / Ito, Akinori (2011): "Training a language model using webdata for large vocabulary Japanese spontaneous speech recognition", in INTERSPEECH-2011, 1465-1468.