ISCA Archive Interspeech 2013

Web data harvesting for speech understanding grammar induction

Ioannis Klasinas, Alexandros Potamianos, Elias Iosif, Spiros Georgiladakis, Gianluca Mameli

The development of a speech understanding grammar for spoken dialogue systems can be greatly accelerated by using an in-domain corpus. The development of such a corpus, however, is a slow and expensive process. This paper proposes unsupervised, language-agnostic methods for finding relevant corpora on the web and mining the most informative parts. We show that by utilizing perplexity we are able to increase the in-domainness (precision) of the mined corpora, while by utilizing pragmatic constraints and search engine rank we can increase the generalizability (recall). We show that automatic grammar induction algorithms achieve superior performance on the automatically mined corpora compared to in-domain manually collected corpora for a travel application.
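
As a rough illustration of the perplexity-based selection mentioned in the abstract, the sketch below scores candidate web sentences against a small in-domain seed corpus and keeps the ones with low perplexity. It is not the authors' implementation: the add-one-smoothed unigram model, the function names (`train_unigram`, `perplexity`, `filter_by_perplexity`), the example sentences, and the threshold value are all assumptions chosen to keep the example self-contained. A real system would more likely use a higher-order language model.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count lowercase whitespace tokens in a seed corpus (assumed tokenization)."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return counts, sum(counts.values())


def perplexity(sentence, counts, total):
    """Per-token perplexity under an add-one-smoothed unigram model."""
    tokens = sentence.lower().split()
    if not tokens:
        return float("inf")
    vocab = len(counts) + 1  # one extra slot reserved for unseen words
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab)) for t in tokens
    )
    return math.exp(-log_prob / len(tokens))


def filter_by_perplexity(candidates, counts, total, threshold):
    """Keep candidates whose perplexity does not exceed the threshold."""
    return [s for s in candidates if perplexity(s, counts, total) <= threshold]


# Hypothetical travel-domain seed corpus and web candidates.
seed = ["i want to fly to boston", "book a flight to denver"]
candidates = [
    "i want a flight to boston",
    "quarterly earnings rose sharply",
]

counts, total = train_unigram(seed)
kept = filter_by_perplexity(candidates, counts, total, threshold=15.0)
print(kept)
```

Lowering the threshold trades recall for precision: fewer sentences survive, but those that do look more like the seed corpus. The abstract attributes the recall side of this trade-off to pragmatic constraints and search engine rank, which this sketch does not model.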


doi: 10.21437/Interspeech.2013-627

Cite as: Klasinas, I., Potamianos, A., Iosif, E., Georgiladakis, S., Mameli, G. (2013) Web data harvesting for speech understanding grammar induction. Proc. Interspeech 2013, 2733-2737, doi: 10.21437/Interspeech.2013-627

@inproceedings{klasinas13_interspeech,
  author={Ioannis Klasinas and Alexandros Potamianos and Elias Iosif and Spiros Georgiladakis and Gianluca Mameli},
  title={{Web data harvesting for speech understanding grammar induction}},
  year=2013,
  booktitle={Proc. Interspeech 2013},
  pages={2733--2737},
  doi={10.21437/Interspeech.2013-627}
}