14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Cross-Domain Paraphrasing for Improving Language Modelling Using Out-of-Domain Data

X. Liu, M. J. F. Gales, P. C. Woodland

University of Cambridge, UK

In natural languages the variability in the underlying linguistic generation rules significantly alters the observed surface word sequence they create, and thus introduces a mismatch against other data generated via alternative realizations associated with, for example, a different domain. Hence, direct modelling of out-of-domain data can result in poor generalization to the in-domain data of interest. To handle this problem, this paper investigated using cross-domain paraphrastic language models to improve in-domain language modelling (LM) using out-of-domain data. Phrase level paraphrase models learnt from each domain were used to generate paraphrase variants for the data of other domains. These were used to both improve the context coverage of in-domain data, and reduce the domain mismatch of the out-of-domain data. Significant error rate reduction of 0.6% absolute was obtained on a state-of-the-art conversational telephone speech recognition task using a cross-domain paraphrastic multi-level LM trained on a billion words of mixed conversational and broadcast news data. Consistent improvements on the in-domain data context coverage were also obtained.

Bibliographic reference.  Liu, X. / Gales, M. J. F. / Woodland, P. C. (2013): "Cross-domain paraphrasing for improving language modelling using out-of-domain data", In INTERSPEECH-2013, 3424-3428.